* [PATCH v2 0/5] mm: NUMA stats code cleanup and enhancement
@ 2017-12-19  6:39 ` Kemi Wang
  0 siblings, 0 replies; 68+ messages in thread
From: Kemi Wang @ 2017-12-19 6:39 UTC (permalink / raw)
To: Greg Kroah-Hartman, Andrew Morton, Michal Hocko, Vlastimil Babka,
    Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
    Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
    Sebastian Andrzej Siewior
Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang,
    Aaron Lu, Aubrey Li, Kemi Wang, Linux MM, Linux Kernel

The existing implementation of NUMA counters keeps per-logical-CPU
counters along with zone->vm_numa_stat[], separated by zone, plus a global
numa counter array vm_numa_stat[]. However, unlike the other vmstat
counters, NUMA stats do not affect the system's decisions and are only
consumed when reading from /proc and /sys. Also, nodes usually have only a
single zone (node 0 being the exception), and there is no real use case
that needs these hit counts separated by zone.

Therefore, we can migrate the implementation of NUMA stats from per-zone
to per-node (as suggested by Andi Kleen), and reuse the existing per-cpu
infrastructure, with a small enhancement, for NUMA stats. In this way, we
get rid of the special-cased code for NUMA stats and keep the performance
gain at the same time. With this patch series, about 170 lines of code can
be saved.

The first patch migrates NUMA stats from per-zone to per-node using the
existing per-cpu infrastructure. There is a small user-visible change when
reading /proc/zoneinfo, shown below:

Before:
  Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 7244
      nr_active_anon   177064
      ...
      nr_bounce        0
      nr_free_cma      0
      numa_hit         0
      numa_miss        0
      numa_foreign     0
      numa_interleave  0
      numa_local       0
      numa_other       0

After:
  Node 0, zone      DMA
  per-node stats
      numa_hit         98665086
      numa_miss        0
      numa_foreign     0
      numa_interleave  21059
      numa_local       98665086
      numa_other       0
      nr_inactive_anon 20055
      nr_active_anon   389771
      ...
      nr_bounce        0
      nr_free_cma      0

The second patch extends the local cpu counter vm_node_stat_diff from s8
to s16. It does not have any functional change.

The third patch uses a large, constant threshold for NUMA counters to
reduce the update frequency of the global NUMA counters.

The fourth patch uses node_page_state_snapshot instead of node_page_state
when querying a node's stats (e.g. cat
/sys/devices/system/node/node*/vmstat). The only difference is that
node_page_state_snapshot also includes the stat values still sitting in
the local per-cpu counters.

The last patch renames zone_statistics() to numa_statistics().

Finally, I want to extend my heartiest appreciation to Michal Hocko for
his suggestion of reusing the existing per-cpu infrastructure, which made
this series much better than before.

Changelog:
v1->v2:
  a) enhance the existing per-cpu infrastructure for node page stats by
     extending the local cpu counters vm_node_stat_diff from s8 to s16
  b) reuse the per-cpu infrastructure for NUMA stats

Kemi Wang (5):
  mm: migrate NUMA stats from per-zone to per-node
  mm: Extends local cpu counter vm_diff_nodestat from s8 to s16
  mm: enlarge NUMA counters threshold size
  mm: use node_page_state_snapshot to avoid deviation
  mm: Rename zone_statistics() to numa_statistics()

 drivers/base/node.c    |  28 +++----
 include/linux/mmzone.h |  31 ++++----
 include/linux/vmstat.h |  31 --------
 mm/mempolicy.c         |   2 +-
 mm/page_alloc.c        |  22 +++---
 mm/vmstat.c            | 206 +++++++++----------------------------------
 6 files changed, 74 insertions(+), 246 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 68+ messages in thread
* [PATCH v2 1/5] mm: migrate NUMA stats from per-zone to per-node
  2017-12-19  6:39 ` Kemi Wang
@ 2017-12-19  6:39   ` Kemi Wang
  -1 siblings, 0 replies; 68+ messages in thread
From: Kemi Wang @ 2017-12-19 6:39 UTC (permalink / raw)
To: Greg Kroah-Hartman, Andrew Morton, Michal Hocko, Vlastimil Babka,
    Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
    Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
    Sebastian Andrzej Siewior
Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang,
    Aaron Lu, Aubrey Li, Kemi Wang, Linux MM, Linux Kernel

There is not really any use for NUMA stats separated by zone, and the
current per-zone NUMA stats are only consumed in /proc/zoneinfo. For code
cleanup purposes, move the NUMA stats from per-zone to per-node and reuse
the existing per-cpu infrastructure.

Suggested-by: Andi Kleen <ak@linux.intel.com>
Suggested-by: Michal Hocko <mhocko@kernel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 drivers/base/node.c    |  23 +++----
 include/linux/mmzone.h |  27 ++++----
 include/linux/vmstat.h |  31 ---------
 mm/mempolicy.c         |   2 +-
 mm/page_alloc.c        |  16 +++--
 mm/vmstat.c            | 177 +++++-------------------------------------
 6 files changed, 46 insertions(+), 230 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ee090ab..a045ea1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -169,13 +169,14 @@ static ssize_t node_read_numastat(struct device *dev,
 		       "interleave_hit %lu\n"
 		       "local_node %lu\n"
 		       "other_node %lu\n",
-		       sum_zone_numa_state(dev->id, NUMA_HIT),
-		       sum_zone_numa_state(dev->id, NUMA_MISS),
-		       sum_zone_numa_state(dev->id, NUMA_FOREIGN),
-		       sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
-		       sum_zone_numa_state(dev->id, NUMA_LOCAL),
-		       sum_zone_numa_state(dev->id, NUMA_OTHER));
+		       node_page_state(NODE_DATA(dev->id), NUMA_HIT),
+		       node_page_state(NODE_DATA(dev->id), NUMA_MISS),
+		       node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
+		       node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
+		       node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
+		       node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
 }
+
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
 static ssize_t node_read_vmstat(struct device *dev,
@@ -190,17 +191,9 @@ static ssize_t node_read_vmstat(struct device *dev,
 		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
 			     sum_zone_node_page_state(nid, i));
 
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		n += sprintf(buf+n, "%s %lu\n",
-			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-			     sum_zone_numa_state(nid, i));
-#endif
-
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 		n += sprintf(buf+n, "%s %lu\n",
-			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
-			     NR_VM_NUMA_STAT_ITEMS],
+			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
 			     node_page_state(pgdat, i));
 
 	return n;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c..c06d880 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,20 +115,6 @@ struct zone_padding {
 #define ZONE_PADDING(name)
 #endif
 
-#ifdef CONFIG_NUMA
-enum numa_stat_item {
-	NUMA_HIT,		/* allocated in intended node */
-	NUMA_MISS,		/* allocated in non intended node */
-	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
-	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
-	NUMA_LOCAL,		/* allocation from local node */
-	NUMA_OTHER,		/* allocation from other node */
-	NR_VM_NUMA_STAT_ITEMS
-};
-#else
-#define NR_VM_NUMA_STAT_ITEMS 0
-#endif
-
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
@@ -151,7 +137,18 @@ enum zone_stat_item {
 	NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
-	NR_LRU_BASE,
+#ifdef CONFIG_NUMA
+	NUMA_HIT,		/* allocated in intended node */
+	NUMA_MISS,		/* allocated in non intended node */
+	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
+	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
+	NUMA_LOCAL,		/* allocation from local node */
+	NUMA_OTHER,		/* allocation from other node */
+	NR_VM_NUMA_STAT_ITEMS,
+#else
+#define NR_VM_NUMA_STAT_ITEMS 0
+#endif
+	NR_LRU_BASE = NR_VM_NUMA_STAT_ITEMS,
 	NR_INACTIVE_ANON = NR_LRU_BASE,	/* must match order of LRU_[IN]ACTIVE */
 	NR_ACTIVE_ANON,		/*  "     "     "   "       "         */
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1779c98..80bf290 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -118,37 +118,8 @@ static inline void vm_events_fold_cpu(int cpu)
  * Zone and node-based page accounting with per cpu differentials.
  */
 extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
-extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
 extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
 
-#ifdef CONFIG_NUMA
-static inline void zone_numa_state_add(long x, struct zone *zone,
-				 enum numa_stat_item item)
-{
-	atomic_long_add(x, &zone->vm_numa_stat[item]);
-	atomic_long_add(x, &vm_numa_stat[item]);
-}
-
-static inline unsigned long global_numa_state(enum numa_stat_item item)
-{
-	long x = atomic_long_read(&vm_numa_stat[item]);
-
-	return x;
-}
-
-static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
-					enum numa_stat_item item)
-{
-	long x = atomic_long_read(&zone->vm_numa_stat[item]);
-	int cpu;
-
-	for_each_online_cpu(cpu)
-		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
-
-	return x;
-}
-#endif /* CONFIG_NUMA */
-
 static inline void zone_page_state_add(long x, struct zone *zone,
 				 enum zone_stat_item item)
 {
@@ -234,10 +205,8 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
 
 #ifdef CONFIG_NUMA
-extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item);
 extern unsigned long sum_zone_node_page_state(int node,
 					      enum zone_stat_item item);
-extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
 extern unsigned long node_page_state(struct pglist_data *pgdat,
 						enum node_stat_item item);
 #else
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ce44d3..b2293e3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1920,7 +1920,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 		return page;
 	if (page && page_to_nid(page) == nid) {
 		preempt_disable();
-		__inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
+		inc_node_state(page_pgdat(page), NUMA_INTERLEAVE_HIT);
 		preempt_enable();
 	}
 	return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7e5e775..81e8d8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2793,22 +2793,24 @@ int __isolate_free_page(struct page *page, unsigned int order)
 static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 {
 #ifdef CONFIG_NUMA
-	enum numa_stat_item local_stat = NUMA_LOCAL;
+	int preferred_nid = preferred_zone->node;
+	int nid = z->node;
+	enum node_stat_item local_stat = NUMA_LOCAL;
 
 	/* skip numa counters update if numa stats is disabled */
 	if (!static_branch_likely(&vm_numa_stat_key))
 		return;
 
-	if (z->node != numa_node_id())
+	if (nid != numa_node_id())
 		local_stat = NUMA_OTHER;
 
-	if (z->node == preferred_zone->node)
-		__inc_numa_state(z, NUMA_HIT);
+	if (nid == preferred_nid)
+		inc_node_state(NODE_DATA(nid), NUMA_HIT);
 	else {
-		__inc_numa_state(z, NUMA_MISS);
-		__inc_numa_state(preferred_zone, NUMA_FOREIGN);
+		inc_node_state(NODE_DATA(nid), NUMA_MISS);
+		inc_node_state(NODE_DATA(preferred_nid), NUMA_FOREIGN);
 	}
-	__inc_numa_state(z, local_stat);
+	inc_node_state(NODE_DATA(nid), local_stat);
 #endif
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 40b2db6..1dd12ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,46 +30,44 @@
 
 #include "internal.h"
 
-#define NUMA_STATS_THRESHOLD (U16_MAX - 2)
-
 #ifdef CONFIG_NUMA
 int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
 
-/* zero numa counters within a zone */
-static void zero_zone_numa_counters(struct zone *zone)
+/* zero numa stats within a node */
+static void zero_node_numa_stats(int node)
 {
 	int item, cpu;
 
 	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
-		atomic_long_set(&zone->vm_numa_stat[item], 0);
+		atomic_long_set(&(NODE_DATA(node)->vm_stat[item]), 0);
 		for_each_online_cpu(cpu)
-			per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item]
-						= 0;
+			per_cpu_ptr(NODE_DATA(node)->per_cpu_nodestats,
+					cpu)->vm_node_stat_diff[item] = 0;
 	}
 }
 
-/* zero numa counters of all the populated zones */
-static void zero_zones_numa_counters(void)
+/* zero numa stats of all the online nodes */
+static void zero_nodes_numa_stats(void)
 {
-	struct zone *zone;
+	int node;
 
-	for_each_populated_zone(zone)
-		zero_zone_numa_counters(zone);
+	for_each_online_node(node)
+		zero_node_numa_stats(node);
 }
 
-/* zero global numa counters */
-static void zero_global_numa_counters(void)
+/* zero global numa stats */
+static void zero_global_numa_stats(void)
 {
 	int item;
 
 	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++)
-		atomic_long_set(&vm_numa_stat[item], 0);
+		atomic_long_set(&vm_node_stat[item], 0);
 }
 
 static void invalid_numa_statistics(void)
 {
-	zero_zones_numa_counters();
-	zero_global_numa_counters();
+	zero_nodes_numa_stats();
+	zero_global_numa_stats();
 }
 
 static DEFINE_MUTEX(vm_numa_stat_lock);
@@ -160,10 +158,8 @@ void vm_events_fold_cpu(int cpu)
  * vm_stat contains the global counters
 */
 atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
-atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp;
 atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
 EXPORT_SYMBOL(vm_zone_stat);
-EXPORT_SYMBOL(vm_numa_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
 #ifdef CONFIG_SMP
@@ -679,32 +675,6 @@ EXPORT_SYMBOL(dec_node_page_state);
  * Fold a differential into the global counters.
  * Returns the number of counters updated.
  */
-#ifdef CONFIG_NUMA
-static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
-{
-	int i;
-	int changes = 0;
-
-	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (zone_diff[i]) {
-			atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
-			changes++;
-	}
-
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		if (numa_diff[i]) {
-			atomic_long_add(numa_diff[i], &vm_numa_stat[i]);
-			changes++;
-	}
-
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-		if (node_diff[i]) {
-			atomic_long_add(node_diff[i], &vm_node_stat[i]);
-			changes++;
-	}
-	return changes;
-}
-#else
 static int fold_diff(int *zone_diff, int *node_diff)
 {
 	int i;
@@ -723,7 +693,6 @@ static int fold_diff(int *zone_diff, int *node_diff)
 	}
 	return changes;
 }
-#endif /* CONFIG_NUMA */
 
 /*
  * Update the zone counters for the current cpu.
@@ -747,9 +716,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
-#ifdef CONFIG_NUMA
-	int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
-#endif
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
@@ -771,18 +737,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			}
 		}
 #ifdef CONFIG_NUMA
-		for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
-			int v;
-
-			v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0);
-			if (v) {
-
-				atomic_long_add(v, &zone->vm_numa_stat[i]);
-				global_numa_diff[i] += v;
-				__this_cpu_write(p->expire, 3);
-			}
-		}
-
 		if (do_pagesets) {
 			cond_resched();
 			/*
@@ -829,12 +783,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 		}
 	}
 
-#ifdef CONFIG_NUMA
-	changes += fold_diff(global_zone_diff, global_numa_diff,
-			     global_node_diff);
-#else
 	changes += fold_diff(global_zone_diff, global_node_diff);
-#endif
 	return changes;
 }
 
@@ -849,9 +798,6 @@ void cpu_vm_stats_fold(int cpu)
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
-#ifdef CONFIG_NUMA
-	int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
-#endif
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 
 	for_each_populated_zone(zone) {
@@ -868,18 +814,6 @@ void cpu_vm_stats_fold(int cpu)
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
-
-#ifdef CONFIG_NUMA
-		for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-			if (p->vm_numa_stat_diff[i]) {
-				int v;
-
-				v = p->vm_numa_stat_diff[i];
-				p->vm_numa_stat_diff[i] = 0;
-				atomic_long_add(v, &zone->vm_numa_stat[i]);
-				global_numa_diff[i] += v;
-			}
-#endif
 	}
 
 	for_each_online_pgdat(pgdat) {
@@ -898,11 +832,7 @@ void cpu_vm_stats_fold(int cpu)
 		}
 	}
 
-#ifdef CONFIG_NUMA
-	fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
-#else
 	fold_diff(global_zone_diff, global_node_diff);
-#endif
 }
 
 /*
@@ -920,36 +850,10 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
 			atomic_long_add(v, &zone->vm_stat[i]);
 			atomic_long_add(v, &vm_zone_stat[i]);
 		}
-
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		if (pset->vm_numa_stat_diff[i]) {
-			int v = pset->vm_numa_stat_diff[i];
-
-			pset->vm_numa_stat_diff[i] = 0;
-			atomic_long_add(v, &zone->vm_numa_stat[i]);
-			atomic_long_add(v, &vm_numa_stat[i]);
-		}
-#endif
 }
 #endif
 
 #ifdef CONFIG_NUMA
-void __inc_numa_state(struct zone *zone,
-				 enum numa_stat_item item)
-{
-	struct per_cpu_pageset __percpu *pcp = zone->pageset;
-	u16 __percpu *p = pcp->vm_numa_stat_diff + item;
-	u16 v;
-
-	v = __this_cpu_inc_return(*p);
-
-	if (unlikely(v > NUMA_STATS_THRESHOLD)) {
-		zone_numa_state_add(v, zone, item);
-		__this_cpu_write(*p, 0);
-	}
-}
-
 /*
  * Determine the per node value of a stat item. This function
  * is called frequently in a NUMA machine, so try to be as
@@ -969,23 +873,6 @@ unsigned long sum_zone_node_page_state(int node,
 }
 
 /*
- * Determine the per node value of a numa stat item. To avoid deviation,
- * the per cpu stat number in vm_numa_stat_diff[] is also included.
- */
-unsigned long sum_zone_numa_state(int node,
-				 enum numa_stat_item item)
-{
-	struct zone *zones = NODE_DATA(node)->node_zones;
-	int i;
-	unsigned long count = 0;
-
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		count += zone_numa_state_snapshot(zones + i, item);
-
-	return count;
-}
-
-/*
  * Determine the per node value of a stat item.
  */
 unsigned long node_page_state(struct pglist_data *pgdat,
@@ -1569,8 +1456,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		seq_printf(m, "\n  per-node stats");
 		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 			seq_printf(m, "\n      %-12s %lu",
-				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS +
-				NR_VM_NUMA_STAT_ITEMS],
+				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
 				node_page_state(pgdat, i));
 		}
 	}
@@ -1607,13 +1493,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
 				zone_page_state(zone, i));
 
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		seq_printf(m, "\n      %-12s %lu",
-			   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
-			   zone_numa_state_snapshot(zone, i));
-#endif
-
 	seq_printf(m, "\n  pagesets");
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
@@ -1688,7 +1567,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	if (*pos >= ARRAY_SIZE(vmstat_text))
 		return NULL;
 	stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) +
-			  NR_VM_NUMA_STAT_ITEMS * sizeof(unsigned long) +
 			  NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) +
 			  NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long);
 
@@ -1704,12 +1582,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 		v[i] = global_zone_page_state(i);
 	v += NR_VM_ZONE_STAT_ITEMS;
 
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		v[i] = global_numa_state(i);
-	v += NR_VM_NUMA_STAT_ITEMS;
-#endif
-
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 		v[i] = global_node_page_state(i);
 	v += NR_VM_NODE_STAT_ITEMS;
@@ -1811,16 +1683,6 @@ int vmstat_refresh(struct ctl_table *table, int write,
 			err = -EINVAL;
 		}
 	}
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
-		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < 0) {
-			pr_warn("%s: %s %ld\n",
-				__func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val);
-			err = -EINVAL;
-		}
-	}
-#endif
 	if (err)
 		return err;
 	if (write)
@@ -1862,9 +1724,6 @@ static bool need_update(int cpu)
 		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
 
 		BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
-#ifdef CONFIG_NUMA
-		BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2);
-#endif
 
 		/*
		 * The fast way of checking if there are any vmstat diffs.
		 */
		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
			return true;
-#ifdef CONFIG_NUMA
-		if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS))
-			return true;
-#endif
	}
	return false;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 68+ messages in thread
= { 0, }; for_each_populated_zone(zone) { @@ -868,18 +814,6 @@ void cpu_vm_stats_fold(int cpu) atomic_long_add(v, &zone->vm_stat[i]); global_zone_diff[i] += v; } - -#ifdef CONFIG_NUMA - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) - if (p->vm_numa_stat_diff[i]) { - int v; - - v = p->vm_numa_stat_diff[i]; - p->vm_numa_stat_diff[i] = 0; - atomic_long_add(v, &zone->vm_numa_stat[i]); - global_numa_diff[i] += v; - } -#endif } for_each_online_pgdat(pgdat) { @@ -898,11 +832,7 @@ void cpu_vm_stats_fold(int cpu) } } -#ifdef CONFIG_NUMA - fold_diff(global_zone_diff, global_numa_diff, global_node_diff); -#else fold_diff(global_zone_diff, global_node_diff); -#endif } /* @@ -920,36 +850,10 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) atomic_long_add(v, &zone->vm_stat[i]); atomic_long_add(v, &vm_zone_stat[i]); } - -#ifdef CONFIG_NUMA - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) - if (pset->vm_numa_stat_diff[i]) { - int v = pset->vm_numa_stat_diff[i]; - - pset->vm_numa_stat_diff[i] = 0; - atomic_long_add(v, &zone->vm_numa_stat[i]); - atomic_long_add(v, &vm_numa_stat[i]); - } -#endif } #endif #ifdef CONFIG_NUMA -void __inc_numa_state(struct zone *zone, - enum numa_stat_item item) -{ - struct per_cpu_pageset __percpu *pcp = zone->pageset; - u16 __percpu *p = pcp->vm_numa_stat_diff + item; - u16 v; - - v = __this_cpu_inc_return(*p); - - if (unlikely(v > NUMA_STATS_THRESHOLD)) { - zone_numa_state_add(v, zone, item); - __this_cpu_write(*p, 0); - } -} - /* * Determine the per node value of a stat item. This function * is called frequently in a NUMA machine, so try to be as @@ -969,23 +873,6 @@ unsigned long sum_zone_node_page_state(int node, } /* - * Determine the per node value of a numa stat item. To avoid deviation, - * the per cpu stat number in vm_numa_stat_diff[] is also included. 
- */ -unsigned long sum_zone_numa_state(int node, - enum numa_stat_item item) -{ - struct zone *zones = NODE_DATA(node)->node_zones; - int i; - unsigned long count = 0; - - for (i = 0; i < MAX_NR_ZONES; i++) - count += zone_numa_state_snapshot(zones + i, item); - - return count; -} - -/* * Determine the per node value of a stat item. */ unsigned long node_page_state(struct pglist_data *pgdat, @@ -1569,8 +1456,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, seq_printf(m, "\n per-node stats"); for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { seq_printf(m, "\n %-12s %lu", - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS + - NR_VM_NUMA_STAT_ITEMS], + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], node_page_state(pgdat, i)); } } @@ -1607,13 +1493,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, seq_printf(m, "\n %-12s %lu", vmstat_text[i], zone_page_state(zone, i)); -#ifdef CONFIG_NUMA - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) - seq_printf(m, "\n %-12s %lu", - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], - zone_numa_state_snapshot(zone, i)); -#endif - seq_printf(m, "\n pagesets"); for_each_online_cpu(i) { struct per_cpu_pageset *pageset; @@ -1688,7 +1567,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) if (*pos >= ARRAY_SIZE(vmstat_text)) return NULL; stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) + - NR_VM_NUMA_STAT_ITEMS * sizeof(unsigned long) + NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) + NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long); @@ -1704,12 +1582,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) v[i] = global_zone_page_state(i); v += NR_VM_ZONE_STAT_ITEMS; -#ifdef CONFIG_NUMA - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) - v[i] = global_numa_state(i); - v += NR_VM_NUMA_STAT_ITEMS; -#endif - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) v[i] = global_node_page_state(i); v += NR_VM_NODE_STAT_ITEMS; @@ -1811,16 +1683,6 @@ int vmstat_refresh(struct ctl_table *table, int write, err 
= -EINVAL; } } -#ifdef CONFIG_NUMA - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { - val = atomic_long_read(&vm_numa_stat[i]); - if (val < 0) { - pr_warn("%s: %s %ld\n", - __func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val); - err = -EINVAL; - } - } -#endif if (err) return err; if (write) @@ -1862,9 +1724,6 @@ static bool need_update(int cpu) struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu); BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1); -#ifdef CONFIG_NUMA - BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2); -#endif /* * The fast way of checking if there are any vmstat diffs. @@ -1872,10 +1731,6 @@ static bool need_update(int cpu) */ if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS)) return true; -#ifdef CONFIG_NUMA - if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS)) - return true; -#endif } return false; } -- 2.7.4
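The zone_statistics() hunk in the patch above reclassifies every allocation against the preferred *node* instead of the preferred zone. The classification itself can be sketched as a small user-space model; note this is illustrative only (the function name `account_alloc` and the `stats[][]` array are invented for the sketch; the kernel updates per-cpu node counters via inc_node_state() on NODE_DATA(nid), exactly as the diff shows):

```c
#include <assert.h>

/* Illustrative model of the NUMA counters bumped per allocation. */
enum numa_stat { NUMA_HIT, NUMA_MISS, NUMA_FOREIGN, NUMA_INTERLEAVE_HIT,
		 NUMA_LOCAL, NUMA_OTHER, NR_STATS };

/*
 * preferred_nid: node the allocator wanted
 * alloc_nid:     node the page actually came from
 * running_nid:   node of the CPU doing the allocation
 */
static void account_alloc(unsigned long stats[][NR_STATS],
			  int preferred_nid, int alloc_nid, int running_nid)
{
	/* local/other: did the page come from the allocating CPU's node? */
	enum numa_stat local_stat =
		(alloc_nid == running_nid) ? NUMA_LOCAL : NUMA_OTHER;

	if (alloc_nid == preferred_nid) {
		stats[alloc_nid][NUMA_HIT]++;		/* got the intended node */
	} else {
		stats[alloc_nid][NUMA_MISS]++;		/* landed elsewhere... */
		stats[preferred_nid][NUMA_FOREIGN]++;	/* ...intended node missed out */
	}
	stats[alloc_nid][local_stat]++;
}
```

With per-node granularity, hit/miss/foreign need only the two node ids, which is why the patch can drop the zone argument entirely and index NODE_DATA() directly.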
* Re: [PATCH v2 1/5] mm: migrate NUMA stats from per-zone to per-node 2017-12-19 6:39 ` Kemi Wang @ 2017-12-19 12:28 ` Michal Hocko 0 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2017-12-19 12:28 UTC (permalink / raw) To: Kemi Wang Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Tue 19-12-17 14:39:22, Kemi Wang wrote: > There is not really any use to get NUMA stats separated by zone, and > current per-zone NUMA stats is only consumed in /proc/zoneinfo. For code > cleanup purpose, we move NUMA stats from per-zone to per-node and reuse the > existed per-cpu infrastructure. Let's hope that nobody really depends on the per-zone numbers. It would be really strange as those counters are inherently per-node and that is what users should care about, but who knows... Anyway, I hoped we could get rid of NR_VM_NUMA_STAT_ITEMS, but your patch keeps it and follow-up patches even use it further. I will comment on those separately, but this still makes these few counters really special, which I think is wrong. > Suggested-by: Andi Kleen <ak@linux.intel.com> > Suggested-by: Michal Hocko <mhocko@kernel.com> > Signed-off-by: Kemi Wang <kemi.wang@intel.com> I have to fully grasp the rest of the series before I give my Ack, but I _really_ like the simplification this adds to the code. I believe it can be even simpler. 
> --- > drivers/base/node.c | 23 +++---- > include/linux/mmzone.h | 27 ++++---- > include/linux/vmstat.h | 31 --------- > mm/mempolicy.c | 2 +- > mm/page_alloc.c | 16 +++-- > mm/vmstat.c | 177 +++++-------------------------------------------- > 6 files changed, 46 insertions(+), 230 deletions(-) > > diff --git a/drivers/base/node.c b/drivers/base/node.c > index ee090ab..a045ea1 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -169,13 +169,14 @@ static ssize_t node_read_numastat(struct device *dev, > "interleave_hit %lu\n" > "local_node %lu\n" > "other_node %lu\n", > - sum_zone_numa_state(dev->id, NUMA_HIT), > - sum_zone_numa_state(dev->id, NUMA_MISS), > - sum_zone_numa_state(dev->id, NUMA_FOREIGN), > - sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT), > - sum_zone_numa_state(dev->id, NUMA_LOCAL), > - sum_zone_numa_state(dev->id, NUMA_OTHER)); > + node_page_state(NODE_DATA(dev->id), NUMA_HIT), > + node_page_state(NODE_DATA(dev->id), NUMA_MISS), > + node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN), > + node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT), > + node_page_state(NODE_DATA(dev->id), NUMA_LOCAL), > + node_page_state(NODE_DATA(dev->id), NUMA_OTHER)); > } > + > static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL); > > static ssize_t node_read_vmstat(struct device *dev, > @@ -190,17 +191,9 @@ static ssize_t node_read_vmstat(struct device *dev, > n += sprintf(buf+n, "%s %lu\n", vmstat_text[i], > sum_zone_node_page_state(nid, i)); > > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - n += sprintf(buf+n, "%s %lu\n", > - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], > - sum_zone_numa_state(nid, i)); > -#endif > - > for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > n += sprintf(buf+n, "%s %lu\n", > - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS + > - NR_VM_NUMA_STAT_ITEMS], > + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], > node_page_state(pgdat, i)); > > return n; > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > 
index 67f2e3c..c06d880 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -115,20 +115,6 @@ struct zone_padding { > #define ZONE_PADDING(name) > #endif > > -#ifdef CONFIG_NUMA > -enum numa_stat_item { > - NUMA_HIT, /* allocated in intended node */ > - NUMA_MISS, /* allocated in non intended node */ > - NUMA_FOREIGN, /* was intended here, hit elsewhere */ > - NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */ > - NUMA_LOCAL, /* allocation from local node */ > - NUMA_OTHER, /* allocation from other node */ > - NR_VM_NUMA_STAT_ITEMS > -}; > -#else > -#define NR_VM_NUMA_STAT_ITEMS 0 > -#endif > - > enum zone_stat_item { > /* First 128 byte cacheline (assuming 64 bit words) */ > NR_FREE_PAGES, > @@ -151,7 +137,18 @@ enum zone_stat_item { > NR_VM_ZONE_STAT_ITEMS }; > > enum node_stat_item { > - NR_LRU_BASE, > +#ifdef CONFIG_NUMA > + NUMA_HIT, /* allocated in intended node */ > + NUMA_MISS, /* allocated in non intended node */ > + NUMA_FOREIGN, /* was intended here, hit elsewhere */ > + NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */ > + NUMA_LOCAL, /* allocation from local node */ > + NUMA_OTHER, /* allocation from other node */ > + NR_VM_NUMA_STAT_ITEMS, > +#else > +#define NR_VM_NUMA_STAT_ITEMS 0 > +#endif > + NR_LRU_BASE = NR_VM_NUMA_STAT_ITEMS, > NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */ > NR_ACTIVE_ANON, /* " " " " " */ > NR_INACTIVE_FILE, /* " " " " " */ > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h > index 1779c98..80bf290 100644 > --- a/include/linux/vmstat.h > +++ b/include/linux/vmstat.h > @@ -118,37 +118,8 @@ static inline void vm_events_fold_cpu(int cpu) > * Zone and node-based page accounting with per cpu differentials. 
> */ > extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS]; > -extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS]; > extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS]; > > -#ifdef CONFIG_NUMA > -static inline void zone_numa_state_add(long x, struct zone *zone, > - enum numa_stat_item item) > -{ > - atomic_long_add(x, &zone->vm_numa_stat[item]); > - atomic_long_add(x, &vm_numa_stat[item]); > -} > - > -static inline unsigned long global_numa_state(enum numa_stat_item item) > -{ > - long x = atomic_long_read(&vm_numa_stat[item]); > - > - return x; > -} > - > -static inline unsigned long zone_numa_state_snapshot(struct zone *zone, > - enum numa_stat_item item) > -{ > - long x = atomic_long_read(&zone->vm_numa_stat[item]); > - int cpu; > - > - for_each_online_cpu(cpu) > - x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item]; > - > - return x; > -} > -#endif /* CONFIG_NUMA */ > - > static inline void zone_page_state_add(long x, struct zone *zone, > enum zone_stat_item item) > { > @@ -234,10 +205,8 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat, > > > #ifdef CONFIG_NUMA > -extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item); > extern unsigned long sum_zone_node_page_state(int node, > enum zone_stat_item item); > -extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item); > extern unsigned long node_page_state(struct pglist_data *pgdat, > enum node_stat_item item); > #else > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 4ce44d3..b2293e3 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1920,7 +1920,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order, > return page; > if (page && page_to_nid(page) == nid) { > preempt_disable(); > - __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT); > + inc_node_state(page_pgdat(page), NUMA_INTERLEAVE_HIT); > preempt_enable(); > } > return page; > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 
7e5e775..81e8d8f 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2793,22 +2793,24 @@ int __isolate_free_page(struct page *page, unsigned int order) > static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) > { > #ifdef CONFIG_NUMA > - enum numa_stat_item local_stat = NUMA_LOCAL; > + int preferred_nid = preferred_zone->node; > + int nid = z->node; > + enum node_stat_item local_stat = NUMA_LOCAL; > > /* skip numa counters update if numa stats is disabled */ > if (!static_branch_likely(&vm_numa_stat_key)) > return; > > - if (z->node != numa_node_id()) > + if (nid != numa_node_id()) > local_stat = NUMA_OTHER; > > - if (z->node == preferred_zone->node) > - __inc_numa_state(z, NUMA_HIT); > + if (nid == preferred_nid) > + inc_node_state(NODE_DATA(nid), NUMA_HIT); > else { > - __inc_numa_state(z, NUMA_MISS); > - __inc_numa_state(preferred_zone, NUMA_FOREIGN); > + inc_node_state(NODE_DATA(nid), NUMA_MISS); > + inc_node_state(NODE_DATA(preferred_nid), NUMA_FOREIGN); > } > - __inc_numa_state(z, local_stat); > + inc_node_state(NODE_DATA(nid), local_stat); > #endif > } > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 40b2db6..1dd12ae 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -30,46 +30,44 @@ > > #include "internal.h" > > -#define NUMA_STATS_THRESHOLD (U16_MAX - 2) > - > #ifdef CONFIG_NUMA > int sysctl_vm_numa_stat = ENABLE_NUMA_STAT; > > -/* zero numa counters within a zone */ > -static void zero_zone_numa_counters(struct zone *zone) > +/* zero numa stats within a node */ > +static void zero_node_numa_stats(int node) > { > int item, cpu; > > for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) { > - atomic_long_set(&zone->vm_numa_stat[item], 0); > + atomic_long_set(&(NODE_DATA(node)->vm_stat[item]), 0); > for_each_online_cpu(cpu) > - per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item] > - = 0; > + per_cpu_ptr(NODE_DATA(node)->per_cpu_nodestats, > + cpu)->vm_node_stat_diff[item] = 0; > } > } > > -/* zero numa counters of all 
the populated zones */ > -static void zero_zones_numa_counters(void) > +/* zero numa stats of all the online nodes */ > +static void zero_nodes_numa_stats(void) > { > - struct zone *zone; > + int node; > > - for_each_populated_zone(zone) > - zero_zone_numa_counters(zone); > + for_each_online_node(node) > + zero_node_numa_stats(node); > } > > -/* zero global numa counters */ > -static void zero_global_numa_counters(void) > +/* zero global numa stats */ > +static void zero_global_numa_stats(void) > { > int item; > > for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) > - atomic_long_set(&vm_numa_stat[item], 0); > + atomic_long_set(&vm_node_stat[item], 0); > } > > static void invalid_numa_statistics(void) > { > - zero_zones_numa_counters(); > - zero_global_numa_counters(); > + zero_nodes_numa_stats(); > + zero_global_numa_stats(); > } > > static DEFINE_MUTEX(vm_numa_stat_lock); > @@ -160,10 +158,8 @@ void vm_events_fold_cpu(int cpu) > * vm_stat contains the global counters > */ > atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp; > -atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp; > atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp; > EXPORT_SYMBOL(vm_zone_stat); > -EXPORT_SYMBOL(vm_numa_stat); > EXPORT_SYMBOL(vm_node_stat); > > #ifdef CONFIG_SMP > @@ -679,32 +675,6 @@ EXPORT_SYMBOL(dec_node_page_state); > * Fold a differential into the global counters. > * Returns the number of counters updated. 
> */ > -#ifdef CONFIG_NUMA > -static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff) > -{ > - int i; > - int changes = 0; > - > - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) > - if (zone_diff[i]) { > - atomic_long_add(zone_diff[i], &vm_zone_stat[i]); > - changes++; > - } > - > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - if (numa_diff[i]) { > - atomic_long_add(numa_diff[i], &vm_numa_stat[i]); > - changes++; > - } > - > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > - if (node_diff[i]) { > - atomic_long_add(node_diff[i], &vm_node_stat[i]); > - changes++; > - } > - return changes; > -} > -#else > static int fold_diff(int *zone_diff, int *node_diff) > { > int i; > @@ -723,7 +693,6 @@ static int fold_diff(int *zone_diff, int *node_diff) > } > return changes; > } > -#endif /* CONFIG_NUMA */ > > /* > * Update the zone counters for the current cpu. > @@ -747,9 +716,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets) > struct zone *zone; > int i; > int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; > -#ifdef CONFIG_NUMA > - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; > -#endif > int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; > int changes = 0; > > @@ -771,18 +737,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets) > } > } > #ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { > - int v; > - > - v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0); > - if (v) { > - > - atomic_long_add(v, &zone->vm_numa_stat[i]); > - global_numa_diff[i] += v; > - __this_cpu_write(p->expire, 3); > - } > - } > - > if (do_pagesets) { > cond_resched(); > /* > @@ -829,12 +783,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets) > } > } > > -#ifdef CONFIG_NUMA > - changes += fold_diff(global_zone_diff, global_numa_diff, > - global_node_diff); > -#else > changes += fold_diff(global_zone_diff, global_node_diff); > -#endif > return changes; > } > > @@ -849,9 +798,6 @@ void cpu_vm_stats_fold(int cpu) > struct zone *zone; > int i; > int 
global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; > -#ifdef CONFIG_NUMA > - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; > -#endif > int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; > > for_each_populated_zone(zone) { > @@ -868,18 +814,6 @@ void cpu_vm_stats_fold(int cpu) > atomic_long_add(v, &zone->vm_stat[i]); > global_zone_diff[i] += v; > } > - > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - if (p->vm_numa_stat_diff[i]) { > - int v; > - > - v = p->vm_numa_stat_diff[i]; > - p->vm_numa_stat_diff[i] = 0; > - atomic_long_add(v, &zone->vm_numa_stat[i]); > - global_numa_diff[i] += v; > - } > -#endif > } > > for_each_online_pgdat(pgdat) { > @@ -898,11 +832,7 @@ void cpu_vm_stats_fold(int cpu) > } > } > > -#ifdef CONFIG_NUMA > - fold_diff(global_zone_diff, global_numa_diff, global_node_diff); > -#else > fold_diff(global_zone_diff, global_node_diff); > -#endif > } > > /* > @@ -920,36 +850,10 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) > atomic_long_add(v, &zone->vm_stat[i]); > atomic_long_add(v, &vm_zone_stat[i]); > } > - > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - if (pset->vm_numa_stat_diff[i]) { > - int v = pset->vm_numa_stat_diff[i]; > - > - pset->vm_numa_stat_diff[i] = 0; > - atomic_long_add(v, &zone->vm_numa_stat[i]); > - atomic_long_add(v, &vm_numa_stat[i]); > - } > -#endif > } > #endif > > #ifdef CONFIG_NUMA > -void __inc_numa_state(struct zone *zone, > - enum numa_stat_item item) > -{ > - struct per_cpu_pageset __percpu *pcp = zone->pageset; > - u16 __percpu *p = pcp->vm_numa_stat_diff + item; > - u16 v; > - > - v = __this_cpu_inc_return(*p); > - > - if (unlikely(v > NUMA_STATS_THRESHOLD)) { > - zone_numa_state_add(v, zone, item); > - __this_cpu_write(*p, 0); > - } > -} > - > /* > * Determine the per node value of a stat item. 
This function > * is called frequently in a NUMA machine, so try to be as > @@ -969,23 +873,6 @@ unsigned long sum_zone_node_page_state(int node, > } > > /* > - * Determine the per node value of a numa stat item. To avoid deviation, > - * the per cpu stat number in vm_numa_stat_diff[] is also included. > - */ > -unsigned long sum_zone_numa_state(int node, > - enum numa_stat_item item) > -{ > - struct zone *zones = NODE_DATA(node)->node_zones; > - int i; > - unsigned long count = 0; > - > - for (i = 0; i < MAX_NR_ZONES; i++) > - count += zone_numa_state_snapshot(zones + i, item); > - > - return count; > -} > - > -/* > * Determine the per node value of a stat item. > */ > unsigned long node_page_state(struct pglist_data *pgdat, > @@ -1569,8 +1456,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, > seq_printf(m, "\n per-node stats"); > for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { > seq_printf(m, "\n %-12s %lu", > - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS + > - NR_VM_NUMA_STAT_ITEMS], > + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], > node_page_state(pgdat, i)); > } > } > @@ -1607,13 +1493,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, > seq_printf(m, "\n %-12s %lu", vmstat_text[i], > zone_page_state(zone, i)); > > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - seq_printf(m, "\n %-12s %lu", > - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], > - zone_numa_state_snapshot(zone, i)); > -#endif > - > seq_printf(m, "\n pagesets"); > for_each_online_cpu(i) { > struct per_cpu_pageset *pageset; > @@ -1688,7 +1567,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) > if (*pos >= ARRAY_SIZE(vmstat_text)) > return NULL; > stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) + > - NR_VM_NUMA_STAT_ITEMS * sizeof(unsigned long) + > NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) + > NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long); > > @@ -1704,12 +1582,6 @@ static void *vmstat_start(struct 
seq_file *m, loff_t *pos) > v[i] = global_zone_page_state(i); > v += NR_VM_ZONE_STAT_ITEMS; > > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - v[i] = global_numa_state(i); > - v += NR_VM_NUMA_STAT_ITEMS; > -#endif > - > for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > v[i] = global_node_page_state(i); > v += NR_VM_NODE_STAT_ITEMS; > @@ -1811,16 +1683,6 @@ int vmstat_refresh(struct ctl_table *table, int write, > err = -EINVAL; > } > } > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { > - val = atomic_long_read(&vm_numa_stat[i]); > - if (val < 0) { > - pr_warn("%s: %s %ld\n", > - __func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val); > - err = -EINVAL; > - } > - } > -#endif > if (err) > return err; > if (write) > @@ -1862,9 +1724,6 @@ static bool need_update(int cpu) > struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu); > > BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1); > -#ifdef CONFIG_NUMA > - BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2); > -#endif > > /* > * The fast way of checking if there are any vmstat diffs. > @@ -1872,10 +1731,6 @@ static bool need_update(int cpu) > */ > if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS)) > return true; > -#ifdef CONFIG_NUMA > - if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS)) > - return true; > -#endif > } > return false; > } > -- > 2.7.4 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
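The per-cpu differential scheme the series reuses (and that the follow-up patches tune: widening the diffs from s8 to s16, enlarging the fold threshold, and reading snapshots that include unfolded diffs) can be modeled in a few lines of user-space C. This is a toy sketch under stated assumptions: `struct counter`, `inc_stat`, and `snapshot` are invented names; the kernel does this with this_cpu ops on `vm_node_stat_diff[]` and `atomic_long_t` globals:

```c
#include <assert.h>

#define NCPUS 4

/* Toy analogue of a global vmstat counter plus its per-cpu diffs. */
struct counter {
	long global;		/* vm_node_stat[item] analogue */
	short diff[NCPUS];	/* vm_node_stat_diff[item], s16 after patch 2 */
};

/* Accumulate locally; fold into the global only past the threshold.
 * A larger threshold (patch 3) means fewer global atomic updates. */
static void inc_stat(struct counter *c, int cpu, int threshold)
{
	short v = ++c->diff[cpu];

	if (v > threshold) {
		c->global += v;
		c->diff[cpu] = 0;
	}
}

/* A snapshot read adds the not-yet-folded per-cpu diffs back in,
 * which is the rationale for node_page_state_snapshot in patch 4. */
static long snapshot(const struct counter *c)
{
	long x = c->global;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++)
		x += c->diff[cpu];
	return x;
}
```

The trade-off Michal's review circles around is visible here: a plain read of `global` lags by up to `threshold` per CPU, while `snapshot()` recovers the exact count at the cost of touching every CPU's diff.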
the populated zones */ > -static void zero_zones_numa_counters(void) > +/* zero numa stats of all the online nodes */ > +static void zero_nodes_numa_stats(void) > { > - struct zone *zone; > + int node; > > - for_each_populated_zone(zone) > - zero_zone_numa_counters(zone); > + for_each_online_node(node) > + zero_node_numa_stats(node); > } > > -/* zero global numa counters */ > -static void zero_global_numa_counters(void) > +/* zero global numa stats */ > +static void zero_global_numa_stats(void) > { > int item; > > for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) > - atomic_long_set(&vm_numa_stat[item], 0); > + atomic_long_set(&vm_node_stat[item], 0); > } > > static void invalid_numa_statistics(void) > { > - zero_zones_numa_counters(); > - zero_global_numa_counters(); > + zero_nodes_numa_stats(); > + zero_global_numa_stats(); > } > > static DEFINE_MUTEX(vm_numa_stat_lock); > @@ -160,10 +158,8 @@ void vm_events_fold_cpu(int cpu) > * vm_stat contains the global counters > */ > atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp; > -atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp; > atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp; > EXPORT_SYMBOL(vm_zone_stat); > -EXPORT_SYMBOL(vm_numa_stat); > EXPORT_SYMBOL(vm_node_stat); > > #ifdef CONFIG_SMP > @@ -679,32 +675,6 @@ EXPORT_SYMBOL(dec_node_page_state); > * Fold a differential into the global counters. > * Returns the number of counters updated. 
> */ > -#ifdef CONFIG_NUMA > -static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff) > -{ > - int i; > - int changes = 0; > - > - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) > - if (zone_diff[i]) { > - atomic_long_add(zone_diff[i], &vm_zone_stat[i]); > - changes++; > - } > - > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - if (numa_diff[i]) { > - atomic_long_add(numa_diff[i], &vm_numa_stat[i]); > - changes++; > - } > - > - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > - if (node_diff[i]) { > - atomic_long_add(node_diff[i], &vm_node_stat[i]); > - changes++; > - } > - return changes; > -} > -#else > static int fold_diff(int *zone_diff, int *node_diff) > { > int i; > @@ -723,7 +693,6 @@ static int fold_diff(int *zone_diff, int *node_diff) > } > return changes; > } > -#endif /* CONFIG_NUMA */ > > /* > * Update the zone counters for the current cpu. > @@ -747,9 +716,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets) > struct zone *zone; > int i; > int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; > -#ifdef CONFIG_NUMA > - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; > -#endif > int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; > int changes = 0; > > @@ -771,18 +737,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets) > } > } > #ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { > - int v; > - > - v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0); > - if (v) { > - > - atomic_long_add(v, &zone->vm_numa_stat[i]); > - global_numa_diff[i] += v; > - __this_cpu_write(p->expire, 3); > - } > - } > - > if (do_pagesets) { > cond_resched(); > /* > @@ -829,12 +783,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets) > } > } > > -#ifdef CONFIG_NUMA > - changes += fold_diff(global_zone_diff, global_numa_diff, > - global_node_diff); > -#else > changes += fold_diff(global_zone_diff, global_node_diff); > -#endif > return changes; > } > > @@ -849,9 +798,6 @@ void cpu_vm_stats_fold(int cpu) > struct zone *zone; > int i; > int 
global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; > -#ifdef CONFIG_NUMA > - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; > -#endif > int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; > > for_each_populated_zone(zone) { > @@ -868,18 +814,6 @@ void cpu_vm_stats_fold(int cpu) > atomic_long_add(v, &zone->vm_stat[i]); > global_zone_diff[i] += v; > } > - > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - if (p->vm_numa_stat_diff[i]) { > - int v; > - > - v = p->vm_numa_stat_diff[i]; > - p->vm_numa_stat_diff[i] = 0; > - atomic_long_add(v, &zone->vm_numa_stat[i]); > - global_numa_diff[i] += v; > - } > -#endif > } > > for_each_online_pgdat(pgdat) { > @@ -898,11 +832,7 @@ void cpu_vm_stats_fold(int cpu) > } > } > > -#ifdef CONFIG_NUMA > - fold_diff(global_zone_diff, global_numa_diff, global_node_diff); > -#else > fold_diff(global_zone_diff, global_node_diff); > -#endif > } > > /* > @@ -920,36 +850,10 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) > atomic_long_add(v, &zone->vm_stat[i]); > atomic_long_add(v, &vm_zone_stat[i]); > } > - > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - if (pset->vm_numa_stat_diff[i]) { > - int v = pset->vm_numa_stat_diff[i]; > - > - pset->vm_numa_stat_diff[i] = 0; > - atomic_long_add(v, &zone->vm_numa_stat[i]); > - atomic_long_add(v, &vm_numa_stat[i]); > - } > -#endif > } > #endif > > #ifdef CONFIG_NUMA > -void __inc_numa_state(struct zone *zone, > - enum numa_stat_item item) > -{ > - struct per_cpu_pageset __percpu *pcp = zone->pageset; > - u16 __percpu *p = pcp->vm_numa_stat_diff + item; > - u16 v; > - > - v = __this_cpu_inc_return(*p); > - > - if (unlikely(v > NUMA_STATS_THRESHOLD)) { > - zone_numa_state_add(v, zone, item); > - __this_cpu_write(*p, 0); > - } > -} > - > /* > * Determine the per node value of a stat item. 
This function > * is called frequently in a NUMA machine, so try to be as > @@ -969,23 +873,6 @@ unsigned long sum_zone_node_page_state(int node, > } > > /* > - * Determine the per node value of a numa stat item. To avoid deviation, > - * the per cpu stat number in vm_numa_stat_diff[] is also included. > - */ > -unsigned long sum_zone_numa_state(int node, > - enum numa_stat_item item) > -{ > - struct zone *zones = NODE_DATA(node)->node_zones; > - int i; > - unsigned long count = 0; > - > - for (i = 0; i < MAX_NR_ZONES; i++) > - count += zone_numa_state_snapshot(zones + i, item); > - > - return count; > -} > - > -/* > * Determine the per node value of a stat item. > */ > unsigned long node_page_state(struct pglist_data *pgdat, > @@ -1569,8 +1456,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, > seq_printf(m, "\n per-node stats"); > for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { > seq_printf(m, "\n %-12s %lu", > - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS + > - NR_VM_NUMA_STAT_ITEMS], > + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], > node_page_state(pgdat, i)); > } > } > @@ -1607,13 +1493,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, > seq_printf(m, "\n %-12s %lu", vmstat_text[i], > zone_page_state(zone, i)); > > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - seq_printf(m, "\n %-12s %lu", > - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], > - zone_numa_state_snapshot(zone, i)); > -#endif > - > seq_printf(m, "\n pagesets"); > for_each_online_cpu(i) { > struct per_cpu_pageset *pageset; > @@ -1688,7 +1567,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) > if (*pos >= ARRAY_SIZE(vmstat_text)) > return NULL; > stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) + > - NR_VM_NUMA_STAT_ITEMS * sizeof(unsigned long) + > NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) + > NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long); > > @@ -1704,12 +1582,6 @@ static void *vmstat_start(struct 
seq_file *m, loff_t *pos) > v[i] = global_zone_page_state(i); > v += NR_VM_ZONE_STAT_ITEMS; > > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) > - v[i] = global_numa_state(i); > - v += NR_VM_NUMA_STAT_ITEMS; > -#endif > - > for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) > v[i] = global_node_page_state(i); > v += NR_VM_NODE_STAT_ITEMS; > @@ -1811,16 +1683,6 @@ int vmstat_refresh(struct ctl_table *table, int write, > err = -EINVAL; > } > } > -#ifdef CONFIG_NUMA > - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { > - val = atomic_long_read(&vm_numa_stat[i]); > - if (val < 0) { > - pr_warn("%s: %s %ld\n", > - __func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val); > - err = -EINVAL; > - } > - } > -#endif > if (err) > return err; > if (write) > @@ -1862,9 +1724,6 @@ static bool need_update(int cpu) > struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu); > > BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1); > -#ifdef CONFIG_NUMA > - BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2); > -#endif > > /* > * The fast way of checking if there are any vmstat diffs. > @@ -1872,10 +1731,6 @@ static bool need_update(int cpu) > */ > if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS)) > return true; > -#ifdef CONFIG_NUMA > - if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS)) > - return true; > -#endif > } > return false; > } > -- > 2.7.4 > -- Michal Hocko SUSE Labs
* Re: [PATCH v2 1/5] mm: migrate NUMA stats from per-zone to per-node 2017-12-19 12:28 ` Michal Hocko @ 2017-12-20 5:32 ` kemi 0 siblings, 0 replies; 68+ messages in thread From: kemi @ 2017-12-20 5:32 UTC (permalink / raw) To: Michal Hocko Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On 2017-12-19 20:28, Michal Hocko wrote: > On Tue 19-12-17 14:39:22, Kemi Wang wrote: >> There is not really any use to get NUMA stats separated by zone, and >> current per-zone NUMA stats is only consumed in /proc/zoneinfo. For code >> cleanup purpose, we move NUMA stats from per-zone to per-node and reuse the >> existed per-cpu infrastructure. > > Let's hope that nobody really depends on the per-zone numbers. It would > be really strange as those counters are inherently per-node and that is > what users should care about but who knows... > > Anyway, I hoped we could get rid of NR_VM_NUMA_STAT_ITEMS but your patch > keeps it and follow up patches even use it further. I will comment on > those separately but this still makes these few counters really special > which I think is wrong. > Well, that is the best balance between performance and simplification I could come up with. If you have a better idea, please post it and I will surely follow it. >> Suggested-by: Andi Kleen <ak@linux.intel.com> >> Suggested-by: Michal Hocko <mhocko@kernel.com> >> Signed-off-by: Kemi Wang <kemi.wang@intel.com> > > I have to fully grasp the rest of the series before I'll give my Ack, > but I _really_ like the simplification this adds to the code. I believe > it can be even simpler.
> >> --- >> drivers/base/node.c | 23 +++---- >> include/linux/mmzone.h | 27 ++++---- >> include/linux/vmstat.h | 31 --------- >> mm/mempolicy.c | 2 +- >> mm/page_alloc.c | 16 +++-- >> mm/vmstat.c | 177 +++++-------------------------------------------- >> 6 files changed, 46 insertions(+), 230 deletions(-) >> >> diff --git a/drivers/base/node.c b/drivers/base/node.c >> index ee090ab..a045ea1 100644 >> --- a/drivers/base/node.c >> +++ b/drivers/base/node.c >> @@ -169,13 +169,14 @@ static ssize_t node_read_numastat(struct device *dev, >> "interleave_hit %lu\n" >> "local_node %lu\n" >> "other_node %lu\n", >> - sum_zone_numa_state(dev->id, NUMA_HIT), >> - sum_zone_numa_state(dev->id, NUMA_MISS), >> - sum_zone_numa_state(dev->id, NUMA_FOREIGN), >> - sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT), >> - sum_zone_numa_state(dev->id, NUMA_LOCAL), >> - sum_zone_numa_state(dev->id, NUMA_OTHER)); >> + node_page_state(NODE_DATA(dev->id), NUMA_HIT), >> + node_page_state(NODE_DATA(dev->id), NUMA_MISS), >> + node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN), >> + node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT), >> + node_page_state(NODE_DATA(dev->id), NUMA_LOCAL), >> + node_page_state(NODE_DATA(dev->id), NUMA_OTHER)); >> } >> + >> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL); >> >> static ssize_t node_read_vmstat(struct device *dev, >> @@ -190,17 +191,9 @@ static ssize_t node_read_vmstat(struct device *dev, >> n += sprintf(buf+n, "%s %lu\n", vmstat_text[i], >> sum_zone_node_page_state(nid, i)); >> >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - n += sprintf(buf+n, "%s %lu\n", >> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], >> - sum_zone_numa_state(nid, i)); >> -#endif >> - >> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) >> n += sprintf(buf+n, "%s %lu\n", >> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS + >> - NR_VM_NUMA_STAT_ITEMS], >> + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], >> node_page_state(pgdat, i)); >> >> return n; >> diff 
--git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index 67f2e3c..c06d880 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -115,20 +115,6 @@ struct zone_padding { >> #define ZONE_PADDING(name) >> #endif >> >> -#ifdef CONFIG_NUMA >> -enum numa_stat_item { >> - NUMA_HIT, /* allocated in intended node */ >> - NUMA_MISS, /* allocated in non intended node */ >> - NUMA_FOREIGN, /* was intended here, hit elsewhere */ >> - NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */ >> - NUMA_LOCAL, /* allocation from local node */ >> - NUMA_OTHER, /* allocation from other node */ >> - NR_VM_NUMA_STAT_ITEMS >> -}; >> -#else >> -#define NR_VM_NUMA_STAT_ITEMS 0 >> -#endif >> - >> enum zone_stat_item { >> /* First 128 byte cacheline (assuming 64 bit words) */ >> NR_FREE_PAGES, >> @@ -151,7 +137,18 @@ enum zone_stat_item { >> NR_VM_ZONE_STAT_ITEMS }; >> >> enum node_stat_item { >> - NR_LRU_BASE, >> +#ifdef CONFIG_NUMA >> + NUMA_HIT, /* allocated in intended node */ >> + NUMA_MISS, /* allocated in non intended node */ >> + NUMA_FOREIGN, /* was intended here, hit elsewhere */ >> + NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */ >> + NUMA_LOCAL, /* allocation from local node */ >> + NUMA_OTHER, /* allocation from other node */ >> + NR_VM_NUMA_STAT_ITEMS, >> +#else >> +#define NR_VM_NUMA_STAT_ITEMS 0 >> +#endif >> + NR_LRU_BASE = NR_VM_NUMA_STAT_ITEMS, >> NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */ >> NR_ACTIVE_ANON, /* " " " " " */ >> NR_INACTIVE_FILE, /* " " " " " */ >> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h >> index 1779c98..80bf290 100644 >> --- a/include/linux/vmstat.h >> +++ b/include/linux/vmstat.h >> @@ -118,37 +118,8 @@ static inline void vm_events_fold_cpu(int cpu) >> * Zone and node-based page accounting with per cpu differentials. 
>> */ >> extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS]; >> -extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS]; >> extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS]; >> >> -#ifdef CONFIG_NUMA >> -static inline void zone_numa_state_add(long x, struct zone *zone, >> - enum numa_stat_item item) >> -{ >> - atomic_long_add(x, &zone->vm_numa_stat[item]); >> - atomic_long_add(x, &vm_numa_stat[item]); >> -} >> - >> -static inline unsigned long global_numa_state(enum numa_stat_item item) >> -{ >> - long x = atomic_long_read(&vm_numa_stat[item]); >> - >> - return x; >> -} >> - >> -static inline unsigned long zone_numa_state_snapshot(struct zone *zone, >> - enum numa_stat_item item) >> -{ >> - long x = atomic_long_read(&zone->vm_numa_stat[item]); >> - int cpu; >> - >> - for_each_online_cpu(cpu) >> - x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item]; >> - >> - return x; >> -} >> -#endif /* CONFIG_NUMA */ >> - >> static inline void zone_page_state_add(long x, struct zone *zone, >> enum zone_stat_item item) >> { >> @@ -234,10 +205,8 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat, >> >> >> #ifdef CONFIG_NUMA >> -extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item); >> extern unsigned long sum_zone_node_page_state(int node, >> enum zone_stat_item item); >> -extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item); >> extern unsigned long node_page_state(struct pglist_data *pgdat, >> enum node_stat_item item); >> #else >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> index 4ce44d3..b2293e3 100644 >> --- a/mm/mempolicy.c >> +++ b/mm/mempolicy.c >> @@ -1920,7 +1920,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order, >> return page; >> if (page && page_to_nid(page) == nid) { >> preempt_disable(); >> - __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT); >> + inc_node_state(page_pgdat(page), NUMA_INTERLEAVE_HIT); >> preempt_enable(); >> } >> return page; >> 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 7e5e775..81e8d8f 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -2793,22 +2793,24 @@ int __isolate_free_page(struct page *page, unsigned int order) >> static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) >> { >> #ifdef CONFIG_NUMA >> - enum numa_stat_item local_stat = NUMA_LOCAL; >> + int preferred_nid = preferred_zone->node; >> + int nid = z->node; >> + enum node_stat_item local_stat = NUMA_LOCAL; >> >> /* skip numa counters update if numa stats is disabled */ >> if (!static_branch_likely(&vm_numa_stat_key)) >> return; >> >> - if (z->node != numa_node_id()) >> + if (nid != numa_node_id()) >> local_stat = NUMA_OTHER; >> >> - if (z->node == preferred_zone->node) >> - __inc_numa_state(z, NUMA_HIT); >> + if (nid == preferred_nid) >> + inc_node_state(NODE_DATA(nid), NUMA_HIT); >> else { >> - __inc_numa_state(z, NUMA_MISS); >> - __inc_numa_state(preferred_zone, NUMA_FOREIGN); >> + inc_node_state(NODE_DATA(nid), NUMA_MISS); >> + inc_node_state(NODE_DATA(preferred_nid), NUMA_FOREIGN); >> } >> - __inc_numa_state(z, local_stat); >> + inc_node_state(NODE_DATA(nid), local_stat); >> #endif >> } >> >> diff --git a/mm/vmstat.c b/mm/vmstat.c >> index 40b2db6..1dd12ae 100644 >> --- a/mm/vmstat.c >> +++ b/mm/vmstat.c >> @@ -30,46 +30,44 @@ >> >> #include "internal.h" >> >> -#define NUMA_STATS_THRESHOLD (U16_MAX - 2) >> - >> #ifdef CONFIG_NUMA >> int sysctl_vm_numa_stat = ENABLE_NUMA_STAT; >> >> -/* zero numa counters within a zone */ >> -static void zero_zone_numa_counters(struct zone *zone) >> +/* zero numa stats within a node */ >> +static void zero_node_numa_stats(int node) >> { >> int item, cpu; >> >> for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) { >> - atomic_long_set(&zone->vm_numa_stat[item], 0); >> + atomic_long_set(&(NODE_DATA(node)->vm_stat[item]), 0); >> for_each_online_cpu(cpu) >> - per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item] >> - = 0; >> + 
per_cpu_ptr(NODE_DATA(node)->per_cpu_nodestats, >> + cpu)->vm_node_stat_diff[item] = 0; >> } >> } >> >> -/* zero numa counters of all the populated zones */ >> -static void zero_zones_numa_counters(void) >> +/* zero numa stats of all the online nodes */ >> +static void zero_nodes_numa_stats(void) >> { >> - struct zone *zone; >> + int node; >> >> - for_each_populated_zone(zone) >> - zero_zone_numa_counters(zone); >> + for_each_online_node(node) >> + zero_node_numa_stats(node); >> } >> >> -/* zero global numa counters */ >> -static void zero_global_numa_counters(void) >> +/* zero global numa stats */ >> +static void zero_global_numa_stats(void) >> { >> int item; >> >> for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) >> - atomic_long_set(&vm_numa_stat[item], 0); >> + atomic_long_set(&vm_node_stat[item], 0); >> } >> >> static void invalid_numa_statistics(void) >> { >> - zero_zones_numa_counters(); >> - zero_global_numa_counters(); >> + zero_nodes_numa_stats(); >> + zero_global_numa_stats(); >> } >> >> static DEFINE_MUTEX(vm_numa_stat_lock); >> @@ -160,10 +158,8 @@ void vm_events_fold_cpu(int cpu) >> * vm_stat contains the global counters >> */ >> atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp; >> -atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp; >> atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp; >> EXPORT_SYMBOL(vm_zone_stat); >> -EXPORT_SYMBOL(vm_numa_stat); >> EXPORT_SYMBOL(vm_node_stat); >> >> #ifdef CONFIG_SMP >> @@ -679,32 +675,6 @@ EXPORT_SYMBOL(dec_node_page_state); >> * Fold a differential into the global counters. >> * Returns the number of counters updated. 
>> */ >> -#ifdef CONFIG_NUMA >> -static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff) >> -{ >> - int i; >> - int changes = 0; >> - >> - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) >> - if (zone_diff[i]) { >> - atomic_long_add(zone_diff[i], &vm_zone_stat[i]); >> - changes++; >> - } >> - >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - if (numa_diff[i]) { >> - atomic_long_add(numa_diff[i], &vm_numa_stat[i]); >> - changes++; >> - } >> - >> - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) >> - if (node_diff[i]) { >> - atomic_long_add(node_diff[i], &vm_node_stat[i]); >> - changes++; >> - } >> - return changes; >> -} >> -#else >> static int fold_diff(int *zone_diff, int *node_diff) >> { >> int i; >> @@ -723,7 +693,6 @@ static int fold_diff(int *zone_diff, int *node_diff) >> } >> return changes; >> } >> -#endif /* CONFIG_NUMA */ >> >> /* >> * Update the zone counters for the current cpu. >> @@ -747,9 +716,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets) >> struct zone *zone; >> int i; >> int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; >> -#ifdef CONFIG_NUMA >> - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; >> -#endif >> int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; >> int changes = 0; >> >> @@ -771,18 +737,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets) >> } >> } >> #ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { >> - int v; >> - >> - v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0); >> - if (v) { >> - >> - atomic_long_add(v, &zone->vm_numa_stat[i]); >> - global_numa_diff[i] += v; >> - __this_cpu_write(p->expire, 3); >> - } >> - } >> - >> if (do_pagesets) { >> cond_resched(); >> /* >> @@ -829,12 +783,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets) >> } >> } >> >> -#ifdef CONFIG_NUMA >> - changes += fold_diff(global_zone_diff, global_numa_diff, >> - global_node_diff); >> -#else >> changes += fold_diff(global_zone_diff, global_node_diff); >> -#endif >> return changes; >> } >> >> @@ -849,9 +798,6 
@@ void cpu_vm_stats_fold(int cpu) >> struct zone *zone; >> int i; >> int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; >> -#ifdef CONFIG_NUMA >> - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; >> -#endif >> int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; >> >> for_each_populated_zone(zone) { >> @@ -868,18 +814,6 @@ void cpu_vm_stats_fold(int cpu) >> atomic_long_add(v, &zone->vm_stat[i]); >> global_zone_diff[i] += v; >> } >> - >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - if (p->vm_numa_stat_diff[i]) { >> - int v; >> - >> - v = p->vm_numa_stat_diff[i]; >> - p->vm_numa_stat_diff[i] = 0; >> - atomic_long_add(v, &zone->vm_numa_stat[i]); >> - global_numa_diff[i] += v; >> - } >> -#endif >> } >> >> for_each_online_pgdat(pgdat) { >> @@ -898,11 +832,7 @@ void cpu_vm_stats_fold(int cpu) >> } >> } >> >> -#ifdef CONFIG_NUMA >> - fold_diff(global_zone_diff, global_numa_diff, global_node_diff); >> -#else >> fold_diff(global_zone_diff, global_node_diff); >> -#endif >> } >> >> /* >> @@ -920,36 +850,10 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) >> atomic_long_add(v, &zone->vm_stat[i]); >> atomic_long_add(v, &vm_zone_stat[i]); >> } >> - >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - if (pset->vm_numa_stat_diff[i]) { >> - int v = pset->vm_numa_stat_diff[i]; >> - >> - pset->vm_numa_stat_diff[i] = 0; >> - atomic_long_add(v, &zone->vm_numa_stat[i]); >> - atomic_long_add(v, &vm_numa_stat[i]); >> - } >> -#endif >> } >> #endif >> >> #ifdef CONFIG_NUMA >> -void __inc_numa_state(struct zone *zone, >> - enum numa_stat_item item) >> -{ >> - struct per_cpu_pageset __percpu *pcp = zone->pageset; >> - u16 __percpu *p = pcp->vm_numa_stat_diff + item; >> - u16 v; >> - >> - v = __this_cpu_inc_return(*p); >> - >> - if (unlikely(v > NUMA_STATS_THRESHOLD)) { >> - zone_numa_state_add(v, zone, item); >> - __this_cpu_write(*p, 0); >> - } >> -} >> - >> /* >> * Determine the per node value of a stat 
item. This function >> * is called frequently in a NUMA machine, so try to be as >> @@ -969,23 +873,6 @@ unsigned long sum_zone_node_page_state(int node, >> } >> >> /* >> - * Determine the per node value of a numa stat item. To avoid deviation, >> - * the per cpu stat number in vm_numa_stat_diff[] is also included. >> - */ >> -unsigned long sum_zone_numa_state(int node, >> - enum numa_stat_item item) >> -{ >> - struct zone *zones = NODE_DATA(node)->node_zones; >> - int i; >> - unsigned long count = 0; >> - >> - for (i = 0; i < MAX_NR_ZONES; i++) >> - count += zone_numa_state_snapshot(zones + i, item); >> - >> - return count; >> -} >> - >> -/* >> * Determine the per node value of a stat item. >> */ >> unsigned long node_page_state(struct pglist_data *pgdat, >> @@ -1569,8 +1456,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, >> seq_printf(m, "\n per-node stats"); >> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { >> seq_printf(m, "\n %-12s %lu", >> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS + >> - NR_VM_NUMA_STAT_ITEMS], >> + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], >> node_page_state(pgdat, i)); >> } >> } >> @@ -1607,13 +1493,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, >> seq_printf(m, "\n %-12s %lu", vmstat_text[i], >> zone_page_state(zone, i)); >> >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - seq_printf(m, "\n %-12s %lu", >> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], >> - zone_numa_state_snapshot(zone, i)); >> -#endif >> - >> seq_printf(m, "\n pagesets"); >> for_each_online_cpu(i) { >> struct per_cpu_pageset *pageset; >> @@ -1688,7 +1567,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) >> if (*pos >= ARRAY_SIZE(vmstat_text)) >> return NULL; >> stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) + >> - NR_VM_NUMA_STAT_ITEMS * sizeof(unsigned long) + >> NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) + >> NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long); >> 
>> @@ -1704,12 +1582,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) >> v[i] = global_zone_page_state(i); >> v += NR_VM_ZONE_STAT_ITEMS; >> >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - v[i] = global_numa_state(i); >> - v += NR_VM_NUMA_STAT_ITEMS; >> -#endif >> - >> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) >> v[i] = global_node_page_state(i); >> v += NR_VM_NODE_STAT_ITEMS; >> @@ -1811,16 +1683,6 @@ int vmstat_refresh(struct ctl_table *table, int write, >> err = -EINVAL; >> } >> } >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { >> - val = atomic_long_read(&vm_numa_stat[i]); >> - if (val < 0) { >> - pr_warn("%s: %s %ld\n", >> - __func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val); >> - err = -EINVAL; >> - } >> - } >> -#endif >> if (err) >> return err; >> if (write) >> @@ -1862,9 +1724,6 @@ static bool need_update(int cpu) >> struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu); >> >> BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1); >> -#ifdef CONFIG_NUMA >> - BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2); >> -#endif >> >> /* >> * The fast way of checking if there are any vmstat diffs. >> @@ -1872,10 +1731,6 @@ static bool need_update(int cpu) >> */ >> if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS)) >> return true; >> -#ifdef CONFIG_NUMA >> - if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS)) >> - return true; >> -#endif >> } >> return false; >> } >> -- >> 2.7.4 >> >
diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 7e5e775..81e8d8f 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -2793,22 +2793,24 @@ int __isolate_free_page(struct page *page, unsigned int order) >> static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) >> { >> #ifdef CONFIG_NUMA >> - enum numa_stat_item local_stat = NUMA_LOCAL; >> + int preferred_nid = preferred_zone->node; >> + int nid = z->node; >> + enum node_stat_item local_stat = NUMA_LOCAL; >> >> /* skip numa counters update if numa stats is disabled */ >> if (!static_branch_likely(&vm_numa_stat_key)) >> return; >> >> - if (z->node != numa_node_id()) >> + if (nid != numa_node_id()) >> local_stat = NUMA_OTHER; >> >> - if (z->node == preferred_zone->node) >> - __inc_numa_state(z, NUMA_HIT); >> + if (nid == preferred_nid) >> + inc_node_state(NODE_DATA(nid), NUMA_HIT); >> else { >> - __inc_numa_state(z, NUMA_MISS); >> - __inc_numa_state(preferred_zone, NUMA_FOREIGN); >> + inc_node_state(NODE_DATA(nid), NUMA_MISS); >> + inc_node_state(NODE_DATA(preferred_nid), NUMA_FOREIGN); >> } >> - __inc_numa_state(z, local_stat); >> + inc_node_state(NODE_DATA(nid), local_stat); >> #endif >> } >> >> diff --git a/mm/vmstat.c b/mm/vmstat.c >> index 40b2db6..1dd12ae 100644 >> --- a/mm/vmstat.c >> +++ b/mm/vmstat.c >> @@ -30,46 +30,44 @@ >> >> #include "internal.h" >> >> -#define NUMA_STATS_THRESHOLD (U16_MAX - 2) >> - >> #ifdef CONFIG_NUMA >> int sysctl_vm_numa_stat = ENABLE_NUMA_STAT; >> >> -/* zero numa counters within a zone */ >> -static void zero_zone_numa_counters(struct zone *zone) >> +/* zero numa stats within a node */ >> +static void zero_node_numa_stats(int node) >> { >> int item, cpu; >> >> for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) { >> - atomic_long_set(&zone->vm_numa_stat[item], 0); >> + atomic_long_set(&(NODE_DATA(node)->vm_stat[item]), 0); >> for_each_online_cpu(cpu) >> - per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item] >> - = 0; >> + 
per_cpu_ptr(NODE_DATA(node)->per_cpu_nodestats, >> + cpu)->vm_node_stat_diff[item] = 0; >> } >> } >> >> -/* zero numa counters of all the populated zones */ >> -static void zero_zones_numa_counters(void) >> +/* zero numa stats of all the online nodes */ >> +static void zero_nodes_numa_stats(void) >> { >> - struct zone *zone; >> + int node; >> >> - for_each_populated_zone(zone) >> - zero_zone_numa_counters(zone); >> + for_each_online_node(node) >> + zero_node_numa_stats(node); >> } >> >> -/* zero global numa counters */ >> -static void zero_global_numa_counters(void) >> +/* zero global numa stats */ >> +static void zero_global_numa_stats(void) >> { >> int item; >> >> for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) >> - atomic_long_set(&vm_numa_stat[item], 0); >> + atomic_long_set(&vm_node_stat[item], 0); >> } >> >> static void invalid_numa_statistics(void) >> { >> - zero_zones_numa_counters(); >> - zero_global_numa_counters(); >> + zero_nodes_numa_stats(); >> + zero_global_numa_stats(); >> } >> >> static DEFINE_MUTEX(vm_numa_stat_lock); >> @@ -160,10 +158,8 @@ void vm_events_fold_cpu(int cpu) >> * vm_stat contains the global counters >> */ >> atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp; >> -atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp; >> atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp; >> EXPORT_SYMBOL(vm_zone_stat); >> -EXPORT_SYMBOL(vm_numa_stat); >> EXPORT_SYMBOL(vm_node_stat); >> >> #ifdef CONFIG_SMP >> @@ -679,32 +675,6 @@ EXPORT_SYMBOL(dec_node_page_state); >> * Fold a differential into the global counters. >> * Returns the number of counters updated. 
>> */ >> -#ifdef CONFIG_NUMA >> -static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff) >> -{ >> - int i; >> - int changes = 0; >> - >> - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) >> - if (zone_diff[i]) { >> - atomic_long_add(zone_diff[i], &vm_zone_stat[i]); >> - changes++; >> - } >> - >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - if (numa_diff[i]) { >> - atomic_long_add(numa_diff[i], &vm_numa_stat[i]); >> - changes++; >> - } >> - >> - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) >> - if (node_diff[i]) { >> - atomic_long_add(node_diff[i], &vm_node_stat[i]); >> - changes++; >> - } >> - return changes; >> -} >> -#else >> static int fold_diff(int *zone_diff, int *node_diff) >> { >> int i; >> @@ -723,7 +693,6 @@ static int fold_diff(int *zone_diff, int *node_diff) >> } >> return changes; >> } >> -#endif /* CONFIG_NUMA */ >> >> /* >> * Update the zone counters for the current cpu. >> @@ -747,9 +716,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets) >> struct zone *zone; >> int i; >> int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; >> -#ifdef CONFIG_NUMA >> - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; >> -#endif >> int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; >> int changes = 0; >> >> @@ -771,18 +737,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets) >> } >> } >> #ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { >> - int v; >> - >> - v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0); >> - if (v) { >> - >> - atomic_long_add(v, &zone->vm_numa_stat[i]); >> - global_numa_diff[i] += v; >> - __this_cpu_write(p->expire, 3); >> - } >> - } >> - >> if (do_pagesets) { >> cond_resched(); >> /* >> @@ -829,12 +783,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets) >> } >> } >> >> -#ifdef CONFIG_NUMA >> - changes += fold_diff(global_zone_diff, global_numa_diff, >> - global_node_diff); >> -#else >> changes += fold_diff(global_zone_diff, global_node_diff); >> -#endif >> return changes; >> } >> >> @@ -849,9 +798,6 
@@ void cpu_vm_stats_fold(int cpu) >> struct zone *zone; >> int i; >> int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; >> -#ifdef CONFIG_NUMA >> - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; >> -#endif >> int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; >> >> for_each_populated_zone(zone) { >> @@ -868,18 +814,6 @@ void cpu_vm_stats_fold(int cpu) >> atomic_long_add(v, &zone->vm_stat[i]); >> global_zone_diff[i] += v; >> } >> - >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - if (p->vm_numa_stat_diff[i]) { >> - int v; >> - >> - v = p->vm_numa_stat_diff[i]; >> - p->vm_numa_stat_diff[i] = 0; >> - atomic_long_add(v, &zone->vm_numa_stat[i]); >> - global_numa_diff[i] += v; >> - } >> -#endif >> } >> >> for_each_online_pgdat(pgdat) { >> @@ -898,11 +832,7 @@ void cpu_vm_stats_fold(int cpu) >> } >> } >> >> -#ifdef CONFIG_NUMA >> - fold_diff(global_zone_diff, global_numa_diff, global_node_diff); >> -#else >> fold_diff(global_zone_diff, global_node_diff); >> -#endif >> } >> >> /* >> @@ -920,36 +850,10 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) >> atomic_long_add(v, &zone->vm_stat[i]); >> atomic_long_add(v, &vm_zone_stat[i]); >> } >> - >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - if (pset->vm_numa_stat_diff[i]) { >> - int v = pset->vm_numa_stat_diff[i]; >> - >> - pset->vm_numa_stat_diff[i] = 0; >> - atomic_long_add(v, &zone->vm_numa_stat[i]); >> - atomic_long_add(v, &vm_numa_stat[i]); >> - } >> -#endif >> } >> #endif >> >> #ifdef CONFIG_NUMA >> -void __inc_numa_state(struct zone *zone, >> - enum numa_stat_item item) >> -{ >> - struct per_cpu_pageset __percpu *pcp = zone->pageset; >> - u16 __percpu *p = pcp->vm_numa_stat_diff + item; >> - u16 v; >> - >> - v = __this_cpu_inc_return(*p); >> - >> - if (unlikely(v > NUMA_STATS_THRESHOLD)) { >> - zone_numa_state_add(v, zone, item); >> - __this_cpu_write(*p, 0); >> - } >> -} >> - >> /* >> * Determine the per node value of a stat 
item. This function >> * is called frequently in a NUMA machine, so try to be as >> @@ -969,23 +873,6 @@ unsigned long sum_zone_node_page_state(int node, >> } >> >> /* >> - * Determine the per node value of a numa stat item. To avoid deviation, >> - * the per cpu stat number in vm_numa_stat_diff[] is also included. >> - */ >> -unsigned long sum_zone_numa_state(int node, >> - enum numa_stat_item item) >> -{ >> - struct zone *zones = NODE_DATA(node)->node_zones; >> - int i; >> - unsigned long count = 0; >> - >> - for (i = 0; i < MAX_NR_ZONES; i++) >> - count += zone_numa_state_snapshot(zones + i, item); >> - >> - return count; >> -} >> - >> -/* >> * Determine the per node value of a stat item. >> */ >> unsigned long node_page_state(struct pglist_data *pgdat, >> @@ -1569,8 +1456,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, >> seq_printf(m, "\n per-node stats"); >> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { >> seq_printf(m, "\n %-12s %lu", >> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS + >> - NR_VM_NUMA_STAT_ITEMS], >> + vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], >> node_page_state(pgdat, i)); >> } >> } >> @@ -1607,13 +1493,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, >> seq_printf(m, "\n %-12s %lu", vmstat_text[i], >> zone_page_state(zone, i)); >> >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - seq_printf(m, "\n %-12s %lu", >> - vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], >> - zone_numa_state_snapshot(zone, i)); >> -#endif >> - >> seq_printf(m, "\n pagesets"); >> for_each_online_cpu(i) { >> struct per_cpu_pageset *pageset; >> @@ -1688,7 +1567,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) >> if (*pos >= ARRAY_SIZE(vmstat_text)) >> return NULL; >> stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) + >> - NR_VM_NUMA_STAT_ITEMS * sizeof(unsigned long) + >> NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) + >> NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long); >> 
>> @@ -1704,12 +1582,6 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos) >> v[i] = global_zone_page_state(i); >> v += NR_VM_ZONE_STAT_ITEMS; >> >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) >> - v[i] = global_numa_state(i); >> - v += NR_VM_NUMA_STAT_ITEMS; >> -#endif >> - >> for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) >> v[i] = global_node_page_state(i); >> v += NR_VM_NODE_STAT_ITEMS; >> @@ -1811,16 +1683,6 @@ int vmstat_refresh(struct ctl_table *table, int write, >> err = -EINVAL; >> } >> } >> -#ifdef CONFIG_NUMA >> - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { >> - val = atomic_long_read(&vm_numa_stat[i]); >> - if (val < 0) { >> - pr_warn("%s: %s %ld\n", >> - __func__, vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], val); >> - err = -EINVAL; >> - } >> - } >> -#endif >> if (err) >> return err; >> if (write) >> @@ -1862,9 +1724,6 @@ static bool need_update(int cpu) >> struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu); >> >> BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1); >> -#ifdef CONFIG_NUMA >> - BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2); >> -#endif >> >> /* >> * The fast way of checking if there are any vmstat diffs. >> @@ -1872,10 +1731,6 @@ static bool need_update(int cpu) >> */ >> if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS)) >> return true; >> -#ifdef CONFIG_NUMA >> - if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS)) >> - return true; >> -#endif >> } >> return false; >> } >> -- >> 2.7.4 >> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16 2017-12-19 6:39 ` Kemi Wang @ 2017-12-19 6:39 ` Kemi Wang -1 siblings, 0 replies; 68+ messages in thread From: Kemi Wang @ 2017-12-19 6:39 UTC (permalink / raw) To: Greg Kroah-Hartman, Andrew Morton, Michal Hocko, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Kemi Wang, Linux MM, Linux Kernel The type s8 used for vm_diff_nodestat[] as local cpu counters has the limitation of global counters update frequency, especially for those monotone increasing type of counters like NUMA counters with more and more cpus/nodes. This patch extends the type of vm_diff_nodestat from s8 to s16 without any functionality change. before after sizeof(struct per_cpu_nodestat) 28 68 Signed-off-by: Kemi Wang <kemi.wang@intel.com> --- include/linux/mmzone.h | 4 ++-- mm/vmstat.c | 16 ++++++++-------- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c06d880..2da6b6f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -289,8 +289,8 @@ struct per_cpu_pageset { }; struct per_cpu_nodestat { - s8 stat_threshold; - s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS]; + s16 stat_threshold; + s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS]; }; #endif /* !__GENERATING_BOUNDS.H */ diff --git a/mm/vmstat.c b/mm/vmstat.c index 1dd12ae..9c681cc 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -332,7 +332,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item, long delta) { struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; - s8 __percpu *p = pcp->vm_node_stat_diff + item; + s16 __percpu *p = pcp->vm_node_stat_diff + item; long x; long t; @@ -390,13 +390,13 @@ void __inc_zone_state(struct zone 
*zone, enum zone_stat_item item) void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) { struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; - s8 __percpu *p = pcp->vm_node_stat_diff + item; - s8 v, t; + s16 __percpu *p = pcp->vm_node_stat_diff + item; + s16 v, t; v = __this_cpu_inc_return(*p); t = __this_cpu_read(pcp->stat_threshold); if (unlikely(v > t)) { - s8 overstep = t >> 1; + s16 overstep = t >> 1; node_page_state_add(v + overstep, pgdat, item); __this_cpu_write(*p, -overstep); @@ -434,13 +434,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item) void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item) { struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; - s8 __percpu *p = pcp->vm_node_stat_diff + item; - s8 v, t; + s16 __percpu *p = pcp->vm_node_stat_diff + item; + s16 v, t; v = __this_cpu_dec_return(*p); t = __this_cpu_read(pcp->stat_threshold); if (unlikely(v < - t)) { - s8 overstep = t >> 1; + s16 overstep = t >> 1; node_page_state_add(v - overstep, pgdat, item); __this_cpu_write(*p, overstep); @@ -533,7 +533,7 @@ static inline void mod_node_state(struct pglist_data *pgdat, enum node_stat_item item, int delta, int overstep_mode) { struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; - s8 __percpu *p = pcp->vm_node_stat_diff + item; + s16 __percpu *p = pcp->vm_node_stat_diff + item; long o, n, t, z; do { -- 2.7.4 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16 2017-12-19 6:39 ` Kemi Wang @ 2017-12-19 12:38 ` Michal Hocko -1 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2017-12-19 12:38 UTC (permalink / raw) To: Kemi Wang Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Tue 19-12-17 14:39:23, Kemi Wang wrote: > The type s8 used for vm_diff_nodestat[] as local cpu counters has the > limitation of global counters update frequency, especially for those > monotone increasing type of counters like NUMA counters with more and more > cpus/nodes. This patch extends the type of vm_diff_nodestat from s8 to s16 > without any functionality change. > > before after > sizeof(struct per_cpu_nodestat) 28 68 So it is 40B * num_cpus * num_nodes. Nothing really catastrophic IMHO but the changelog is a bit silent about any numbers. This is a performance optimization so it should better give us some. 
> Signed-off-by: Kemi Wang <kemi.wang@intel.com> > --- > include/linux/mmzone.h | 4 ++-- > mm/vmstat.c | 16 ++++++++-------- > 2 files changed, 10 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index c06d880..2da6b6f 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -289,8 +289,8 @@ struct per_cpu_pageset { > }; > > struct per_cpu_nodestat { > - s8 stat_threshold; > - s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS]; > + s16 stat_threshold; > + s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS]; > }; > > #endif /* !__GENERATING_BOUNDS.H */ > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 1dd12ae..9c681cc 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -332,7 +332,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item, > long delta) > { > struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; > - s8 __percpu *p = pcp->vm_node_stat_diff + item; > + s16 __percpu *p = pcp->vm_node_stat_diff + item; > long x; > long t; > > @@ -390,13 +390,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item) > void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) > { > struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; > - s8 __percpu *p = pcp->vm_node_stat_diff + item; > - s8 v, t; > + s16 __percpu *p = pcp->vm_node_stat_diff + item; > + s16 v, t; > > v = __this_cpu_inc_return(*p); > t = __this_cpu_read(pcp->stat_threshold); > if (unlikely(v > t)) { > - s8 overstep = t >> 1; > + s16 overstep = t >> 1; > > node_page_state_add(v + overstep, pgdat, item); > __this_cpu_write(*p, -overstep); > @@ -434,13 +434,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item) > void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item) > { > struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; > - s8 __percpu *p = pcp->vm_node_stat_diff + item; > - s8 v, t; > + s16 __percpu *p = 
pcp->vm_node_stat_diff + item; > + s16 v, t; > > v = __this_cpu_dec_return(*p); > t = __this_cpu_read(pcp->stat_threshold); > if (unlikely(v < - t)) { > - s8 overstep = t >> 1; > + s16 overstep = t >> 1; > > node_page_state_add(v - overstep, pgdat, item); > __this_cpu_write(*p, overstep); > @@ -533,7 +533,7 @@ static inline void mod_node_state(struct pglist_data *pgdat, > enum node_stat_item item, int delta, int overstep_mode) > { > struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; > - s8 __percpu *p = pcp->vm_node_stat_diff + item; > + s16 __percpu *p = pcp->vm_node_stat_diff + item; > long o, n, t, z; > > do { > -- > 2.7.4 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16 2017-12-19 12:38 ` Michal Hocko @ 2017-12-20 3:05 ` kemi -1 siblings, 0 replies; 68+ messages in thread From: kemi @ 2017-12-20 3:05 UTC (permalink / raw) To: Michal Hocko Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On 2017年12月19日 20:38, Michal Hocko wrote: > On Tue 19-12-17 14:39:23, Kemi Wang wrote: >> The type s8 used for vm_diff_nodestat[] as local cpu counters has the >> limitation of global counters update frequency, especially for those >> monotone increasing type of counters like NUMA counters with more and more >> cpus/nodes. This patch extends the type of vm_diff_nodestat from s8 to s16 >> without any functionality change. >> >> before after >> sizeof(struct per_cpu_nodestat) 28 68 > > So it is 40B * num_cpus * num_nodes. Nothing really catastrophic IMHO > but the changelog is a bit silent about any numbers. This is a > performance optimization so it should better give us some. > This patch does not have any functionality change. So no performance gain I suppose. I guess you are talking about performance gain from the third patch which increases threshold size of NUMA counters. 
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com> >> --- >> include/linux/mmzone.h | 4 ++-- >> mm/vmstat.c | 16 ++++++++-------- >> 2 files changed, 10 insertions(+), 10 deletions(-) >> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index c06d880..2da6b6f 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -289,8 +289,8 @@ struct per_cpu_pageset { >> }; >> >> struct per_cpu_nodestat { >> - s8 stat_threshold; >> - s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS]; >> + s16 stat_threshold; >> + s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS]; >> }; >> >> #endif /* !__GENERATING_BOUNDS.H */ >> diff --git a/mm/vmstat.c b/mm/vmstat.c >> index 1dd12ae..9c681cc 100644 >> --- a/mm/vmstat.c >> +++ b/mm/vmstat.c >> @@ -332,7 +332,7 @@ void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item, >> long delta) >> { >> struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; >> - s8 __percpu *p = pcp->vm_node_stat_diff + item; >> + s16 __percpu *p = pcp->vm_node_stat_diff + item; >> long x; >> long t; >> >> @@ -390,13 +390,13 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item) >> void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) >> { >> struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; >> - s8 __percpu *p = pcp->vm_node_stat_diff + item; >> - s8 v, t; >> + s16 __percpu *p = pcp->vm_node_stat_diff + item; >> + s16 v, t; >> >> v = __this_cpu_inc_return(*p); >> t = __this_cpu_read(pcp->stat_threshold); >> if (unlikely(v > t)) { >> - s8 overstep = t >> 1; >> + s16 overstep = t >> 1; >> >> node_page_state_add(v + overstep, pgdat, item); >> __this_cpu_write(*p, -overstep); >> @@ -434,13 +434,13 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item) >> void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item) >> { >> struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; >> - s8 __percpu *p = pcp->vm_node_stat_diff + 
item; >> - s8 v, t; >> + s16 __percpu *p = pcp->vm_node_stat_diff + item; >> + s16 v, t; >> >> v = __this_cpu_dec_return(*p); >> t = __this_cpu_read(pcp->stat_threshold); >> if (unlikely(v < - t)) { >> - s8 overstep = t >> 1; >> + s16 overstep = t >> 1; >> >> node_page_state_add(v - overstep, pgdat, item); >> __this_cpu_write(*p, overstep); >> @@ -533,7 +533,7 @@ static inline void mod_node_state(struct pglist_data *pgdat, >> enum node_stat_item item, int delta, int overstep_mode) >> { >> struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; >> - s8 __percpu *p = pcp->vm_node_stat_diff + item; >> + s16 __percpu *p = pcp->vm_node_stat_diff + item; >> long o, n, t, z; >> >> do { >> -- >> 2.7.4 >> > ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16 2017-12-19 6:39 ` Kemi Wang @ 2017-12-19 16:05 ` Christopher Lameter 0 siblings, 0 replies; 68+ messages in thread From: Christopher Lameter @ 2017-12-19 16:05 UTC (permalink / raw) To: Kemi Wang Cc: Greg Kroah-Hartman, Andrew Morton, Michal Hocko, Vlastimil Babka, Mel Gorman, Johannes Weiner, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Tue, 19 Dec 2017, Kemi Wang wrote: > The type s8 used for vm_diff_nodestat[] as local cpu counters has the > limitation of global counters update frequency, especially for those > monotone increasing type of counters like NUMA counters with more and more > cpus/nodes. This patch extends the type of vm_diff_nodestat from s8 to s16 > without any functionality change. Well the reason for s8 was to keep the data structures small so that they fit in the higher level cpu caches. The larger these structures become, the more cachelines are used by the counters and the larger the performance impact on code that should not be affected by this overhead. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16 2017-12-19 16:05 ` Christopher Lameter @ 2017-12-19 16:20 ` Michal Hocko 0 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2017-12-19 16:20 UTC (permalink / raw) To: Christopher Lameter Cc: Kemi Wang, Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Tue 19-12-17 10:05:48, Christopher Lameter wrote: > On Tue, 19 Dec 2017, Kemi Wang wrote: > > > The type s8 used for vm_diff_nodestat[] as local cpu counters has the > > limitation of global counters update frequency, especially for those > > monotone increasing type of counters like NUMA counters with more and more > > cpus/nodes. This patch extends the type of vm_diff_nodestat from s8 to s16 > > without any functionality change. > > Well the reason for s8 was to keep the data structures small so that they > fit in the higher level cpu caches. The larger these structures become, the > more cachelines are used by the counters and the larger the performance > impact on code that should not be affected by this overhead. I am not sure I understand. We usually do not access multiple counters in a single code path (well, PGALLOC and the NUMA counters are more of an exception). So it is rarely an advantage that the whole array is in the same cache line. Besides that, this is allocated by the percpu allocator, which aligns to the type size rather than cache lines AFAICS. Maybe it used to be all different back when the code was added, but arguing about cache lines seems to be a bit problematic here. Maybe you have some specific workloads which can prove me wrong? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16 2017-12-19 16:20 ` Michal Hocko @ 2017-12-19 17:21 ` Christopher Lameter 0 siblings, 0 replies; 68+ messages in thread From: Christopher Lameter @ 2017-12-19 17:21 UTC (permalink / raw) To: Michal Hocko Cc: Kemi Wang, Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Tue, 19 Dec 2017, Michal Hocko wrote: > > Well the reason for s8 was to keep the data structures small so that they > > fit in the higher level cpu caches. The larger these structures become, the > > more cachelines are used by the counters and the larger the performance > > impact on code that should not be affected by this overhead. > > I am not sure I understand. We usually do not access multiple counters in > a single code path (well, PGALLOC and the NUMA counters are more of an > exception). So it is rarely an advantage that the whole array is in the > same cache line. Besides that, this is allocated by the percpu allocator, > which aligns to the type size rather than cache lines AFAICS. I thought we were talking about NUMA counters here? Regardless: a typical fault, system call or OS action will access multiple zone and node counters when allocating or freeing memory. Enlarging the fields will increase the number of cachelines touched. > Maybe it used to be all different back when the code was added, > but arguing about cache lines seems to be a bit problematic here. Maybe > you have some specific workloads which can prove me wrong? Run a workload that does some page faults? Heavy allocation and freeing of memory? Maybe that is no longer relevant since the number of counters is now so large that accesses are sparse enough for each action to pull in a whole cacheline.
That would be something we tried to avoid when implementing the differentials. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 2/5] mm: Extends local cpu counter vm_diff_nodestat from s8 to s16 2017-12-19 17:21 ` Christopher Lameter @ 2017-12-20 6:45 ` kemi 0 siblings, 0 replies; 68+ messages in thread From: kemi @ 2017-12-20 6:45 UTC (permalink / raw) To: Christopher Lameter, Michal Hocko Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On 2017-12-20 01:21, Christopher Lameter wrote: > On Tue, 19 Dec 2017, Michal Hocko wrote: > >>> Well the reason for s8 was to keep the data structures small so that they >>> fit in the higher level cpu caches. The larger these structures become, the >>> more cachelines are used by the counters and the larger the performance >>> impact on code that should not be affected by this overhead. >> >> I am not sure I understand. We usually do not access multiple counters in >> a single code path (well, PGALLOC and the NUMA counters are more of an >> exception). So it is rarely an advantage that the whole array is in the >> same cache line. Besides that, this is allocated by the percpu allocator, >> which aligns to the type size rather than cache lines AFAICS. > > I thought we were talking about NUMA counters here? > > Regardless: a typical fault, system call or OS action will access multiple > zone and node counters when allocating or freeing memory. Enlarging the > fields will increase the number of cachelines touched. > Yes, in theory we add one more cache line to the access footprint, but I don't think it will be a problem. 1) Not all of the counters need to be accessed in the fast path of page allocation; the counters covered by a single cache line are usually enough for that, so we probably do not need to touch the extra cache line. I tend to agree with Michal's argument here.
Besides, in slow paths where the code is protected by the zone lock or the lru lock, accessing one more cache line would not be a big problem, since many other cache lines are accessed there anyway. 2) Enlarging vm_node_stat_diff from s8 to s16 allows a larger value to accumulate in the local per-cpu counters, which makes it possible to reduce the global counter update frequency. Thus, we gain by reducing expensive cache bouncing. Well, if you still have some concerns, I can post some data for will-it-scale.page_fault1. What the benchmark does is: it forks nr_cpu processes and then each process does the following: 1 mmap() 128M anonymous space; 2 writes to each page there to trigger actual page allocation; 3 munmap() it. in a loop. https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c Or you can suggest other benchmarks on which you want to see the performance impact. >> Maybe it used to be all different back when the code was added, >> but arguing about cache lines seems to be a bit problematic here. Maybe >> you have some specific workloads which can prove me wrong? > > Run a workload that does some page faults? Heavy allocation and freeing of > memory? > > Maybe that is no longer relevant since the number of counters is > now so large that accesses are sparse enough for each action to pull in a whole > cacheline. That would be something we tried to avoid when implementing > the differentials. > > ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-19 6:39 ` Kemi Wang @ 2017-12-19 6:39 ` Kemi Wang 0 siblings, 0 replies; 68+ messages in thread From: Kemi Wang @ 2017-12-19 6:39 UTC (permalink / raw) To: Greg Kroah-Hartman, Andrew Morton, Michal Hocko, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Kemi Wang, Linux MM, Linux Kernel We have seen significant overhead in cache bouncing caused by NUMA counter updates in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm: update NUMA counter threshold size")' for more details. This patch updates the NUMA counters to a fixed threshold of (S16_MAX - 2) and handles global counter updates using different threshold sizes for node page stats. Signed-off-by: Kemi Wang <kemi.wang@intel.com> --- mm/vmstat.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/mm/vmstat.c b/mm/vmstat.c index 9c681cc..64e08ae 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -30,6 +30,8 @@ #include "internal.h" +#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2) + #ifdef CONFIG_NUMA int sysctl_vm_numa_stat = ENABLE_NUMA_STAT; @@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) s16 v, t; v = __this_cpu_inc_return(*p); - t = __this_cpu_read(pcp->stat_threshold); + if (item >= NR_VM_NUMA_STAT_ITEMS) + t = __this_cpu_read(pcp->stat_threshold); + else + t = VM_NUMA_STAT_THRESHOLD; + if (unlikely(v > t)) { s16 overstep = t >> 1; @@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat, * Most of the time the thresholds are the same anyways * for all cpus in a node. 
*/ - t = this_cpu_read(pcp->stat_threshold); + if (item >= NR_VM_NUMA_STAT_ITEMS) + t = this_cpu_read(pcp->stat_threshold); + else + t = VM_NUMA_STAT_THRESHOLD; o = this_cpu_read(*p); n = delta + o; -- 2.7.4 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-19 6:39 ` Kemi Wang @ 2017-12-19 12:40 ` Michal Hocko 0 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2017-12-19 12:40 UTC (permalink / raw) To: Kemi Wang Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Tue 19-12-17 14:39:24, Kemi Wang wrote: > We have seen significant overhead in cache bouncing caused by NUMA counter > updates in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm: > update NUMA counter threshold size")' for more details. > > This patch updates the NUMA counters to a fixed threshold of (S16_MAX - 2) and handles > global counter updates using different threshold sizes for node page > stats. Again, no numbers. To be honest I do not really like the special casing here. Why are numa counters any different from PGALLOC which is incremented for _every_ single page allocation? 
> --- > mm/vmstat.c | 13 +++++++++++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 9c681cc..64e08ae 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -30,6 +30,8 @@ > > #include "internal.h" > > +#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2) > + > #ifdef CONFIG_NUMA > int sysctl_vm_numa_stat = ENABLE_NUMA_STAT; > > @@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item) > s16 v, t; > > v = __this_cpu_inc_return(*p); > - t = __this_cpu_read(pcp->stat_threshold); > + if (item >= NR_VM_NUMA_STAT_ITEMS) > + t = __this_cpu_read(pcp->stat_threshold); > + else > + t = VM_NUMA_STAT_THRESHOLD; > + > if (unlikely(v > t)) { > s16 overstep = t >> 1; > > @@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat, > * Most of the time the thresholds are the same anyways > * for all cpus in a node. > */ > - t = this_cpu_read(pcp->stat_threshold); > + if (item >= NR_VM_NUMA_STAT_ITEMS) > + t = this_cpu_read(pcp->stat_threshold); > + else > + t = VM_NUMA_STAT_THRESHOLD; > > o = this_cpu_read(*p); > n = delta + o; > -- > 2.7.4 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-19 12:40 ` Michal Hocko @ 2017-12-20 5:52 ` kemi -1 siblings, 0 replies; 68+ messages in thread From: kemi @ 2017-12-20 5:52 UTC (permalink / raw) To: Michal Hocko Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On 2017年12月19日 20:40, Michal Hocko wrote: > On Tue 19-12-17 14:39:24, Kemi Wang wrote: >> We have seen significant overhead in cache bouncing caused by NUMA counters >> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm: >> update NUMA counter threshold size")' for more details. >> >> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals >> with global counter update using different threshold size for node page >> stats. > > Again, no numbers. Compared to the vanilla kernel, I don't think it has a performance improvement, so I didn't post performance data here. But if you would like to see the performance gain from enlarging the threshold size for NUMA stats (compared to the first patch), I will do that later. > To be honest I do not really like the special casing > here. Why are numa counters any different from PGALLOC which is > incremented for _every_ single page allocation? > I guess you mean the PGALLOC event. The count for this event is kept per cpu and summed up (for_each_online_cpu) when needed. That is similar to the approach I used for NUMA stats in the V1 patch series. Good enough. 
>> ---
>>  mm/vmstat.c | 13 +++++++++++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 9c681cc..64e08ae 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>
>>  #include "internal.h"
>>
>> +#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
>> +
>>  #ifdef CONFIG_NUMA
>>  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>>
>> @@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>>  	s16 v, t;
>>
>>  	v = __this_cpu_inc_return(*p);
>> -	t = __this_cpu_read(pcp->stat_threshold);
>> +	if (item >= NR_VM_NUMA_STAT_ITEMS)
>> +		t = __this_cpu_read(pcp->stat_threshold);
>> +	else
>> +		t = VM_NUMA_STAT_THRESHOLD;
>> +
>>  	if (unlikely(v > t)) {
>>  		s16 overstep = t >> 1;
>>
>> @@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>>  	 * Most of the time the thresholds are the same anyways
>>  	 * for all cpus in a node.
>>  	 */
>> -	t = this_cpu_read(pcp->stat_threshold);
>> +	if (item >= NR_VM_NUMA_STAT_ITEMS)
>> +		t = this_cpu_read(pcp->stat_threshold);
>> +	else
>> +		t = VM_NUMA_STAT_THRESHOLD;
>>
>>  	o = this_cpu_read(*p);
>>  	n = delta + o;
>> --
>> 2.7.4
>>
>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-20 5:52 ` kemi @ 2017-12-20 10:12 ` Michal Hocko -1 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2017-12-20 10:12 UTC (permalink / raw) To: kemi Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Wed 20-12-17 13:52:14, kemi wrote: > > > On 2017年12月19日 20:40, Michal Hocko wrote: > > On Tue 19-12-17 14:39:24, Kemi Wang wrote: > >> We have seen significant overhead in cache bouncing caused by NUMA counters > >> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm: > >> update NUMA counter threshold size")' for more details. > >> > >> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals > >> with global counter update using different threshold size for node page > >> stats. > > > > Again, no numbers. > > Compare to vanilla kernel, I don't think it has performance improvement, so > I didn't post performance data here. > But, if you would like to see performance gain from enlarging threshold size > for NUMA stats (compare to the first patch), I will do that later. Please do. I would also like to hear _why_ all counters cannot simply behave the same. In other words, why can't we simply increase stat_threshold? Maybe calculate_normal_threshold needs a better scaling for larger machines. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-20 10:12 ` Michal Hocko @ 2017-12-20 10:21 ` kemi -1 siblings, 0 replies; 68+ messages in thread From: kemi @ 2017-12-20 10:21 UTC (permalink / raw) To: Michal Hocko Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On 2017年12月20日 18:12, Michal Hocko wrote: > On Wed 20-12-17 13:52:14, kemi wrote: >> >> >> On 2017年12月19日 20:40, Michal Hocko wrote: >>> On Tue 19-12-17 14:39:24, Kemi Wang wrote: >>>> We have seen significant overhead in cache bouncing caused by NUMA counters >>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm: >>>> update NUMA counter threshold size")' for more details. >>>> >>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals >>>> with global counter update using different threshold size for node page >>>> stats. >>> >>> Again, no numbers. >> >> Compare to vanilla kernel, I don't think it has performance improvement, so >> I didn't post performance data here. >> But, if you would like to see performance gain from enlarging threshold size >> for NUMA stats (compare to the first patch), I will do that later. > > Please do. I would also like to hear _why_ all counters cannot simply > behave same. In other words why we cannot simply increase > stat_threshold? Maybe calculate_normal_threshold needs a better scaling > for larger machines. > Agreed, we may consider that. But unlike NUMA counters, which do not affect system decisions, we need to be very careful when increasing stat_threshold for all the counters on larger machines. BTW, this is another topic that we may discuss in a different thread. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-20 10:12 ` Michal Hocko @ 2017-12-21 8:06 ` kemi -1 siblings, 0 replies; 68+ messages in thread From: kemi @ 2017-12-21 8:06 UTC (permalink / raw) To: Michal Hocko Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On 2017年12月20日 18:12, Michal Hocko wrote: > On Wed 20-12-17 13:52:14, kemi wrote: >> >> >> On 2017年12月19日 20:40, Michal Hocko wrote: >>> On Tue 19-12-17 14:39:24, Kemi Wang wrote: >>>> We have seen significant overhead in cache bouncing caused by NUMA counters >>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm: >>>> update NUMA counter threshold size")' for more details. >>>> >>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals >>>> with global counter update using different threshold size for node page >>>> stats. >>> >>> Again, no numbers. >> >> Compare to vanilla kernel, I don't think it has performance improvement, so >> I didn't post performance data here. >> But, if you would like to see performance gain from enlarging threshold size >> for NUMA stats (compare to the first patch), I will do that later. > > Please do. I would also like to hear _why_ all counters cannot simply > behave same. In other words why we cannot simply increase > stat_threshold? Maybe calculate_normal_threshold needs a better scaling > for larger machines. > I will add this performance data to changelog in V3 patch series. Test machine: 2-sockets skylake platform (112 CPUs, 62G RAM) Benchmark: page_bench03 Description: 112 threads do single page allocation/deallocation in parallel. 
                 before         after (enlarged threshold size)
CPU cycles       722            379 (-47.5%)

Some thoughts on that:
a) the overhead due to cache bouncing caused by NUMA counter updates in the fast path increases severely as the number of CPU cores grows
b) AFAIK, a typical usage scenario (or at least a similar one) that can benefit from this optimization is a 10/40G NIC used in the high-speed data center networks of cloud service providers.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-21 8:06 ` kemi @ 2017-12-21 8:17 ` Michal Hocko -1 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2017-12-21 8:17 UTC (permalink / raw) To: kemi Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Thu 21-12-17 16:06:50, kemi wrote: > > > On 2017年12月20日 18:12, Michal Hocko wrote: > > On Wed 20-12-17 13:52:14, kemi wrote: > >> > >> > >> On 2017年12月19日 20:40, Michal Hocko wrote: > >>> On Tue 19-12-17 14:39:24, Kemi Wang wrote: > >>>> We have seen significant overhead in cache bouncing caused by NUMA counters > >>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm: > >>>> update NUMA counter threshold size")' for more details. > >>>> > >>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals > >>>> with global counter update using different threshold size for node page > >>>> stats. > >>> > >>> Again, no numbers. > >> > >> Compare to vanilla kernel, I don't think it has performance improvement, so > >> I didn't post performance data here. > >> But, if you would like to see performance gain from enlarging threshold size > >> for NUMA stats (compare to the first patch), I will do that later. > > > > Please do. I would also like to hear _why_ all counters cannot simply > > behave same. In other words why we cannot simply increase > > stat_threshold? Maybe calculate_normal_threshold needs a better scaling > > for larger machines. > > > > I will add this performance data to changelog in V3 patch series. > > Test machine: 2-sockets skylake platform (112 CPUs, 62G RAM) > Benchmark: page_bench03 > Description: 112 threads do single page allocation/deallocation in parallel. 
>                  before         after (enlarge threshold size)
> CPU cycles       722            379 (-47.5%)

Please describe the numbers some more. Is this an average? What is the std? Can you see any difference with a more generic workload?

> Some thinking about that:
> a) the overhead due to cache bouncing caused by NUMA counter update in fast path
> severely increase with more and more CPUs cores

What is the effect on a smaller system with fewer CPUs?

> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can
> benefit is 10/40G NIC used in high-speed data center network of cloud service providers.

I would expect those would disable the numa accounting altogether. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-21 8:17 ` Michal Hocko @ 2017-12-21 8:23 ` kemi -1 siblings, 0 replies; 68+ messages in thread From: kemi @ 2017-12-21 8:23 UTC (permalink / raw) To: Michal Hocko Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On 2017年12月21日 16:17, Michal Hocko wrote: > On Thu 21-12-17 16:06:50, kemi wrote: >> >> >> On 2017年12月20日 18:12, Michal Hocko wrote: >>> On Wed 20-12-17 13:52:14, kemi wrote: >>>> >>>> >>>> On 2017年12月19日 20:40, Michal Hocko wrote: >>>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote: >>>>>> We have seen significant overhead in cache bouncing caused by NUMA counters >>>>>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm: >>>>>> update NUMA counter threshold size")' for more details. >>>>>> >>>>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals >>>>>> with global counter update using different threshold size for node page >>>>>> stats. >>>>> >>>>> Again, no numbers. >>>> >>>> Compare to vanilla kernel, I don't think it has performance improvement, so >>>> I didn't post performance data here. >>>> But, if you would like to see performance gain from enlarging threshold size >>>> for NUMA stats (compare to the first patch), I will do that later. >>> >>> Please do. I would also like to hear _why_ all counters cannot simply >>> behave same. In other words why we cannot simply increase >>> stat_threshold? Maybe calculate_normal_threshold needs a better scaling >>> for larger machines. >>> >> >> I will add this performance data to changelog in V3 patch series. 
>>
>> Test machine: 2-sockets skylake platform (112 CPUs, 62G RAM)
>> Benchmark: page_bench03
>> Description: 112 threads do single page allocation/deallocation in parallel.
>>                  before         after (enlarge threshold size)
>> CPU cycles       722            379 (-47.5%)
>
> Please describe the numbers some more. Is this an average?

Yes

> What is the std?

I increased the loop count to 10m, so the std is quite small (repeated 3 times)

> Can you see any difference with a more generic workload?

I didn't see an obvious improvement for will-it-scale.page_fault1. Two reasons for that:
1) the code path is too long
2) severe zone lock and lru lock contention (the buddy system is accessed frequently)

>> Some thinking about that:
>> a) the overhead due to cache bouncing caused by NUMA counter update in fast path
>> severely increase with more and more CPUs cores
>
> What is an effect on a smaller system with fewer CPUs?

Several CPU cycles can be saved using a single thread for that.

>> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can
>> benefit is 10/40G NIC used in high-speed data center network of cloud service providers.
>
> I would expect those would disable the numa accounting altogether.

Yes, but it is still worthwhile to do some optimization, isn't it? ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-21 8:23 ` kemi @ 2017-12-21 8:59 ` Michal Hocko -1 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2017-12-21 8:59 UTC (permalink / raw) To: kemi Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Thu 21-12-17 16:23:23, kemi wrote: > > > On 2017年12月21日 16:17, Michal Hocko wrote: [...] > > Can you see any difference with a more generic workload? > > > > I didn't see obvious improvement for will-it-scale.page_fault1 > Two reasons for that: > 1) too long code path > 2) server zone lock and lru lock contention (access to buddy system frequently) OK. So does the patch help for anything other than a microbenchmark? > >> Some thinking about that: > >> a) the overhead due to cache bouncing caused by NUMA counter update in fast path > >> severely increase with more and more CPUs cores > > > > What is an effect on a smaller system with fewer CPUs? > > > > Several CPU cycles can be saved using single thread for that. > > >> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can > >> benefit is 10/40G NIC used in high-speed data center network of cloud service providers. > > > > I would expect those would disable the numa accounting altogether. > > > > Yes, but it is still worthy to do some optimization, isn't? Ohh, I am not opposing optimizations but you should make sure that they are worth the additional code and special casing. As I've said I am not convinced special casing numa counters is good. You can play with the
Special ones will disable the numa accounting anyway. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-21 8:59 ` Michal Hocko @ 2017-12-21 10:31 ` kemi -1 siblings, 0 replies; 68+ messages in thread From: kemi @ 2017-12-21 10:31 UTC (permalink / raw) To: Michal Hocko Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On 2017年12月21日 16:59, Michal Hocko wrote: > On Thu 21-12-17 16:23:23, kemi wrote: >> >> >> On 2017年12月21日 16:17, Michal Hocko wrote: > [...] >>> Can you see any difference with a more generic workload? >>> >> >> I didn't see obvious improvement for will-it-scale.page_fault1 >> Two reasons for that: >> 1) too long code path >> 2) server zone lock and lru lock contention (access to buddy system frequently) > > OK. So does the patch helps for anything other than a microbenchmark? > >>>> Some thinking about that: >>>> a) the overhead due to cache bouncing caused by NUMA counter update in fast path >>>> severely increase with more and more CPUs cores >>> >>> What is an effect on a smaller system with fewer CPUs? >>> >> >> Several CPU cycles can be saved using single thread for that. >> >>>> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can >>>> benefit is 10/40G NIC used in high-speed data center network of cloud service providers. >>> >>> I would expect those would disable the numa accounting altogether. >>> >> >> Yes, but it is still worthy to do some optimization, isn't? > > Ohh, I am not opposing optimizations but you should make sure that they > are worth the additional code and special casing. As I've said I am not > convinced special casing numa counters is good. 
You can play with the > threshold scaling for larger CPU count but let's make sure that the > benefit is really measurable for normal workloads. Special ones will > disable the numa accounting anyway. > I understood. Could you give me some suggestions for those normal workloads? Thanks. I will have a try and post the data ASAP. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-21 10:31 ` kemi @ 2017-12-22 12:31 ` Michal Hocko -1 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2017-12-22 12:31 UTC (permalink / raw) To: kemi Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Thu 21-12-17 18:31:19, kemi wrote: > > > On 2017年12月21日 16:59, Michal Hocko wrote: > > On Thu 21-12-17 16:23:23, kemi wrote: > >> > >> > >> On 2017年12月21日 16:17, Michal Hocko wrote: > > [...] > >>> Can you see any difference with a more generic workload? > >>> > >> > >> I didn't see obvious improvement for will-it-scale.page_fault1 > >> Two reasons for that: > >> 1) too long code path > >> 2) server zone lock and lru lock contention (access to buddy system frequently) > > > > OK. So does the patch helps for anything other than a microbenchmark? > > > >>>> Some thinking about that: > >>>> a) the overhead due to cache bouncing caused by NUMA counter update in fast path > >>>> severely increase with more and more CPUs cores > >>> > >>> What is an effect on a smaller system with fewer CPUs? > >>> > >> > >> Several CPU cycles can be saved using single thread for that. > >> > >>>> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can > >>>> benefit is 10/40G NIC used in high-speed data center network of cloud service providers. > >>> > >>> I would expect those would disable the numa accounting altogether. > >>> > >> > >> Yes, but it is still worthy to do some optimization, isn't? > > > > Ohh, I am not opposing optimizations but you should make sure that they > > are worth the additional code and special casing. As I've said I am not > > convinced special casing numa counters is good. 
You can play with the > > threshold scaling for larger CPU count but let's make sure that the > > benefit is really measurable for normal workloads. Special ones will > > disable the numa accounting anyway. > > > > I understood. Could you give me some suggestion for those normal workloads, Thanks. > I will have a try and post the data ASAP. Well, to be honest, I am really confused about what your objective for these optimizations is, then. I hope we have agreed that workloads which really need to squeeze every single CPU cycle in the allocation path will simply disable the whole numa stat thing. I haven't yet heard about any use case which would really require numa stats and suffer from the numa stats overhead. I can see some arguments for a better threshold scaling but that requires checking a wider range of tests to show there are no unintended changes. I am not really confident you understand that when you are asking for "those normal workloads". So please, try to step back, rethink who you are optimizing for and act accordingly. If I were you I would repost the first patch which only integrates numa stats because that removes a lot of pointless code and that is a win of its own. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-21 8:06 ` kemi @ 2017-12-21 17:10 ` Christopher Lameter -1 siblings, 0 replies; 68+ messages in thread From: Christopher Lameter @ 2017-12-21 17:10 UTC (permalink / raw) To: kemi Cc: Michal Hocko, Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Thu, 21 Dec 2017, kemi wrote: > Some thinking about that: > a) the overhead due to cache bouncing caused by NUMA counter update in fast path > severely increase with more and more CPUs cores > b) AFAIK, the typical usage scenario (similar at least)for which this optimization can > benefit is 10/40G NIC used in high-speed data center network of cloud service providers. I think you are fighting a lost battle there. As evident from the timing constraints on packet processing in a 10/40G you will have a hard time processing data if the packets are of regular ethernet size. And we already have 100G NICs in operation here. We can try to get the performance as high as possible but full rate high speed networking invariably must use offload mechanisms and thus the statistics would only be available from the hardware devices that can do wire speed processing. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-21 17:10 ` Christopher Lameter @ 2017-12-22 2:06 ` kemi -1 siblings, 0 replies; 68+ messages in thread From: kemi @ 2017-12-22 2:06 UTC (permalink / raw) To: Christopher Lameter Cc: Michal Hocko, Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On 2017年12月22日 01:10, Christopher Lameter wrote: > On Thu, 21 Dec 2017, kemi wrote: > >> Some thinking about that: >> a) the overhead due to cache bouncing caused by NUMA counter update in fast path >> severely increase with more and more CPUs cores >> b) AFAIK, the typical usage scenario (similar at least)for which this optimization can >> benefit is 10/40G NIC used in high-speed data center network of cloud service providers. > > I think you are fighting a lost battle there. As evident from the timing > constraints on packet processing in a 10/40G you will have a hard time to > process data if the packets are of regular ethernet size. And we alrady > have 100G NICs in operation here. > Not really. For 10/40G NIC or even 100G, I admit DPDK is widely used in data center network rather than kernel driver in production environment. That's due to the slow page allocator and long pipeline processing in network protocol stack. It's not easy to change this state in a short time, but if we can do something here to change it a little, why not? > We can try to get the performance as high as possible but full rate high > speed networking invariable must use offload mechanisms and thus the > statistics would only be available from the hardware devices that can do > wire speed processing. > I think you may be talking about something like SmartNIC (e.g. OpenVswitch offload + VF pass through). 
That's usually used in a virtualization environment to eliminate the overhead from device emulation and packet processing in a software virtual switch (OVS or Linux bridge). What I have done in this patch series is to improve page allocator performance; that's also helpful in an offload environment (guest kernel at least), IMHO. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size 2017-12-22 2:06 ` kemi @ 2017-12-26 19:05 ` Christopher Lameter -1 siblings, 0 replies; 68+ messages in thread From: Christopher Lameter @ 2017-12-26 19:05 UTC (permalink / raw) To: kemi Cc: Michal Hocko, Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Fri, 22 Dec 2017, kemi wrote: > > I think you are fighting a lost battle there. As evident from the timing > > constraints on packet processing in a 10/40G you will have a hard time to > > process data if the packets are of regular ethernet size. And we alrady > > have 100G NICs in operation here. > > > > Not really. > For 10/40G NIC or even 100G, I admit DPDK is widely used in data center network > rather than kernel driver in production environment. Shudder. I would rather have a user-space API that is vendor neutral and that allows the use of multiple NICs. The Linux kernel has an RDMA subsystem that does just that. But time budget is difficult to deal with even using RDMA or DPDK where we can avoid the OS overhead. > That's due to the slow page allocator and long pipeline processing in network > protocol stack. Right, the timing budget there for processing a single packet gets below a microsecond at some point and there it's going to be difficult to do much. Some aggregation / offloading is required and that increases as speeds become higher. > That's not easy to change this state in short time, but if we can do something > here to change it a little, why not. How much of an improvement is this going to be? If it is significant then by all means lets do it. 
> > We can try to get the performance as high as possible but full rate high > > speed networking invariable must use offload mechanisms and thus the > > statistics would only be available from the hardware devices that can do > > wire speed processing. > > > > I think you may be talking something about SmartNIC (e.g. OpenVswitch offload + > VF pass through). That's usually used in virtualization environment to eliminate > the overhead from device emulation and packet processing in software virtual > switch(OVS or linux bridge). The switch offloads can also be used elsewhere. Also the RDMA subsystem has counters like that. ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation 2017-12-19 6:39 ` Kemi Wang @ 2017-12-19 6:39 ` Kemi Wang -1 siblings, 0 replies; 68+ messages in thread From: Kemi Wang @ 2017-12-19 6:39 UTC (permalink / raw) To: Greg Kroah-Hartman, Andrew Morton, Michal Hocko, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Kemi Wang, Linux MM, Linux Kernel To avoid deviation, this patch uses node_page_state_snapshot instead of node_page_state for node page stats query. e.g. cat /proc/zoneinfo cat /sys/devices/system/node/node*/vmstat cat /sys/devices/system/node/node*/numastat As it is a slow path and would not be read frequently, I would not worry about it. Signed-off-by: Kemi Wang <kemi.wang@intel.com> --- drivers/base/node.c | 17 ++++++++++------- mm/vmstat.c | 2 +- 2 files changed, 11 insertions(+), 8 deletions(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index a045ea1..cf303f8 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev, "interleave_hit %lu\n" "local_node %lu\n" "other_node %lu\n", - node_page_state(NODE_DATA(dev->id), NUMA_HIT), - node_page_state(NODE_DATA(dev->id), NUMA_MISS), - node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN), - node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT), - node_page_state(NODE_DATA(dev->id), NUMA_LOCAL), - node_page_state(NODE_DATA(dev->id), NUMA_OTHER)); + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT), + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS), + node_page_state_snapshot(NODE_DATA(dev->id), + NUMA_FOREIGN), + node_page_state_snapshot(NODE_DATA(dev->id), + NUMA_INTERLEAVE_HIT), + node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL), + node_page_state_snapshot(NODE_DATA(dev->id), 
+ NUMA_OTHER)); } static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL); @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev, for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) n += sprintf(buf+n, "%s %lu\n", vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], - node_page_state(pgdat, i)); + node_page_state_snapshot(pgdat, i)); return n; } diff --git a/mm/vmstat.c b/mm/vmstat.c index 64e08ae..d65f28d 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { seq_printf(m, "\n %-12s %lu", vmstat_text[i + NR_VM_ZONE_STAT_ITEMS], - node_page_state(pgdat, i)); + node_page_state_snapshot(pgdat, i)); } } seq_printf(m, -- 2.7.4 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation
  2017-12-19  6:39 ` Kemi Wang
@ 2017-12-19 12:43 ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-12-19 12:43 UTC (permalink / raw)
To: Kemi Wang
Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
    Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
    Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
    Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
    Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
    Linux MM, Linux Kernel

On Tue 19-12-17 14:39:25, Kemi Wang wrote:
> To avoid deviation, this patch uses node_page_state_snapshot instead of
> node_page_state for node page stats query.
> e.g. cat /proc/zoneinfo
>      cat /sys/devices/system/node/node*/vmstat
>      cat /sys/devices/system/node/node*/numastat
>
> As it is a slow path and would not be read frequently, I would worry about
> it.

The changelog doesn't explain why these counters need any special
treatment. _snapshot variants were used only for internal handling
where the precision really mattered. We do not have any in-tree user, and
Jack has removed this in
http://lkml.kernel.org/r/20171122094416.26019-1-jack@suse.cz
which is already sitting in the mmotm tree. We can re-add it, but that
would really require a _very good_ reason.

> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>  drivers/base/node.c | 17 ++++++++++-------
>  mm/vmstat.c | 2 +-
>  2 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index a045ea1..cf303f8 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>  		       "interleave_hit %lu\n"
>  		       "local_node %lu\n"
>  		       "other_node %lu\n",
> -		       node_page_state(NODE_DATA(dev->id), NUMA_HIT),
> -		       node_page_state(NODE_DATA(dev->id), NUMA_MISS),
> -		       node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
> -		       node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
> -		       node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
> -		       node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
> +		       node_page_state_snapshot(NODE_DATA(dev->id),
> +						NUMA_FOREIGN),
> +		       node_page_state_snapshot(NODE_DATA(dev->id),
> +						NUMA_INTERLEAVE_HIT),
> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
> +		       node_page_state_snapshot(NODE_DATA(dev->id),
> +						NUMA_OTHER));
> }
>
> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>  		n += sprintf(buf+n, "%s %lu\n",
>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> -			     node_page_state(pgdat, i));
> +			     node_page_state_snapshot(pgdat, i));
>
>  	return n;
> }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 64e08ae..d65f28d 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>  		seq_printf(m, "\n %-12s %lu",
>  			   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> -			   node_page_state(pgdat, i));
> +			   node_page_state_snapshot(pgdat, i));
>  	}
> }
> seq_printf(m,
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation
  2017-12-19 12:43 ` Michal Hocko
@ 2017-12-20  6:07 ` kemi
  -1 siblings, 0 replies; 68+ messages in thread
From: kemi @ 2017-12-20 6:07 UTC (permalink / raw)
To: Michal Hocko
Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
    Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
    Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
    Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
    Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
    Linux MM, Linux Kernel

On 2017年12月19日 20:43, Michal Hocko wrote:
> On Tue 19-12-17 14:39:25, Kemi Wang wrote:
>> To avoid deviation, this patch uses node_page_state_snapshot instead of
>> node_page_state for node page stats query.
>> e.g. cat /proc/zoneinfo
>>      cat /sys/devices/system/node/node*/vmstat
>>      cat /sys/devices/system/node/node*/numastat
>>
>> As it is a slow path and would not be read frequently, I would worry about
>> it.
>
> The changelog doesn't explain why these counters needs any special
> treatment. _snapshot variants where used only for internal handling
> where the precision really mattered. We do not have any in-tree user and
> Jack has removed this by
> http://lkml.kernel.org/r/20171122094416.26019-1-jack@suse.cz
> which is already sitting in the mmotm tree. We can re-add it but that
> would really require a _very good_ reason.
>

Assume we have *nr* cpus and the threshold size is *t*. Then the maximum
deviation is nr * t. Current Skylake platforms already have hundreds of
CPUs, and the number is still increasing. Also, even though the threshold
size is kept to 125 at maximum for regular counters (it is 32765 for NUMA
counters now), the deviation is a little too big, as I mentioned in the
log. I tend to sum up the numbers in the local cpus when querying the
global stats.

Also, node_page_state_snapshot is only called in slow paths, and I don't
think that would be a big problem.

Anyway, it is a matter of taste. I just think it's better to have.

>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>> ---
>>  drivers/base/node.c | 17 ++++++++++-------
>>  mm/vmstat.c | 2 +-
>>  2 files changed, 11 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index a045ea1..cf303f8 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>>  		       "interleave_hit %lu\n"
>>  		       "local_node %lu\n"
>>  		       "other_node %lu\n",
>> -		       node_page_state(NODE_DATA(dev->id), NUMA_HIT),
>> -		       node_page_state(NODE_DATA(dev->id), NUMA_MISS),
>> -		       node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
>> -		       node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
>> -		       node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
>> -		       node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
>> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
>> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
>> +		       node_page_state_snapshot(NODE_DATA(dev->id),
>> +						NUMA_FOREIGN),
>> +		       node_page_state_snapshot(NODE_DATA(dev->id),
>> +						NUMA_INTERLEAVE_HIT),
>> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
>> +		       node_page_state_snapshot(NODE_DATA(dev->id),
>> +						NUMA_OTHER));
>> }
>>
>> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>>  		n += sprintf(buf+n, "%s %lu\n",
>>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> -			     node_page_state(pgdat, i));
>> +			     node_page_state_snapshot(pgdat, i));
>>
>>  	return n;
>> }
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 64e08ae..d65f28d 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>>  		seq_printf(m, "\n %-12s %lu",
>>  			   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>> -			   node_page_state(pgdat, i));
>> +			   node_page_state_snapshot(pgdat, i));
>>  	}
>> }
>> seq_printf(m,
>> --
>> 2.7.4
>>
>

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation
  2017-12-20  6:07 ` kemi
@ 2017-12-20 10:06 ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-12-20 10:06 UTC (permalink / raw)
To: kemi
Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
    Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
    Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
    Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
    Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
    Linux MM, Linux Kernel

On Wed 20-12-17 14:07:35, kemi wrote:
>
> On 2017年12月19日 20:43, Michal Hocko wrote:
> > On Tue 19-12-17 14:39:25, Kemi Wang wrote:
> >> To avoid deviation, this patch uses node_page_state_snapshot instead of
> >> node_page_state for node page stats query.
> >> e.g. cat /proc/zoneinfo
> >>      cat /sys/devices/system/node/node*/vmstat
> >>      cat /sys/devices/system/node/node*/numastat
> >>
> >> As it is a slow path and would not be read frequently, I would worry about
> >> it.
> >
> > The changelog doesn't explain why these counters needs any special
> > treatment. _snapshot variants where used only for internal handling
> > where the precision really mattered. We do not have any in-tree user and
> > Jack has removed this by
> > http://lkml.kernel.org/r/20171122094416.26019-1-jack@suse.cz
> > which is already sitting in the mmotm tree. We can re-add it but that
> > would really require a _very good_ reason.
> >
>
> Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum deviation is nr*t.
> Currently, Skylake platform has hundreds of CPUs numbers and the number is still
> increasing. Also, even the threshold size is kept to 125 at maximum (32765
> for NUMA counters now), the deviation is just a little too big as I have mentioned in
> the log. I tend to sum the number in local cpus up when query the global stats.

This is a general problem of pcp accounting. So if it needs to be
addressed then do it the same way for all stats.

> Also, node_page_state_snapshot is only called in slow path and I don't think that
> would be a big problem.
>
> Anyway, it is a matter of taste. I just think it's better to have.

You are making numastats special, and I haven't yet heard any sound
arguments for that. But that should be discussed in the respective
patch.

> >> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> >> ---
> >>  drivers/base/node.c | 17 ++++++++++-------
> >>  mm/vmstat.c | 2 +-
> >>  2 files changed, 11 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/base/node.c b/drivers/base/node.c
> >> index a045ea1..cf303f8 100644
> >> --- a/drivers/base/node.c
> >> +++ b/drivers/base/node.c
> >> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
> >>  		       "interleave_hit %lu\n"
> >>  		       "local_node %lu\n"
> >>  		       "other_node %lu\n",
> >> -		       node_page_state(NODE_DATA(dev->id), NUMA_HIT),
> >> -		       node_page_state(NODE_DATA(dev->id), NUMA_MISS),
> >> -		       node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
> >> -		       node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
> >> -		       node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
> >> -		       node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
> >> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
> >> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
> >> +		       node_page_state_snapshot(NODE_DATA(dev->id),
> >> +						NUMA_FOREIGN),
> >> +		       node_page_state_snapshot(NODE_DATA(dev->id),
> >> +						NUMA_INTERLEAVE_HIT),
> >> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
> >> +		       node_page_state_snapshot(NODE_DATA(dev->id),
> >> +						NUMA_OTHER));
> >> }
> >>
> >> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
> >> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
> >>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> >>  		n += sprintf(buf+n, "%s %lu\n",
> >>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> >> -			     node_page_state(pgdat, i));
> >> +			     node_page_state_snapshot(pgdat, i));
> >>
> >>  	return n;
> >> }
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index 64e08ae..d65f28d 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> >>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> >>  		seq_printf(m, "\n %-12s %lu",
> >>  			   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
> >> -			   node_page_state(pgdat, i));
> >> +			   node_page_state_snapshot(pgdat, i));
> >>  	}
> >> }
> >> seq_printf(m,
> >> --
> >> 2.7.4
> >>
> >

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation
  2017-12-20 10:06 ` Michal Hocko
@ 2017-12-20 10:24 ` kemi
  -1 siblings, 0 replies; 68+ messages in thread
From: kemi @ 2017-12-20 10:24 UTC (permalink / raw)
To: Michal Hocko
Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman,
    Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU,
    Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes,
    Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen,
    Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li,
    Linux MM, Linux Kernel

On 2017年12月20日 18:06, Michal Hocko wrote:
> On Wed 20-12-17 14:07:35, kemi wrote:
>>
>> On 2017年12月19日 20:43, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:25, Kemi Wang wrote:
>>>> To avoid deviation, this patch uses node_page_state_snapshot instead of
>>>> node_page_state for node page stats query.
>>>> e.g. cat /proc/zoneinfo
>>>>      cat /sys/devices/system/node/node*/vmstat
>>>>      cat /sys/devices/system/node/node*/numastat
>>>>
>>>> As it is a slow path and would not be read frequently, I would worry about
>>>> it.
>>>
>>> The changelog doesn't explain why these counters needs any special
>>> treatment. _snapshot variants where used only for internal handling
>>> where the precision really mattered. We do not have any in-tree user and
>>> Jack has removed this by
>>> http://lkml.kernel.org/r/20171122094416.26019-1-jack@suse.cz
>>> which is already sitting in the mmotm tree. We can re-add it but that
>>> would really require a _very good_ reason.
>>>
>>
>> Assume we have *nr* cpus, and threshold size is *t*. Thus, the maximum deviation is nr*t.
>> Currently, Skylake platform has hundreds of CPUs numbers and the number is still
>> increasing. Also, even the threshold size is kept to 125 at maximum (32765
>> for NUMA counters now), the deviation is just a little too big as I have mentioned in
>> the log. I tend to sum the number in local cpus up when query the global stats.
>
> This is a general problem of pcp accounting. So if it needs to be
> addressed then do it the same way for all stats.
>
>> Also, node_page_state_snapshot is only called in slow path and I don't think that
>> would be a big problem.
>>
>> Anyway, it is a matter of taste. I just think it's better to have.
>
> You are making numastats special and I yet haven't heard any sounds
> arguments for that. But that should be discussed in the respective
> patch.
>

That is because we have a much larger threshold size for NUMA counters,
which means a larger deviation. So the numbers in the local cpus may not
simply be ignored.

>>>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>>>> ---
>>>>  drivers/base/node.c | 17 ++++++++++-------
>>>>  mm/vmstat.c | 2 +-
>>>>  2 files changed, 11 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index a045ea1..cf303f8 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -169,12 +169,15 @@ static ssize_t node_read_numastat(struct device *dev,
>>>>  		       "interleave_hit %lu\n"
>>>>  		       "local_node %lu\n"
>>>>  		       "other_node %lu\n",
>>>> -		       node_page_state(NODE_DATA(dev->id), NUMA_HIT),
>>>> -		       node_page_state(NODE_DATA(dev->id), NUMA_MISS),
>>>> -		       node_page_state(NODE_DATA(dev->id), NUMA_FOREIGN),
>>>> -		       node_page_state(NODE_DATA(dev->id), NUMA_INTERLEAVE_HIT),
>>>> -		       node_page_state(NODE_DATA(dev->id), NUMA_LOCAL),
>>>> -		       node_page_state(NODE_DATA(dev->id), NUMA_OTHER));
>>>> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_HIT),
>>>> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_MISS),
>>>> +		       node_page_state_snapshot(NODE_DATA(dev->id),
>>>> +						NUMA_FOREIGN),
>>>> +		       node_page_state_snapshot(NODE_DATA(dev->id),
>>>> +						NUMA_INTERLEAVE_HIT),
>>>> +		       node_page_state_snapshot(NODE_DATA(dev->id), NUMA_LOCAL),
>>>> +		       node_page_state_snapshot(NODE_DATA(dev->id),
>>>> +						NUMA_OTHER));
>>>> }
>>>>
>>>> static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>>>> @@ -194,7 +197,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>>>>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>>>>  		n += sprintf(buf+n, "%s %lu\n",
>>>>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>>> -			     node_page_state(pgdat, i));
>>>> +			     node_page_state_snapshot(pgdat, i));
>>>>
>>>>  	return n;
>>>> }
>>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>>> index 64e08ae..d65f28d 100644
>>>> --- a/mm/vmstat.c
>>>> +++ b/mm/vmstat.c
>>>> @@ -1466,7 +1466,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>>>>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
>>>>  		seq_printf(m, "\n %-12s %lu",
>>>>  			   vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>>> -			   node_page_state(pgdat, i));
>>>> +			   node_page_state_snapshot(pgdat, i));
>>>>  	}
>>>> }
>>>> seq_printf(m,
>>>> --
>>>> 2.7.4
>>>>
>>>
>

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation 2017-12-20 10:24 ` kemi @ 2017-12-20 15:58 ` Christopher Lameter -1 siblings, 0 replies; 68+ messages in thread From: Christopher Lameter @ 2017-12-20 15:58 UTC (permalink / raw) To: kemi Cc: Michal Hocko, Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Wed, 20 Dec 2017, kemi wrote: > > You are making numastats special and I haven't yet heard any sound > > arguments for that. But that should be discussed in the respective > > patch. > > > > That is because we use a much larger threshold size for NUMA counters, which means a larger > deviation. So the numbers in the local CPUs cannot simply be ignored. Some numbers showing the effect of these changes would be helpful. You can probably create some in-kernel synthetic tests to start with, which would allow you to see any significant effects of those changes. Then run the larger test suites (e.g. those that Mel has published) and benchmarks to figure out how the behavior of real apps *may* change? ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH v2 4/5] mm: use node_page_state_snapshot to avoid deviation 2017-12-20 15:58 ` Christopher Lameter @ 2017-12-21 1:39 ` kemi -1 siblings, 0 replies; 68+ messages in thread From: kemi @ 2017-12-21 1:39 UTC (permalink / raw) To: Christopher Lameter Cc: Michal Hocko, Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On 2017-12-20 23:58, Christopher Lameter wrote: > On Wed, 20 Dec 2017, kemi wrote: > >>> You are making numastats special and I haven't yet heard any sound >>> arguments for that. But that should be discussed in the respective >>> patch. >>> >> >> That is because we use a much larger threshold size for NUMA counters, which means a larger >> deviation. So the numbers in the local CPUs cannot simply be ignored. > > Some numbers showing the effect of these changes would be helpful. You can > probably create some in-kernel synthetic tests to start with, which would > allow you to see any significant effects of those changes. > > Then run the larger test suites (e.g. those that Mel has published) and > benchmarks to figure out how the behavior of real apps *may* change? > OK, I will do that when the numbers are available. Let's drop this patch from the series and address the issue in a separate patch. ^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH v2 5/5] mm: Rename zone_statistics() to numa_statistics() 2017-12-19 6:39 ` Kemi Wang @ 2017-12-19 6:39 ` Kemi Wang -1 siblings, 0 replies; 68+ messages in thread From: Kemi Wang @ 2017-12-19 6:39 UTC (permalink / raw) To: Greg Kroah-Hartman, Andrew Morton, Michal Hocko, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior Cc: Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Kemi Wang, Linux MM, Linux Kernel zone_statistics() only updates NUMA counters, and NUMA statistics have been separated from the zone statistics framework, so the function name is now confusing. Rename it to numa_statistics() and update its call sites accordingly. Signed-off-by: Kemi Wang <kemi.wang@intel.com> --- mm/page_alloc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 81e8d8f..f7583de 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2790,7 +2790,7 @@ int __isolate_free_page(struct page *page, unsigned int order) * * Must be called with interrupts disabled.
*/ -static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) +static inline void numa_statistics(struct zone *preferred_zone, struct zone *z) { #ifdef CONFIG_NUMA int preferred_nid = preferred_zone->node; @@ -2854,7 +2854,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, page = __rmqueue_pcplist(zone, migratetype, pcp, list); if (page) { __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); - zone_statistics(preferred_zone, zone); + numa_statistics(preferred_zone, zone); } local_irq_restore(flags); return page; @@ -2902,7 +2902,7 @@ struct page *rmqueue(struct zone *preferred_zone, get_pcppage_migratetype(page)); __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); - zone_statistics(preferred_zone, zone); + numa_statistics(preferred_zone, zone); local_irq_restore(flags); out: -- 2.7.4 ^ permalink raw reply related [flat|nested] 68+ messages in thread
* Re: [PATCH v2 5/5] mm: Rename zone_statistics() to numa_statistics() 2017-12-19 6:39 ` Kemi Wang @ 2017-12-19 12:44 ` Michal Hocko -1 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2017-12-19 12:44 UTC (permalink / raw) To: Kemi Wang Cc: Greg Kroah-Hartman, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Christopher Lameter, YASUAKI ISHIMATSU, Andrey Ryabinin, Nikolay Borisov, Pavel Tatashin, David Rientjes, Sebastian Andrzej Siewior, Dave, Andi Kleen, Tim Chen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Aubrey Li, Linux MM, Linux Kernel On Tue 19-12-17 14:39:26, Kemi Wang wrote: > Since the functionality of zone_statistics() updates numa counters, but > numa statistics has been separated from zone statistics framework. Thus, > the function name makes people confused. So, change the name to > numa_statistics() as well as its call sites accordingly. > > Signed-off-by: Kemi Wang <kemi.wang@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> > --- > mm/page_alloc.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 81e8d8f..f7583de 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2790,7 +2790,7 @@ int __isolate_free_page(struct page *page, unsigned int order) > * > * Must be called with interrupts disabled. 
> */ > -static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) > +static inline void numa_statistics(struct zone *preferred_zone, struct zone *z) > { > #ifdef CONFIG_NUMA > int preferred_nid = preferred_zone->node; > @@ -2854,7 +2854,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, > page = __rmqueue_pcplist(zone, migratetype, pcp, list); > if (page) { > __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); > - zone_statistics(preferred_zone, zone); > + numa_statistics(preferred_zone, zone); > } > local_irq_restore(flags); > return page; > @@ -2902,7 +2902,7 @@ struct page *rmqueue(struct zone *preferred_zone, > get_pcppage_migratetype(page)); > > __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); > - zone_statistics(preferred_zone, zone); > + numa_statistics(preferred_zone, zone); > local_irq_restore(flags); > > out: > -- > 2.7.4 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread