Linux-rt-users Archive on lore.kernel.org
* [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead
@ 2021-04-07 20:24 Mel Gorman
  2021-04-07 20:24 ` [PATCH 01/11] mm/page_alloc: Split per cpu page lists and zone stats Mel Gorman
                   ` (11 more replies)
  0 siblings, 12 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

For MM people, the whole series is relevant but patch 3 needs particular
attention for memory hotremove as I had problems testing it because full
zone removal always failed for me. For RT people, the most interesting
patches are 2, 9 and 10 with 2 being the most important.

This series requires patches in Andrew's tree so for convenience, it's also available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v2r10

The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock which is inconvenient and
causes some issues. For example, the PCP list and vmstat share the same
per-cpu space meaning that it's possible that vmstat updates dirty cache
lines holding per-cpu lists across CPUs unless padding is used.  Second,
PREEMPT_RT does not want IRQs disabled in the page allocator because it
keeps IRQs disabled unnecessarily for too long.

This series splits the locking requirements and uses lock types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats and reduces the time when IRQs need to be disabled on !PREEMPT_RT
kernels.

Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst

   local_irq_disable();
   raw_spin_lock(&lock);

The page allocator does not use raw_spin_lock, but using local_irq_save
is undesirable on PREEMPT_RT as it leaves IRQs disabled for an excessive
length of time. By converting to local_lock, which disables migration on
PREEMPT_RT, the locking requirements can be separated so that the
protections for the PCP, stats and the zone lock can start moving to
PREEMPT_RT-safe equivalents. As a bonus, local_lock also means that
PROVE_LOCKING does something useful.
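
As a reference point, the locking pattern that patch 2 introduces looks
roughly like the following (a condensed sketch of the hunks in that patch,
not a complete listing):

   /* mm/page_alloc.c: per-CPU lock protecting the per_cpu_pages lists */
   struct pagesets {
           local_lock_t lock;
   };
   static DEFINE_PER_CPU(struct pagesets, pagesets) = {
           .lock = INIT_LOCAL_LOCK(lock),
   };

   static void drain_pages_zone(unsigned int cpu, struct zone *zone)
   {
           unsigned long flags;
           struct per_cpu_pages *pcp;

           /*
            * Was local_irq_save(flags). On !PREEMPT_RT this still disables
            * IRQs; on PREEMPT_RT it becomes a per-CPU lock that leaves IRQs
            * enabled and disables migration instead.
            */
           local_lock_irqsave(&pagesets.lock, flags);
           pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
           if (pcp->count)
                   free_pcppages_bulk(zone, pcp->count, pcp);
           local_unlock_irqrestore(&pagesets.lock, flags);
   }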

After that, it was very obvious that zone_statistics in particular has
far too much overhead and leaves IRQs disabled for longer than necessary
on !PREEMPT_RT kernels. zone_statistics uses perfectly accurate counters
that require IRQs to be disabled for parallel RMW sequences when inaccurate
ones like vm_events would do. The series converts the NUMA statistics
(NUMA_HIT and friends) into inaccurate counters that then require no
special protection on !PREEMPT_RT.
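
To illustrate the cost difference, the old NUMA counter path kept an exact
per-cpu delta that had to be folded under a threshold, while the new path
(patch 4) simply bumps a per-cpu event that is only folded when the counters
are read. Condensed from the patches below:

   /* Before: exact counter, parallel RMW must be protected */
   void __inc_numa_state(struct zone *zone, enum numa_stat_item item)
   {
           struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
           u16 __percpu *p = pzstats->vm_numa_stat_diff + item;
           u16 v;

           v = __this_cpu_inc_return(*p);
           if (unlikely(v > NUMA_STATS_THRESHOLD)) {
                   zone_numa_state_add(v, zone, item);
                   __this_cpu_write(*p, 0);
           }
   }

   /* After: inexact event, no special protection needed on !PREEMPT_RT */
   void __count_numa_event(struct zone *zone, enum numa_stat_item item)
   {
           struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;

           raw_cpu_inc(pzstats->vm_numa_event[item]);
   }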

The bulk page allocator can then do stat updates in bulk with IRQs enabled,
which should improve efficiency. Technically, this could have been done
without the local_lock and vmstat conversion work; the order simply
reflects the timing of when the different series were implemented.

Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save, so
the locking scope is a bit clearer. The one exception is that modifying
NR_FREE_PAGES still happens in places where IRQs are known to be disabled,
as it's harmless for PREEMPT_RT and would be expensive to split the
locking there.

No performance data is included because, despite the overhead of the stats,
it's within the noise for most workloads on !PREEMPT_RT. However, Jesper
Dangaard Brouer ran a page allocation microbenchmark on an E5-1650 v4 @
3.60GHz CPU on the first version of this series. Focusing on the array
variant of the bulk page allocator reveals the following.

(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

         Baseline        Patched
 1       56.383          54.225 (+3.83%)
 2       40.047          35.492 (+11.38%)
 3       37.339          32.643 (+12.58%)
 4       35.578          30.992 (+12.89%)
 8       33.592          29.606 (+11.87%)
 16      32.362          28.532 (+11.85%)
 32      31.476          27.728 (+11.91%)
 64      30.633          27.252 (+11.04%)
 128     30.596          27.090 (+11.46%)

While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.

 drivers/base/node.c    |  18 +--
 include/linux/mmzone.h |  29 ++--
 include/linux/vmstat.h |  65 +++++----
 mm/internal.h          |   2 +-
 mm/memory_hotplug.c    |  10 +-
 mm/mempolicy.c         |   2 +-
 mm/page_alloc.c        | 297 ++++++++++++++++++++++++-----------------
 mm/vmstat.c            | 250 ++++++++++++----------------------
 8 files changed, 339 insertions(+), 334 deletions(-)

-- 
2.26.2



* [PATCH 01/11] mm/page_alloc: Split per cpu page lists and zone stats
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-12 17:43   ` Vlastimil Babka
  2021-04-07 20:24 ` [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock Mel Gorman
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
in the same struct per_cpu_pages even though vmstats have no direct impact
on the per-cpu page lists. This is inconsistent because the vmstats for a
node are stored in a dedicated structure. The bigger issue is that the
per_cpu_pages structure is not cache-aligned and stat updates either cause
cache conflicts with adjacent per-cpu lists, incurring a runtime cost, or
require padding, incurring a memory cost.

This patch splits the per-cpu pagelists and the vmstat deltas into separate
structures. It's mostly a mechanical conversion but some variable renaming
is done to clearly distinguish the per-cpu pages structure (pcp) from
the vmstats (pzstats).

Superficially, this appears to increase the size of the per_cpu_pages
structure but the movement of expire fills a structure hole so there is
no impact overall.
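
For reference, the end result of the split (condensed from the diff below)
is two independent per-cpu structures:

   /* pcp: the per-cpu page lists, reached via zone->per_cpu_pageset */
   struct per_cpu_pages {
           int count;              /* number of pages in the list */
           int high;               /* high watermark, emptying needed */
           int batch;              /* chunk size for buddy add/remove */
   #ifdef CONFIG_NUMA
           int expire;             /* When 0, remote pagesets are drained */
   #endif
           struct list_head lists[MIGRATE_PCPTYPES];
   };

   /* pzstats: the per-cpu vmstat deltas, reached via zone->per_cpu_zonestats */
   struct per_cpu_zonestat {
   #ifdef CONFIG_SMP
           s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
           s8 stat_threshold;
   #endif
   #ifdef CONFIG_NUMA
           u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
   #endif
   };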

[lkp@intel.com: Check struct per_cpu_zonestat has a non-zero size]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h | 18 ++++----
 include/linux/vmstat.h |  8 ++--
 mm/page_alloc.c        | 84 +++++++++++++++++++-----------------
 mm/vmstat.c            | 96 ++++++++++++++++++++++--------------------
 4 files changed, 110 insertions(+), 96 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..a4393ac27336 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -341,20 +341,21 @@ struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
+#ifdef CONFIG_NUMA
+	int expire;		/* When 0, remote pagesets are drained */
+#endif
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[MIGRATE_PCPTYPES];
 };
 
-struct per_cpu_pageset {
-	struct per_cpu_pages pcp;
-#ifdef CONFIG_NUMA
-	s8 expire;
-	u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
-#endif
+struct per_cpu_zonestat {
 #ifdef CONFIG_SMP
-	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
+	s8 stat_threshold;
+#endif
+#ifdef CONFIG_NUMA
+	u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
 #endif
 };
 
@@ -470,7 +471,8 @@ struct zone {
 	int node;
 #endif
 	struct pglist_data	*zone_pgdat;
-	struct per_cpu_pageset __percpu *pageset;
+	struct per_cpu_pages	__percpu *per_cpu_pageset;
+	struct per_cpu_zonestat	__percpu *per_cpu_zonestats;
 	/*
 	 * the high and batch values are copied to individual pagesets for
 	 * faster access
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 506d625163a1..1736ea9d24a7 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -163,7 +163,7 @@ static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
 	int cpu;
 
 	for_each_online_cpu(cpu)
-		x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
+		x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_stat_diff[item];
 
 	return x;
 }
@@ -236,7 +236,7 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 #ifdef CONFIG_SMP
 	int cpu;
 	for_each_online_cpu(cpu)
-		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
+		x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];
 
 	if (x < 0)
 		x = 0;
@@ -291,7 +291,7 @@ struct ctl_table;
 int vmstat_refresh(struct ctl_table *, int write, void *buffer, size_t *lenp,
 		loff_t *ppos);
 
-void drain_zonestat(struct zone *zone, struct per_cpu_pageset *);
+void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *);
 
 int calculate_pressure_threshold(struct zone *zone);
 int calculate_normal_threshold(struct zone *zone);
@@ -399,7 +399,7 @@ static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
-			struct per_cpu_pageset *pset) { }
+			struct per_cpu_zonestat *pzstats) { }
 #endif		/* CONFIG_SMP */
 
 static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5e8aedb64b57..a68bacddcae0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2981,15 +2981,14 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
 	unsigned long flags;
-	struct per_cpu_pageset *pset;
 	struct per_cpu_pages *pcp;
 
 	local_irq_save(flags);
-	pset = per_cpu_ptr(zone->pageset, cpu);
 
-	pcp = &pset->pcp;
+	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 	if (pcp->count)
 		free_pcppages_bulk(zone, pcp->count, pcp);
+
 	local_irq_restore(flags);
 }
 
@@ -3088,7 +3087,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	 * disables preemption as part of its processing
 	 */
 	for_each_online_cpu(cpu) {
-		struct per_cpu_pageset *pcp;
+		struct per_cpu_pages *pcp;
 		struct zone *z;
 		bool has_pcps = false;
 
@@ -3099,13 +3098,13 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 			 */
 			has_pcps = true;
 		} else if (zone) {
-			pcp = per_cpu_ptr(zone->pageset, cpu);
-			if (pcp->pcp.count)
+			pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+			if (pcp->count)
 				has_pcps = true;
 		} else {
 			for_each_populated_zone(z) {
-				pcp = per_cpu_ptr(z->pageset, cpu);
-				if (pcp->pcp.count) {
+				pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
+				if (pcp->count) {
 					has_pcps = true;
 					break;
 				}
@@ -3235,7 +3234,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
 		migratetype = MIGRATE_MOVABLE;
 	}
 
-	pcp = &this_cpu_ptr(zone->pageset)->pcp;
+	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list_add(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
 	if (pcp->count >= READ_ONCE(pcp->high))
@@ -3451,7 +3450,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	unsigned long flags;
 
 	local_irq_save(flags);
-	pcp = &this_cpu_ptr(zone->pageset)->pcp;
+	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
 	if (page) {
@@ -5054,7 +5053,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 
 	/* Attempt the batch allocation */
 	local_irq_save(flags);
-	pcp = &this_cpu_ptr(zone->pageset)->pcp;
+	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp_list = &pcp->lists[ac.migratetype];
 
 	while (nr_populated < nr_pages) {
@@ -5667,7 +5666,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			continue;
 
 		for_each_online_cpu(cpu)
-			free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count;
+			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
 	}
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
@@ -5759,7 +5758,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 		free_pcp = 0;
 		for_each_online_cpu(cpu)
-			free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count;
+			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
 
 		show_node(zone);
 		printk(KERN_CONT
@@ -5800,7 +5799,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(zone_page_state(zone, NR_MLOCK)),
 			K(zone_page_state(zone, NR_BOUNCE)),
 			K(free_pcp),
-			K(this_cpu_read(zone->pageset->pcp.count)),
+			K(this_cpu_read(zone->per_cpu_pageset->count)),
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
@@ -6127,11 +6126,12 @@ static void build_zonelists(pg_data_t *pgdat)
  * not check if the processor is online before following the pageset pointer.
  * Other parts of the kernel may not check if the zone is available.
  */
-static void pageset_init(struct per_cpu_pageset *p);
+static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats);
 /* These effectively disable the pcplists in the boot pageset completely */
 #define BOOT_PAGESET_HIGH	0
 #define BOOT_PAGESET_BATCH	1
-static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
+static DEFINE_PER_CPU(struct per_cpu_pages, boot_pageset);
+static DEFINE_PER_CPU(struct per_cpu_zonestat, boot_zonestats);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
 static void __build_all_zonelists(void *data)
@@ -6198,7 +6198,7 @@ build_all_zonelists_init(void)
 	 * (a chicken-egg dilemma).
 	 */
 	for_each_possible_cpu(cpu)
-		pageset_init(&per_cpu(boot_pageset, cpu));
+		per_cpu_pages_init(&per_cpu(boot_pageset, cpu), &per_cpu(boot_zonestats, cpu));
 
 	mminit_verify_zonelist();
 	cpuset_init_current_mems_allowed();
@@ -6576,14 +6576,13 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
 	WRITE_ONCE(pcp->high, high);
 }
 
-static void pageset_init(struct per_cpu_pageset *p)
+static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats)
 {
-	struct per_cpu_pages *pcp;
 	int migratetype;
 
-	memset(p, 0, sizeof(*p));
+	memset(pcp, 0, sizeof(*pcp));
+	memset(pzstats, 0, sizeof(*pzstats));
 
-	pcp = &p->pcp;
 	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
 		INIT_LIST_HEAD(&pcp->lists[migratetype]);
 
@@ -6600,12 +6599,12 @@ static void pageset_init(struct per_cpu_pageset *p)
 static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
 		unsigned long batch)
 {
-	struct per_cpu_pageset *p;
+	struct per_cpu_pages *pcp;
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
-		p = per_cpu_ptr(zone->pageset, cpu);
-		pageset_update(&p->pcp, high, batch);
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		pageset_update(pcp, high, batch);
 	}
 }
 
@@ -6640,13 +6639,20 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
 
 void __meminit setup_zone_pageset(struct zone *zone)
 {
-	struct per_cpu_pageset *p;
 	int cpu;
 
-	zone->pageset = alloc_percpu(struct per_cpu_pageset);
+	/* Size may be 0 on !SMP && !NUMA */
+	if (sizeof(struct per_cpu_zonestat) > 0)
+		zone->per_cpu_zonestats = alloc_percpu(struct per_cpu_zonestat);
+
+	zone->per_cpu_pageset = alloc_percpu(struct per_cpu_pages);
 	for_each_possible_cpu(cpu) {
-		p = per_cpu_ptr(zone->pageset, cpu);
-		pageset_init(p);
+		struct per_cpu_pages *pcp;
+		struct per_cpu_zonestat *pzstats;
+
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+		per_cpu_pages_init(pcp, pzstats);
 	}
 
 	zone_set_pageset_high_and_batch(zone);
@@ -6673,9 +6679,9 @@ void __init setup_per_cpu_pageset(void)
 	 * the nodes these zones are associated with.
 	 */
 	for_each_possible_cpu(cpu) {
-		struct per_cpu_pageset *pcp = &per_cpu(boot_pageset, cpu);
-		memset(pcp->vm_numa_stat_diff, 0,
-		       sizeof(pcp->vm_numa_stat_diff));
+		struct per_cpu_zonestat *pzstats = &per_cpu(boot_zonestats, cpu);
+		memset(pzstats->vm_numa_stat_diff, 0,
+		       sizeof(pzstats->vm_numa_stat_diff));
 	}
 #endif
 
@@ -6691,7 +6697,7 @@ static __meminit void zone_pcp_init(struct zone *zone)
 	 * relies on the ability of the linker to provide the
 	 * offset of a (static) per cpu variable into the per cpu area.
 	 */
-	zone->pageset = &boot_pageset;
+	zone->per_cpu_pageset = &boot_pageset;
 	zone->pageset_high = BOOT_PAGESET_HIGH;
 	zone->pageset_batch = BOOT_PAGESET_BATCH;
 
@@ -8954,17 +8960,19 @@ void zone_pcp_reset(struct zone *zone)
 {
 	unsigned long flags;
 	int cpu;
-	struct per_cpu_pageset *pset;
+	struct per_cpu_zonestat *pzstats;
 
 	/* avoid races with drain_pages()  */
 	local_irq_save(flags);
-	if (zone->pageset != &boot_pageset) {
+	if (zone->per_cpu_pageset != &boot_pageset) {
 		for_each_online_cpu(cpu) {
-			pset = per_cpu_ptr(zone->pageset, cpu);
-			drain_zonestat(zone, pset);
+			pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+			drain_zonestat(zone, pzstats);
 		}
-		free_percpu(zone->pageset);
-		zone->pageset = &boot_pageset;
+		free_percpu(zone->per_cpu_pageset);
+		free_percpu(zone->per_cpu_zonestats);
+		zone->per_cpu_pageset = &boot_pageset;
+		zone->per_cpu_zonestats = &boot_zonestats;
 	}
 	local_irq_restore(flags);
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 74b2c374b86c..8a8f1a26b231 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -44,7 +44,7 @@ static void zero_zone_numa_counters(struct zone *zone)
 	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
 		atomic_long_set(&zone->vm_numa_stat[item], 0);
 		for_each_online_cpu(cpu)
-			per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item]
+			per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_stat_diff[item]
 						= 0;
 	}
 }
@@ -266,7 +266,7 @@ void refresh_zone_stat_thresholds(void)
 		for_each_online_cpu(cpu) {
 			int pgdat_threshold;
 
-			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+			per_cpu_ptr(zone->per_cpu_zonestats, cpu)->stat_threshold
 							= threshold;
 
 			/* Base nodestat threshold on the largest populated zone. */
@@ -303,7 +303,7 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
 
 		threshold = (*calculate_pressure)(zone);
 		for_each_online_cpu(cpu)
-			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+			per_cpu_ptr(zone->per_cpu_zonestats, cpu)->stat_threshold
 							= threshold;
 	}
 }
@@ -316,7 +316,7 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 			   long delta)
 {
-	struct per_cpu_pageset __percpu *pcp = zone->pageset;
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	long x;
 	long t;
@@ -389,7 +389,7 @@ EXPORT_SYMBOL(__mod_node_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset __percpu *pcp = zone->pageset;
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	s8 v, t;
 
@@ -435,7 +435,7 @@ EXPORT_SYMBOL(__inc_node_page_state);
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset __percpu *pcp = zone->pageset;
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	s8 v, t;
 
@@ -495,7 +495,7 @@ EXPORT_SYMBOL(__dec_node_page_state);
 static inline void mod_zone_state(struct zone *zone,
        enum zone_stat_item item, long delta, int overstep_mode)
 {
-	struct per_cpu_pageset __percpu *pcp = zone->pageset;
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	long o, n, t, z;
 
@@ -781,19 +781,20 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	int changes = 0;
 
 	for_each_populated_zone(zone) {
-		struct per_cpu_pageset __percpu *p = zone->pageset;
+		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 			int v;
 
-			v = this_cpu_xchg(p->vm_stat_diff[i], 0);
+			v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);
 			if (v) {
 
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 #ifdef CONFIG_NUMA
 				/* 3 seconds idle till flush */
-				__this_cpu_write(p->expire, 3);
+				__this_cpu_write(pcp->expire, 3);
 #endif
 			}
 		}
@@ -801,12 +802,12 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 		for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
 			int v;
 
-			v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0);
+			v = this_cpu_xchg(pzstats->vm_numa_stat_diff[i], 0);
 			if (v) {
 
 				atomic_long_add(v, &zone->vm_numa_stat[i]);
 				global_numa_diff[i] += v;
-				__this_cpu_write(p->expire, 3);
+				__this_cpu_write(pcp->expire, 3);
 			}
 		}
 
@@ -819,23 +820,23 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			 * Check if there are pages remaining in this pageset
 			 * if not then there is nothing to expire.
 			 */
-			if (!__this_cpu_read(p->expire) ||
-			       !__this_cpu_read(p->pcp.count))
+			if (!__this_cpu_read(pcp->expire) ||
+			       !__this_cpu_read(pcp->count))
 				continue;
 
 			/*
 			 * We never drain zones local to this processor.
 			 */
 			if (zone_to_nid(zone) == numa_node_id()) {
-				__this_cpu_write(p->expire, 0);
+				__this_cpu_write(pcp->expire, 0);
 				continue;
 			}
 
-			if (__this_cpu_dec_return(p->expire))
+			if (__this_cpu_dec_return(pcp->expire))
 				continue;
 
-			if (__this_cpu_read(p->pcp.count)) {
-				drain_zone_pages(zone, this_cpu_ptr(&p->pcp));
+			if (__this_cpu_read(pcp->count)) {
+				drain_zone_pages(zone, this_cpu_ptr(pcp));
 				changes++;
 			}
 		}
@@ -882,27 +883,27 @@ void cpu_vm_stats_fold(int cpu)
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 
 	for_each_populated_zone(zone) {
-		struct per_cpu_pageset *p;
+		struct per_cpu_zonestat *pzstats;
 
-		p = per_cpu_ptr(zone->pageset, cpu);
+		pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-			if (p->vm_stat_diff[i]) {
+			if (pzstats->vm_stat_diff[i]) {
 				int v;
 
-				v = p->vm_stat_diff[i];
-				p->vm_stat_diff[i] = 0;
+				v = pzstats->vm_stat_diff[i];
+				pzstats->vm_stat_diff[i] = 0;
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
 
 #ifdef CONFIG_NUMA
 		for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-			if (p->vm_numa_stat_diff[i]) {
+			if (pzstats->vm_numa_stat_diff[i]) {
 				int v;
 
-				v = p->vm_numa_stat_diff[i];
-				p->vm_numa_stat_diff[i] = 0;
+				v = pzstats->vm_numa_stat_diff[i];
+				pzstats->vm_numa_stat_diff[i] = 0;
 				atomic_long_add(v, &zone->vm_numa_stat[i]);
 				global_numa_diff[i] += v;
 			}
@@ -936,24 +937,24 @@ void cpu_vm_stats_fold(int cpu)
  * this is only called if !populated_zone(zone), which implies no other users of
  * pset->vm_stat_diff[] exsist.
  */
-void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
+void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
 {
 	int i;
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (pset->vm_stat_diff[i]) {
-			int v = pset->vm_stat_diff[i];
-			pset->vm_stat_diff[i] = 0;
+		if (pzstats->vm_stat_diff[i]) {
+			int v = pzstats->vm_stat_diff[i];
+			pzstats->vm_stat_diff[i] = 0;
 			atomic_long_add(v, &zone->vm_stat[i]);
 			atomic_long_add(v, &vm_zone_stat[i]);
 		}
 
 #ifdef CONFIG_NUMA
 	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		if (pset->vm_numa_stat_diff[i]) {
-			int v = pset->vm_numa_stat_diff[i];
+		if (pzstats->vm_numa_stat_diff[i]) {
+			int v = pzstats->vm_numa_stat_diff[i];
 
-			pset->vm_numa_stat_diff[i] = 0;
+			pzstats->vm_numa_stat_diff[i] = 0;
 			atomic_long_add(v, &zone->vm_numa_stat[i]);
 			atomic_long_add(v, &vm_numa_stat[i]);
 		}
@@ -965,8 +966,8 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
 void __inc_numa_state(struct zone *zone,
 				 enum numa_stat_item item)
 {
-	struct per_cpu_pageset __percpu *pcp = zone->pageset;
-	u16 __percpu *p = pcp->vm_numa_stat_diff + item;
+	struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+	u16 __percpu *p = pzstats->vm_numa_stat_diff + item;
 	u16 v;
 
 	v = __this_cpu_inc_return(*p);
@@ -1685,21 +1686,23 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 
 	seq_printf(m, "\n  pagesets");
 	for_each_online_cpu(i) {
-		struct per_cpu_pageset *pageset;
+		struct per_cpu_pages *pcp;
+		struct per_cpu_zonestat *pzstats;
 
-		pageset = per_cpu_ptr(zone->pageset, i);
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, i);
+		pzstats = per_cpu_ptr(zone->per_cpu_zonestats, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"
 			   "\n              high:  %i"
 			   "\n              batch: %i",
 			   i,
-			   pageset->pcp.count,
-			   pageset->pcp.high,
-			   pageset->pcp.batch);
+			   pcp->count,
+			   pcp->high,
+			   pcp->batch);
 #ifdef CONFIG_SMP
 		seq_printf(m, "\n  vm stats threshold: %d",
-				pageset->stat_threshold);
+				pzstats->stat_threshold);
 #endif
 	}
 	seq_printf(m,
@@ -1910,17 +1913,18 @@ static bool need_update(int cpu)
 	struct zone *zone;
 
 	for_each_populated_zone(zone) {
-		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
+		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
 		struct per_cpu_nodestat *n;
+
 		/*
 		 * The fast way of checking if there are any vmstat diffs.
 		 */
-		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
-			       sizeof(p->vm_stat_diff[0])))
+		if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
+			       sizeof(pzstats->vm_stat_diff[0])))
 			return true;
 #ifdef CONFIG_NUMA
-		if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS *
-			       sizeof(p->vm_numa_stat_diff[0])))
+		if (memchr_inv(pzstats->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS *
+			       sizeof(pzstats->vm_numa_stat_diff[0])))
 			return true;
 #endif
 		if (last_pgdat == zone->zone_pgdat)
-- 
2.26.2



* [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
  2021-04-07 20:24 ` [PATCH 01/11] mm/page_alloc: Split per cpu page lists and zone stats Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-08 10:52   ` Peter Zijlstra
  2021-04-07 20:24 ` [PATCH 03/11] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove Mel Gorman
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

There is a lack of clarity about what exactly local_irq_save/local_irq_restore
protects in page_alloc.c. It conflates the protection of per-cpu page
allocation structures with per-cpu vmstat deltas.

This patch protects the PCP structure using local_lock, which for most
configurations is identical to IRQ enabling/disabling. The scope of the
lock is still wider than it should be but this is reduced later in the series.

[lkp@intel.com: Make pagesets static]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 50 +++++++++++++++++++++++++++++-------------
 2 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4393ac27336..106da8fbc72a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -20,6 +20,7 @@
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
 #include <linux/page-flags.h>
+#include <linux/local_lock.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -337,6 +338,7 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
+/* Fields and list protected by pagesets local_lock in page_alloc.c */
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a68bacddcae0..e9e60d1a85d4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -112,6 +112,13 @@ typedef int __bitwise fpi_t;
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_FRACTION	(8)
 
+struct pagesets {
+	local_lock_t lock;
+};
+static DEFINE_PER_CPU(struct pagesets, pagesets) = {
+	.lock = INIT_LOCAL_LOCK(lock),
+};
+
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1421,6 +1428,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
+	/*
+	 * local_lock_irq held so equivalent to spin_lock_irqsave for
+	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
+	 */
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 
@@ -1541,6 +1552,11 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
+
+	/*
+	 * TODO FIX: Disable IRQs before acquiring IRQ-safe zone->lock
+	 * and protect vmstat updates.
+	 */
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, pfn, order, migratetype,
@@ -2910,6 +2926,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 {
 	int i, allocated = 0;
 
+	/*
+	 * local_lock_irq held so equivalent to spin_lock_irqsave for
+	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
+	 */
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
@@ -2962,12 +2982,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	unsigned long flags;
 	int to_drain, batch;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
 	if (to_drain > 0)
 		free_pcppages_bulk(zone, to_drain, pcp);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 #endif
 
@@ -2983,13 +3003,13 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	unsigned long flags;
 	struct per_cpu_pages *pcp;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 	if (pcp->count)
 		free_pcppages_bulk(zone, pcp->count, pcp);
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3252,9 +3272,9 @@ void free_unref_page(struct page *page)
 	if (!free_unref_page_prepare(page, pfn))
 		return;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	free_unref_page_commit(page, pfn);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3274,7 +3294,7 @@ void free_unref_page_list(struct list_head *list)
 		set_page_private(page, pfn);
 	}
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_private(page);
 
@@ -3287,12 +3307,12 @@ void free_unref_page_list(struct list_head *list)
 		 * a large list of pages to free.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
-			local_irq_restore(flags);
+			local_unlock_irqrestore(&pagesets.lock, flags);
 			batch_count = 0;
-			local_irq_save(flags);
+			local_lock_irqsave(&pagesets.lock, flags);
 		}
 	}
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3449,7 +3469,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct page *page;
 	unsigned long flags;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
@@ -3457,7 +3477,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
 		zone_statistics(preferred_zone, zone);
 	}
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 	return page;
 }
 
@@ -5052,7 +5072,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		goto failed;
 
 	/* Attempt the batch allocation */
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp_list = &pcp->lists[ac.migratetype];
 
@@ -5090,12 +5110,12 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 
 	return nr_populated;
 
 failed_irq:
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 
 failed:
 	page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
-- 
2.26.2



* [PATCH 03/11] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
  2021-04-07 20:24 ` [PATCH 01/11] mm/page_alloc: Split per cpu page lists and zone stats Mel Gorman
  2021-04-07 20:24 ` [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-07 20:24 ` [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters Mel Gorman
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

zone_pcp_reset allegedly protects against a race with drain_pages
using local_irq_save but this is bogus. local_irq_save only operates
on the local CPU. If memory hotplug is running on CPU A and drain_pages
is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
offers no protection.

This patch reorders memory hotremove such that the PCP structures
relevant to the zone are no longer reachable by the time the structures
are freed.  With this reordering, no protection is required to prevent
a use-after-free and the IRQs can be left enabled. zone_pcp_reset is
renamed to zone_pcp_destroy to make it clear that the per-cpu structures
are deleted when the function returns.
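
An illustrative interleaving (editor's sketch, not part of the patch) of why
the old local_irq_save in zone_pcp_reset offered no protection:

   CPU A (hot remove)                      CPU B (drain_pages)
   local_irq_save(flags);
   free_percpu(zone->per_cpu_pageset);
                                           pcp = per_cpu_ptr(zone->per_cpu_pageset, B);
                                           free_pcppages_bulk(zone, ...);  /* use-after-free */
   local_irq_restore(flags);

Disabling IRQs on CPU A only prevents CPU A itself from being interrupted;
CPU B proceeds regardless. The reordering below removes the zone from the
zonelists and drains/disables the PCP before the structures are freed, so
that window no longer exists.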

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/internal.h       |  2 +-
 mm/memory_hotplug.c | 10 +++++++---
 mm/page_alloc.c     | 22 ++++++++++++++++------
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 09adf152a10b..cc34ce4461b7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -203,7 +203,7 @@ extern void free_unref_page(struct page *page);
 extern void free_unref_page_list(struct list_head *list);
 
 extern void zone_pcp_update(struct zone *zone);
-extern void zone_pcp_reset(struct zone *zone);
+extern void zone_pcp_destroy(struct zone *zone);
 extern void zone_pcp_disable(struct zone *zone);
 extern void zone_pcp_enable(struct zone *zone);
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0cdbbfbc5757..3d059c9f9c2d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1687,12 +1687,16 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
-	zone_pcp_enable(zone);
-
 	/* removal success */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
 	zone->present_pages -= nr_pages;
 
+	/*
+	 * Restore PCP after managed pages has been updated. Unpopulated
+	 * zones PCP structures will remain unusable.
+	 */
+	zone_pcp_enable(zone);
+
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
 	zone->zone_pgdat->node_present_pages -= nr_pages;
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
@@ -1700,8 +1704,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	init_per_zone_wmark_min();
 
 	if (!populated_zone(zone)) {
-		zone_pcp_reset(zone);
 		build_all_zonelists(NULL);
+		zone_pcp_destroy(zone);
 	} else
 		zone_pcp_update(zone);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e9e60d1a85d4..a8630003612b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8972,18 +8972,29 @@ void zone_pcp_disable(struct zone *zone)
 
 void zone_pcp_enable(struct zone *zone)
 {
-	__zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+	/*
+	 * If the zone is populated, restore the high and batch counts.
+	 * If unpopulated, leave the high and batch count as 0 and 1
+	 * respectively as done by zone_pcp_disable. The per-cpu
+	 * structures will later be freed by zone_pcp_destroy.
+	 */
+	if (populated_zone(zone))
+		__zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+
 	mutex_unlock(&pcp_batch_high_lock);
 }
 
-void zone_pcp_reset(struct zone *zone)
+/*
+ * Called when a zone has been hot-removed. At this point, the PCP has been
+ * drained, disabled and the zone is removed from the zonelists so the
+ * structures are no longer in use. PCP was disabled/drained by
+ * zone_pcp_disable. This function will drain any remaining vmstat deltas.
+ */
+void zone_pcp_destroy(struct zone *zone)
 {
-	unsigned long flags;
 	int cpu;
 	struct per_cpu_zonestat *pzstats;
 
-	/* avoid races with drain_pages()  */
-	local_irq_save(flags);
 	if (zone->per_cpu_pageset != &boot_pageset) {
 		for_each_online_cpu(cpu) {
 			pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
@@ -8994,7 +9005,6 @@ void zone_pcp_reset(struct zone *zone)
 		zone->per_cpu_pageset = &boot_pageset;
 		zone->per_cpu_zonestats = &boot_zonestats;
 	}
-	local_irq_restore(flags);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-- 
2.26.2



* [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
                   ` (2 preceding siblings ...)
  2021-04-07 20:24 ` [PATCH 03/11] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-14 12:56   ` Vlastimil Babka
  2021-04-07 20:24 ` [PATCH 05/11] mm/vmstat: Inline NUMA event counter updates Mel Gorman
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

NUMA statistics are maintained on the zone level for hits, misses, foreign
etc., but nothing relies on them being perfectly accurate for functional
correctness. The counters are used by userspace to get a general overview
of a workload's NUMA behaviour but the page allocator incurs a high cost to
maintain perfect accuracy similar to what is required for a vmstat like
NR_FREE_PAGES. There is even a sysctl, vm.numa_stat, to allow userspace to
turn off the collection of NUMA statistics like NUMA_HIT.

This patch converts NUMA_HIT and friends to be NUMA events with similar
accuracy to VM events. There is a possibility that slight errors will be
introduced but the overall trend as seen by userspace will be similar.
Note that while these counters could be maintained at the node level,
doing so would have a user-visible impact.
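
As a condensed view of the read-side mechanism introduced below, the per-cpu
events are only folded into the zone counters when somebody reads them
(node_read_numastat and /proc/vmstat via vmstat_start):

   static void fold_vm_zone_numa_events(struct zone *zone)
   {
           int zone_numa_events[NR_VM_NUMA_EVENT_ITEMS] = { 0, };
           int cpu;
           enum numa_stat_item item;

           for_each_online_cpu(cpu) {
                   struct per_cpu_zonestat *pzstats;

                   pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
                   for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++)
                           zone_numa_events[item] += pzstats->vm_numa_event[item];
           }

           for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++)
                   atomic_long_set(&zone->vm_numa_events[item], zone_numa_events[item]);
   }

   void fold_vm_numa_events(void)
   {
           struct zone *zone;

           for_each_populated_zone(zone)
                   fold_vm_zone_numa_events(zone);
   }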

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 drivers/base/node.c    |  18 +++--
 include/linux/mmzone.h |  11 ++-
 include/linux/vmstat.h |  42 +++++-----
 mm/mempolicy.c         |   2 +-
 mm/page_alloc.c        |  12 +--
 mm/vmstat.c            | 175 ++++++++++++-----------------------------
 6 files changed, 93 insertions(+), 167 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index f449dbb2c746..443a609db428 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -484,6 +484,7 @@ static DEVICE_ATTR(meminfo, 0444, node_read_meminfo, NULL);
 static ssize_t node_read_numastat(struct device *dev,
 				  struct device_attribute *attr, char *buf)
 {
+	fold_vm_numa_events();
 	return sysfs_emit(buf,
 			  "numa_hit %lu\n"
 			  "numa_miss %lu\n"
@@ -491,12 +492,12 @@ static ssize_t node_read_numastat(struct device *dev,
 			  "interleave_hit %lu\n"
 			  "local_node %lu\n"
 			  "other_node %lu\n",
-			  sum_zone_numa_state(dev->id, NUMA_HIT),
-			  sum_zone_numa_state(dev->id, NUMA_MISS),
-			  sum_zone_numa_state(dev->id, NUMA_FOREIGN),
-			  sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT),
-			  sum_zone_numa_state(dev->id, NUMA_LOCAL),
-			  sum_zone_numa_state(dev->id, NUMA_OTHER));
+			  sum_zone_numa_event_state(dev->id, NUMA_HIT),
+			  sum_zone_numa_event_state(dev->id, NUMA_MISS),
+			  sum_zone_numa_event_state(dev->id, NUMA_FOREIGN),
+			  sum_zone_numa_event_state(dev->id, NUMA_INTERLEAVE_HIT),
+			  sum_zone_numa_event_state(dev->id, NUMA_LOCAL),
+			  sum_zone_numa_event_state(dev->id, NUMA_OTHER));
 }
 static DEVICE_ATTR(numastat, 0444, node_read_numastat, NULL);
 
@@ -514,10 +515,11 @@ static ssize_t node_read_vmstat(struct device *dev,
 				     sum_zone_node_page_state(nid, i));
 
 #ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
+	fold_vm_numa_events();
+	for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++)
 		len += sysfs_emit_at(buf, len, "%s %lu\n",
 				     numa_stat_name(i),
-				     sum_zone_numa_state(nid, i));
+				     sum_zone_numa_event_state(nid, i));
 
 #endif
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 106da8fbc72a..693cd5f24f7d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -135,10 +135,10 @@ enum numa_stat_item {
 	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
-	NR_VM_NUMA_STAT_ITEMS
+	NR_VM_NUMA_EVENT_ITEMS
 };
 #else
-#define NR_VM_NUMA_STAT_ITEMS 0
+#define NR_VM_NUMA_EVENT_ITEMS 0
 #endif
 
 enum zone_stat_item {
@@ -357,7 +357,10 @@ struct per_cpu_zonestat {
 	s8 stat_threshold;
 #endif
 #ifdef CONFIG_NUMA
-	u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];
+	u16 vm_numa_stat_diff[NR_VM_NUMA_EVENT_ITEMS];
+#endif
+#ifdef CONFIG_NUMA
+	unsigned long vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
 #endif
 };
 
@@ -609,7 +612,7 @@ struct zone {
 	ZONE_PADDING(_pad3_)
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
-	atomic_long_t		vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
+	atomic_long_t		vm_numa_events[NR_VM_NUMA_EVENT_ITEMS];
 } ____cacheline_internodealigned_in_smp;
 
 enum pgdat_flags {
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 1736ea9d24a7..fc14415223c5 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -138,35 +138,27 @@ static inline void vm_events_fold_cpu(int cpu)
  * Zone and node-based page accounting with per cpu differentials.
  */
 extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
-extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
 extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
 
 #ifdef CONFIG_NUMA
-static inline void zone_numa_state_add(long x, struct zone *zone,
-				 enum numa_stat_item item)
-{
-	atomic_long_add(x, &zone->vm_numa_stat[item]);
-	atomic_long_add(x, &vm_numa_stat[item]);
-}
-
-static inline unsigned long global_numa_state(enum numa_stat_item item)
+static inline unsigned long zone_numa_event_state(struct zone *zone,
+					enum numa_stat_item item)
 {
-	long x = atomic_long_read(&vm_numa_stat[item]);
-
-	return x;
+	return atomic_long_read(&zone->vm_numa_events[item]);
 }
 
-static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
-					enum numa_stat_item item)
+static inline unsigned long
+global_numa_event_state(enum numa_stat_item item)
 {
-	long x = atomic_long_read(&zone->vm_numa_stat[item]);
-	int cpu;
+	struct zone *zone;
+	unsigned long x = 0;
 
-	for_each_online_cpu(cpu)
-		x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_stat_diff[item];
+	for_each_populated_zone(zone)
+		x += zone_numa_event_state(zone, item);
 
 	return x;
 }
+
 #endif /* CONFIG_NUMA */
 
 static inline void zone_page_state_add(long x, struct zone *zone,
@@ -245,18 +237,22 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 }
 
 #ifdef CONFIG_NUMA
-extern void __inc_numa_state(struct zone *zone, enum numa_stat_item item);
+extern void __count_numa_event(struct zone *zone, enum numa_stat_item item);
 extern unsigned long sum_zone_node_page_state(int node,
 					      enum zone_stat_item item);
-extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
+extern unsigned long sum_zone_numa_event_state(int node, enum numa_stat_item item);
 extern unsigned long node_page_state(struct pglist_data *pgdat,
 						enum node_stat_item item);
 extern unsigned long node_page_state_pages(struct pglist_data *pgdat,
 					   enum node_stat_item item);
+extern void fold_vm_numa_events(void);
 #else
 #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
 #define node_page_state(node, item) global_node_page_state(item)
 #define node_page_state_pages(node, item) global_node_page_state_pages(item)
+static inline void fold_vm_numa_events(void)
+{
+}
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
@@ -428,7 +424,7 @@ static inline const char *numa_stat_name(enum numa_stat_item item)
 static inline const char *node_stat_name(enum node_stat_item item)
 {
 	return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
-			   NR_VM_NUMA_STAT_ITEMS +
+			   NR_VM_NUMA_EVENT_ITEMS +
 			   item];
 }
 
@@ -440,7 +436,7 @@ static inline const char *lru_list_name(enum lru_list lru)
 static inline const char *writeback_stat_name(enum writeback_stat_item item)
 {
 	return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
-			   NR_VM_NUMA_STAT_ITEMS +
+			   NR_VM_NUMA_EVENT_ITEMS +
 			   NR_VM_NODE_STAT_ITEMS +
 			   item];
 }
@@ -449,7 +445,7 @@ static inline const char *writeback_stat_name(enum writeback_stat_item item)
 static inline const char *vm_event_name(enum vm_event_item item)
 {
 	return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
-			   NR_VM_NUMA_STAT_ITEMS +
+			   NR_VM_NUMA_EVENT_ITEMS +
 			   NR_VM_NODE_STAT_ITEMS +
 			   NR_VM_WRITEBACK_STAT_ITEMS +
 			   item];
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index cd0295567a04..99c06a9ae7ee 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2146,7 +2146,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 		return page;
 	if (page && page_to_nid(page) == nid) {
 		preempt_disable();
-		__inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
+		__count_numa_event(page_zone(page), NUMA_INTERLEAVE_HIT);
 		preempt_enable();
 	}
 	return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8630003612b..73e618d06315 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3424,12 +3424,12 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 		local_stat = NUMA_OTHER;
 
 	if (zone_to_nid(z) == zone_to_nid(preferred_zone))
-		__inc_numa_state(z, NUMA_HIT);
+		__count_numa_event(z, NUMA_HIT);
 	else {
-		__inc_numa_state(z, NUMA_MISS);
-		__inc_numa_state(preferred_zone, NUMA_FOREIGN);
+		__count_numa_event(z, NUMA_MISS);
+		__count_numa_event(preferred_zone, NUMA_FOREIGN);
 	}
-	__inc_numa_state(z, local_stat);
+	__count_numa_event(z, local_stat);
 #endif
 }
 
@@ -6700,8 +6700,8 @@ void __init setup_per_cpu_pageset(void)
 	 */
 	for_each_possible_cpu(cpu) {
 		struct per_cpu_zonestat *pzstats = &per_cpu(boot_zonestats, cpu);
-		memset(pzstats->vm_numa_stat_diff, 0,
-		       sizeof(pzstats->vm_numa_stat_diff));
+		memset(pzstats->vm_numa_event, 0,
+		       sizeof(pzstats->vm_numa_event));
 	}
 #endif
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8a8f1a26b231..63bd84d122c0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -41,38 +41,24 @@ static void zero_zone_numa_counters(struct zone *zone)
 {
 	int item, cpu;
 
-	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
-		atomic_long_set(&zone->vm_numa_stat[item], 0);
-		for_each_online_cpu(cpu)
-			per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_stat_diff[item]
+	for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) {
+		atomic_long_set(&zone->vm_numa_events[item], 0);
+		for_each_online_cpu(cpu) {
+			per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_event[item]
 						= 0;
+		}
 	}
 }
 
-/* zero numa counters of all the populated zones */
-static void zero_zones_numa_counters(void)
+static void invalidate_numa_statistics(void)
 {
 	struct zone *zone;
 
+	/* zero numa counters of all the populated zones */
 	for_each_populated_zone(zone)
 		zero_zone_numa_counters(zone);
 }
 
-/* zero global numa counters */
-static void zero_global_numa_counters(void)
-{
-	int item;
-
-	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++)
-		atomic_long_set(&vm_numa_stat[item], 0);
-}
-
-static void invalid_numa_statistics(void)
-{
-	zero_zones_numa_counters();
-	zero_global_numa_counters();
-}
-
 static DEFINE_MUTEX(vm_numa_stat_lock);
 
 int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
@@ -94,7 +80,7 @@ int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
 		pr_info("enable numa statistics\n");
 	} else {
 		static_branch_disable(&vm_numa_stat_key);
-		invalid_numa_statistics();
+		invalidate_numa_statistics();
 		pr_info("disable numa statistics, and clear numa counters\n");
 	}
 
@@ -161,10 +147,8 @@ void vm_events_fold_cpu(int cpu)
  * vm_stat contains the global counters
  */
 atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
-atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp;
 atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
 EXPORT_SYMBOL(vm_zone_stat);
-EXPORT_SYMBOL(vm_numa_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
 #ifdef CONFIG_SMP
@@ -706,8 +690,7 @@ EXPORT_SYMBOL(dec_node_page_state);
  * Fold a differential into the global counters.
  * Returns the number of counters updated.
  */
-#ifdef CONFIG_NUMA
-static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
+static int fold_diff(int *zone_diff, int *node_diff)
 {
 	int i;
 	int changes = 0;
@@ -718,12 +701,6 @@ static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
 			changes++;
 	}
 
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		if (numa_diff[i]) {
-			atomic_long_add(numa_diff[i], &vm_numa_stat[i]);
-			changes++;
-	}
-
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 		if (node_diff[i]) {
 			atomic_long_add(node_diff[i], &vm_node_stat[i]);
@@ -731,26 +708,36 @@ static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
 	}
 	return changes;
 }
-#else
-static int fold_diff(int *zone_diff, int *node_diff)
+
+#ifdef CONFIG_NUMA
+static void fold_vm_zone_numa_events(struct zone *zone)
 {
-	int i;
-	int changes = 0;
+	int zone_numa_events[NR_VM_NUMA_EVENT_ITEMS] = { 0, };
+	int cpu;
+	enum numa_stat_item item;
 
-	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (zone_diff[i]) {
-			atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
-			changes++;
+	for_each_online_cpu(cpu) {
+		struct per_cpu_zonestat *pzstats;
+
+		pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+		for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) {
+			zone_numa_events[item] += pzstats->vm_numa_event[item];
+		}
 	}
 
-	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-		if (node_diff[i]) {
-			atomic_long_add(node_diff[i], &vm_node_stat[i]);
-			changes++;
+	for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) {
+		atomic_long_set(&zone->vm_numa_events[item], zone_numa_events[item]);
 	}
-	return changes;
 }
-#endif /* CONFIG_NUMA */
+
+void fold_vm_numa_events(void)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		fold_vm_zone_numa_events(zone);
+}
+#endif
 
 /*
  * Update the zone counters for the current cpu.
@@ -774,9 +761,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
-#ifdef CONFIG_NUMA
-	int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
-#endif
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
@@ -799,17 +783,6 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			}
 		}
 #ifdef CONFIG_NUMA
-		for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
-			int v;
-
-			v = this_cpu_xchg(pzstats->vm_numa_stat_diff[i], 0);
-			if (v) {
-
-				atomic_long_add(v, &zone->vm_numa_stat[i]);
-				global_numa_diff[i] += v;
-				__this_cpu_write(pcp->expire, 3);
-			}
-		}
 
 		if (do_pagesets) {
 			cond_resched();
@@ -857,12 +830,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 		}
 	}
 
-#ifdef CONFIG_NUMA
-	changes += fold_diff(global_zone_diff, global_numa_diff,
-			     global_node_diff);
-#else
 	changes += fold_diff(global_zone_diff, global_node_diff);
-#endif
 	return changes;
 }
 
@@ -877,9 +845,6 @@ void cpu_vm_stats_fold(int cpu)
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
-#ifdef CONFIG_NUMA
-	int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
-#endif
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 
 	for_each_populated_zone(zone) {
@@ -887,7 +852,7 @@ void cpu_vm_stats_fold(int cpu)
 
 		pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
 
-		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 			if (pzstats->vm_stat_diff[i]) {
 				int v;
 
@@ -896,18 +861,7 @@ void cpu_vm_stats_fold(int cpu)
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
-
-#ifdef CONFIG_NUMA
-		for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-			if (pzstats->vm_numa_stat_diff[i]) {
-				int v;
-
-				v = pzstats->vm_numa_stat_diff[i];
-				pzstats->vm_numa_stat_diff[i] = 0;
-				atomic_long_add(v, &zone->vm_numa_stat[i]);
-				global_numa_diff[i] += v;
-			}
-#endif
+		}
 	}
 
 	for_each_online_pgdat(pgdat) {
@@ -926,11 +880,7 @@ void cpu_vm_stats_fold(int cpu)
 			}
 	}
 
-#ifdef CONFIG_NUMA
-	fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
-#else
 	fold_diff(global_zone_diff, global_node_diff);
-#endif
 }
 
 /*
@@ -948,34 +898,17 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
 			atomic_long_add(v, &zone->vm_stat[i]);
 			atomic_long_add(v, &vm_zone_stat[i]);
 		}
-
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		if (pzstats->vm_numa_stat_diff[i]) {
-			int v = pzstats->vm_numa_stat_diff[i];
-
-			pzstats->vm_numa_stat_diff[i] = 0;
-			atomic_long_add(v, &zone->vm_numa_stat[i]);
-			atomic_long_add(v, &vm_numa_stat[i]);
-		}
-#endif
 }
 #endif
 
 #ifdef CONFIG_NUMA
-void __inc_numa_state(struct zone *zone,
+/* See __count_vm_event comment on why raw_cpu_inc is used. */
+void __count_numa_event(struct zone *zone,
 				 enum numa_stat_item item)
 {
 	struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-	u16 __percpu *p = pzstats->vm_numa_stat_diff + item;
-	u16 v;
 
-	v = __this_cpu_inc_return(*p);
-
-	if (unlikely(v > NUMA_STATS_THRESHOLD)) {
-		zone_numa_state_add(v, zone, item);
-		__this_cpu_write(*p, 0);
-	}
+	raw_cpu_inc(pzstats->vm_numa_event[item]);
 }
 
 /*
@@ -1000,15 +933,15 @@ unsigned long sum_zone_node_page_state(int node,
  * Determine the per node value of a numa stat item. To avoid deviation,
  * the per cpu stat number in vm_numa_stat_diff[] is also included.
  */
-unsigned long sum_zone_numa_state(int node,
+unsigned long sum_zone_numa_event_state(int node,
 				 enum numa_stat_item item)
 {
 	struct zone *zones = NODE_DATA(node)->node_zones;
-	int i;
 	unsigned long count = 0;
+	int i;
 
 	for (i = 0; i < MAX_NR_ZONES; i++)
-		count += zone_numa_state_snapshot(zones + i, item);
+		count += zone_numa_event_state(zones + i, item);
 
 	return count;
 }
@@ -1679,9 +1612,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 			   zone_page_state(zone, i));
 
 #ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
+	for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++)
 		seq_printf(m, "\n      %-12s %lu", numa_stat_name(i),
-			   zone_numa_state_snapshot(zone, i));
+			   zone_numa_event_state(zone, i));
 #endif
 
 	seq_printf(m, "\n  pagesets");
@@ -1735,7 +1668,7 @@ static const struct seq_operations zoneinfo_op = {
 };
 
 #define NR_VMSTAT_ITEMS (NR_VM_ZONE_STAT_ITEMS + \
-			 NR_VM_NUMA_STAT_ITEMS + \
+			 NR_VM_NUMA_EVENT_ITEMS + \
 			 NR_VM_NODE_STAT_ITEMS + \
 			 NR_VM_WRITEBACK_STAT_ITEMS + \
 			 (IS_ENABLED(CONFIG_VM_EVENT_COUNTERS) ? \
@@ -1750,6 +1683,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 		return NULL;
 
 	BUILD_BUG_ON(ARRAY_SIZE(vmstat_text) < NR_VMSTAT_ITEMS);
+	fold_vm_numa_events();
 	v = kmalloc_array(NR_VMSTAT_ITEMS, sizeof(unsigned long), GFP_KERNEL);
 	m->private = v;
 	if (!v)
@@ -1759,9 +1693,9 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	v += NR_VM_ZONE_STAT_ITEMS;
 
 #ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++)
-		v[i] = global_numa_state(i);
-	v += NR_VM_NUMA_STAT_ITEMS;
+	for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++)
+		v[i] = global_numa_event_state(i);
+	v += NR_VM_NUMA_EVENT_ITEMS;
 #endif
 
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
@@ -1864,16 +1798,6 @@ int vmstat_refresh(struct ctl_table *table, int write,
 			err = -EINVAL;
 		}
 	}
-#ifdef CONFIG_NUMA
-	for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) {
-		val = atomic_long_read(&vm_numa_stat[i]);
-		if (val < 0) {
-			pr_warn("%s: %s %ld\n",
-				__func__, numa_stat_name(i), val);
-			err = -EINVAL;
-		}
-	}
-#endif
 	if (err)
 		return err;
 	if (write)
@@ -1922,8 +1846,9 @@ static bool need_update(int cpu)
 		if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
 			       sizeof(pzstats->vm_stat_diff[0])))
 			return true;
+
 #ifdef CONFIG_NUMA
-		if (memchr_inv(pzstats->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS *
+		if (memchr_inv(pzstats->vm_numa_stat_diff, 0, NR_VM_NUMA_EVENT_ITEMS *
 			       sizeof(pzstats->vm_numa_stat_diff[0])))
 			return true;
 #endif
-- 
2.26.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 05/11] mm/vmstat: Inline NUMA event counter updates
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
                   ` (3 preceding siblings ...)
  2021-04-07 20:24 ` [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-07 20:24 ` [PATCH 06/11] mm/page_alloc: Batch the accounting updates in the bulk allocator Mel Gorman
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

__count_numa_event is small enough to be treated similarly to
__count_vm_event so inline it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/vmstat.h | 9 +++++++++
 mm/vmstat.c            | 9 ---------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fc14415223c5..dde4dec4e7dd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -237,6 +237,15 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 }
 
 #ifdef CONFIG_NUMA
+/* See __count_vm_event comment on why raw_cpu_inc is used. */
+static inline void
+__count_numa_event(struct zone *zone, enum numa_stat_item item)
+{
+	struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+
+	raw_cpu_inc(pzstats->vm_numa_event[item]);
+}
+
 extern void __count_numa_event(struct zone *zone, enum numa_stat_item item);
 extern unsigned long sum_zone_node_page_state(int node,
 					      enum zone_stat_item item);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 63bd84d122c0..b853df95ed0c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -902,15 +902,6 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
 #endif
 
 #ifdef CONFIG_NUMA
-/* See __count_vm_event comment on why raw_cpu_inc is used. */
-void __count_numa_event(struct zone *zone,
-				 enum numa_stat_item item)
-{
-	struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-
-	raw_cpu_inc(pzstats->vm_numa_event[item]);
-}
-
 /*
  * Determine the per node value of a stat item. This function
  * is called frequently in a NUMA machine, so try to be as
-- 
2.26.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 06/11] mm/page_alloc: Batch the accounting updates in the bulk allocator
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
                   ` (4 preceding siblings ...)
  2021-04-07 20:24 ` [PATCH 05/11] mm/vmstat: Inline NUMA event counter updates Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-07 20:24 ` [PATCH 07/11] mm/page_alloc: Reduce duration that IRQs are disabled for VM counters Mel Gorman
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

Now that the zone_statistics are simple counters that do not require
special protection, the bulk allocator accounting updates can be
batched without the complexity of protected RMW updates or xchg.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/vmstat.h |  8 ++++++++
 mm/page_alloc.c        | 30 +++++++++++++-----------------
 2 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index dde4dec4e7dd..8473b8fa9756 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -246,6 +246,14 @@ __count_numa_event(struct zone *zone, enum numa_stat_item item)
 	raw_cpu_inc(pzstats->vm_numa_event[item]);
 }
 
+static inline void
+__count_numa_events(struct zone *zone, enum numa_stat_item item, long delta)
+{
+	struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+
+	raw_cpu_add(pzstats->vm_numa_event[item], delta);
+}
+
 extern void __count_numa_event(struct zone *zone, enum numa_stat_item item);
 extern unsigned long sum_zone_node_page_state(int node,
 					      enum zone_stat_item item);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 73e618d06315..defb0e436fac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3411,7 +3411,8 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
  *
  * Must be called with interrupts disabled.
  */
-static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
+static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
+				   long nr_account)
 {
 #ifdef CONFIG_NUMA
 	enum numa_stat_item local_stat = NUMA_LOCAL;
@@ -3424,12 +3425,12 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 		local_stat = NUMA_OTHER;
 
 	if (zone_to_nid(z) == zone_to_nid(preferred_zone))
-		__count_numa_event(z, NUMA_HIT);
+		__count_numa_events(z, NUMA_HIT, nr_account);
 	else {
-		__count_numa_event(z, NUMA_MISS);
-		__count_numa_event(preferred_zone, NUMA_FOREIGN);
+		__count_numa_events(z, NUMA_MISS, nr_account);
+		__count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
 	}
-	__count_numa_event(z, local_stat);
+	__count_numa_events(z, local_stat, nr_account);
 #endif
 }
 
@@ -3475,7 +3476,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
-		zone_statistics(preferred_zone, zone);
+		zone_statistics(preferred_zone, zone, 1);
 	}
 	local_unlock_irqrestore(&pagesets.lock, flags);
 	return page;
@@ -3536,7 +3537,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 				  get_pcppage_migratetype(page));
 
 	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-	zone_statistics(preferred_zone, zone);
+	zone_statistics(preferred_zone, zone, 1);
 	local_irq_restore(flags);
 
 out:
@@ -5019,7 +5020,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	struct alloc_context ac;
 	gfp_t alloc_gfp;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
-	int nr_populated = 0;
+	int nr_populated = 0, nr_account = 0;
 
 	if (unlikely(nr_pages <= 0))
 		return 0;
@@ -5092,15 +5093,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 				goto failed_irq;
 			break;
 		}
-
-		/*
-		 * Ideally this would be batched but the best way to do
-		 * that cheaply is to first convert zone_statistics to
-		 * be inaccurate per-cpu counter like vm_events to avoid
-		 * a RMW cycle then do the accounting with IRQs enabled.
-		 */
-		__count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
-		zone_statistics(ac.preferred_zoneref->zone, zone);
+		nr_account++;
 
 		prep_new_page(page, 0, gfp, 0);
 		if (page_list)
@@ -5110,6 +5103,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
+	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
+	zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
+
 	local_unlock_irqrestore(&pagesets.lock, flags);
 
 	return nr_populated;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 07/11] mm/page_alloc: Reduce duration that IRQs are disabled for VM counters
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
                   ` (5 preceding siblings ...)
  2021-04-07 20:24 ` [PATCH 06/11] mm/page_alloc: Batch the accounting updates in the bulk allocator Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-07 20:24 ` [PATCH 08/11] mm/page_alloc: Remove duplicate checks if migratetype should be isolated Mel Gorman
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

IRQs are left disabled for the zone and node VM event counters. This is
unnecessary as the affected counters are allowed to race with preemption
and IRQs.

This patch reduces the scope of IRQs being disabled
via local_[lock|unlock]_irq on !PREEMPT_RT kernels. One
__mod_zone_freepage_state is still called with IRQs disabled. While this
could be moved out, it's not free on all architectures as some require
IRQs to be disabled for mod_zone_page_state on !PREEMPT_RT kernels.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index defb0e436fac..bd75102ef1e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3474,11 +3474,11 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
 		zone_statistics(preferred_zone, zone, 1);
 	}
-	local_unlock_irqrestore(&pagesets.lock, flags);
 	return page;
 }
 
@@ -3530,15 +3530,15 @@ struct page *rmqueue(struct zone *preferred_zone,
 		if (!page)
 			page = __rmqueue(zone, order, migratetype, alloc_flags);
 	} while (page && check_new_pages(page, order));
-	spin_unlock(&zone->lock);
 	if (!page)
 		goto failed;
+
 	__mod_zone_freepage_state(zone, -(1 << order),
 				  get_pcppage_migratetype(page));
+	spin_unlock_irqrestore(&zone->lock, flags);
 
 	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 	zone_statistics(preferred_zone, zone, 1);
-	local_irq_restore(flags);
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
@@ -3551,7 +3551,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 	return page;
 
 failed:
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&zone->lock, flags);
 	return NULL;
 }
 
@@ -5103,11 +5103,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
+	local_unlock_irqrestore(&pagesets.lock, flags);
+
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
 	zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
 
-	local_unlock_irqrestore(&pagesets.lock, flags);
-
 	return nr_populated;
 
 failed_irq:
-- 
2.26.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 08/11] mm/page_alloc: Remove duplicate checks if migratetype should be isolated
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
                   ` (6 preceding siblings ...)
  2021-04-07 20:24 ` [PATCH 07/11] mm/page_alloc: Reduce duration that IRQs are disabled for VM counters Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-07 20:24 ` [PATCH 09/11] mm/page_alloc: Explicitly acquire the zone lock in __free_pages_ok Mel Gorman
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

Both free_pcppages_bulk() and free_one_page() have very similar
checks about whether a page's migratetype has changed under the
zone lock. Use a common helper.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bd75102ef1e1..1bb5b522a0f9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1354,6 +1354,23 @@ static inline void prefetch_buddy(struct page *page)
 	prefetch(buddy);
 }
 
+/*
+ * The migratetype of a page may have changed due to isolation so check.
+ * Assumes the caller holds the zone->lock to serialise against page
+ * isolation.
+ */
+static inline int
+check_migratetype_isolated(struct zone *zone, struct page *page, unsigned long pfn, int migratetype)
+{
+	/* If isolating, check if the migratetype has changed */
+	if (unlikely(has_isolate_pageblock(zone) ||
+		is_migrate_isolate(migratetype))) {
+		migratetype = get_pfnblock_migratetype(page, pfn);
+	}
+
+	return migratetype;
+}
+
 /*
  * Frees a number of pages from the PCP lists
  * Assumes all pages on list are in same zone, and of same order.
@@ -1371,7 +1388,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int migratetype = 0;
 	int batch_free = 0;
 	int prefetch_nr = READ_ONCE(pcp->batch);
-	bool isolated_pageblocks;
 	struct page *page, *tmp;
 	LIST_HEAD(head);
 
@@ -1433,21 +1449,20 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
 	 */
 	spin_lock(&zone->lock);
-	isolated_pageblocks = has_isolate_pageblock(zone);
 
 	/*
 	 * Use safe version since after __free_one_page(),
 	 * page->lru.next will not point to original list.
 	 */
 	list_for_each_entry_safe(page, tmp, &head, lru) {
+		unsigned long pfn = page_to_pfn(page);
 		int mt = get_pcppage_migratetype(page);
+
 		/* MIGRATE_ISOLATE page should not go to pcplists */
 		VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-		/* Pageblock could have been isolated meanwhile */
-		if (unlikely(isolated_pageblocks))
-			mt = get_pageblock_migratetype(page);
 
-		__free_one_page(page, page_to_pfn(page), zone, 0, mt, FPI_NONE);
+		mt = check_migratetype_isolated(zone, page, pfn, mt);
+		__free_one_page(page, pfn, zone, 0, mt, FPI_NONE);
 		trace_mm_page_pcpu_drain(page, 0, mt);
 	}
 	spin_unlock(&zone->lock);
@@ -1459,10 +1474,7 @@ static void free_one_page(struct zone *zone,
 				int migratetype, fpi_t fpi_flags)
 {
 	spin_lock(&zone->lock);
-	if (unlikely(has_isolate_pageblock(zone) ||
-		is_migrate_isolate(migratetype))) {
-		migratetype = get_pfnblock_migratetype(page, pfn);
-	}
+	migratetype = check_migratetype_isolated(zone, page, pfn, migratetype);
 	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
 	spin_unlock(&zone->lock);
 }
-- 
2.26.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 09/11] mm/page_alloc: Explicitly acquire the zone lock in __free_pages_ok
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
                   ` (7 preceding siblings ...)
  2021-04-07 20:24 ` [PATCH 08/11] mm/page_alloc: Remove duplicate checks if migratetype should be isolated Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-07 20:24 ` [PATCH 10/11] mm/page_alloc: Avoid conflating IRQs disabled with zone->lock Mel Gorman
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

__free_pages_ok() disables IRQs before calling a common helper
free_one_page() that acquires the zone lock. While this is safe, it
unnecessarily disables IRQs on PREEMPT_RT kernels.

This patch explicitly acquires the lock with spin_lock_irqsave instead of
relying on a helper. This removes the last instance of local_irq_save()
in page_alloc.c.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1bb5b522a0f9..d94ec53367bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1559,21 +1559,18 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	unsigned long flags;
 	int migratetype;
 	unsigned long pfn = page_to_pfn(page);
+	struct zone *zone = page_zone(page);
 
 	if (!free_pages_prepare(page, order, true))
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
 
-	/*
-	 * TODO FIX: Disable IRQs before acquiring IRQ-safe zone->lock
-	 * and protect vmstat updates.
-	 */
-	local_irq_save(flags);
+	spin_lock_irqsave(&zone->lock, flags);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, pfn, order, migratetype,
-		      fpi_flags);
-	local_irq_restore(flags);
+	migratetype = check_migratetype_isolated(zone, page, pfn, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 void __free_pages_core(struct page *page, unsigned int order)
-- 
2.26.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 10/11] mm/page_alloc: Avoid conflating IRQs disabled with zone->lock
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
                   ` (8 preceding siblings ...)
  2021-04-07 20:24 ` [PATCH 09/11] mm/page_alloc: Explicitly acquire the zone lock in __free_pages_ok Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-07 20:24 ` [PATCH 11/11] mm/page_alloc: Update PGFREE outside the zone lock in __free_pages_ok Mel Gorman
  2021-04-08 10:56 ` [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Peter Zijlstra
  11 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

Historically when freeing pages, free_one_page() assumed that callers
had IRQs disabled and the zone->lock could be acquired with spin_lock().
This confuses the scope of what local_lock_irq is protecting and what
zone->lock is protecting in free_unref_page_list in particular.

This patch uses spin_lock_irqsave() for the zone->lock in
free_one_page() instead of relying on callers to have disabled
IRQs. free_unref_page_commit() is changed to only deal with PCP pages
protected by the local lock. free_unref_page_list() then first frees
isolated pages to the buddy lists with free_one_page() and frees the rest
of the pages to the PCP via free_unref_page_commit(). The end result
is that free_one_page() no longer depends on side-effects of
local_lock to be correct.

Note that this may incur a performance penalty while memory hot-remove
is running but that is not a common operation.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 67 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 26 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d94ec53367bd..6d98d97b6cf5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1473,10 +1473,12 @@ static void free_one_page(struct zone *zone,
 				unsigned int order,
 				int migratetype, fpi_t fpi_flags)
 {
-	spin_lock(&zone->lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&zone->lock, flags);
 	migratetype = check_migratetype_isolated(zone, page, pfn, migratetype);
 	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
@@ -3238,31 +3240,13 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn)
 	return true;
 }
 
-static void free_unref_page_commit(struct page *page, unsigned long pfn)
+static void free_unref_page_commit(struct page *page, unsigned long pfn,
+				   int migratetype)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
-	int migratetype;
 
-	migratetype = get_pcppage_migratetype(page);
 	__count_vm_event(PGFREE);
-
-	/*
-	 * We only track unmovable, reclaimable and movable on pcp lists.
-	 * Free ISOLATE pages back to the allocator because they are being
-	 * offlined but treat HIGHATOMIC as movable pages so we can get those
-	 * areas back if necessary. Otherwise, we may have to free
-	 * excessively into the page allocator
-	 */
-	if (migratetype >= MIGRATE_PCPTYPES) {
-		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, pfn, 0, migratetype,
-				      FPI_NONE);
-			return;
-		}
-		migratetype = MIGRATE_MOVABLE;
-	}
-
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list_add(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
@@ -3277,12 +3261,29 @@ void free_unref_page(struct page *page)
 {
 	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
+	int migratetype;
 
 	if (!free_unref_page_prepare(page, pfn))
 		return;
 
+	/*
+	 * We only track unmovable, reclaimable and movable on pcp lists.
+	 * Place ISOLATE pages on the isolated list because they are being
+	 * offlined but treat HIGHATOMIC as movable pages so we can get those
+	 * areas back if necessary. Otherwise, we may have to free
+	 * excessively into the page allocator
+	 */
+	migratetype = get_pcppage_migratetype(page);
+	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
+		if (unlikely(is_migrate_isolate(migratetype))) {
+			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
+			return;
+		}
+		migratetype = MIGRATE_MOVABLE;
+	}
+
 	local_lock_irqsave(&pagesets.lock, flags);
-	free_unref_page_commit(page, pfn);
+	free_unref_page_commit(page, pfn, migratetype);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
@@ -3294,6 +3295,7 @@ void free_unref_page_list(struct list_head *list)
 	struct page *page, *next;
 	unsigned long flags, pfn;
 	int batch_count = 0;
+	int migratetype;
 
 	/* Prepare pages for freeing */
 	list_for_each_entry_safe(page, next, list, lru) {
@@ -3301,15 +3303,28 @@ void free_unref_page_list(struct list_head *list)
 		if (!free_unref_page_prepare(page, pfn))
 			list_del(&page->lru);
 		set_page_private(page, pfn);
+
+		/*
+		 * Free isolated pages directly to the allocator, see
+		 * comment in free_unref_page.
+		 */
+		migratetype = get_pcppage_migratetype(page);
+		if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
+			if (unlikely(is_migrate_isolate(migratetype))) {
+				free_one_page(page_zone(page), page, pfn, 0,
+							migratetype, FPI_NONE);
+				list_del(&page->lru);
+			}
+		}
 	}
 
 	local_lock_irqsave(&pagesets.lock, flags);
 	list_for_each_entry_safe(page, next, list, lru) {
-		unsigned long pfn = page_private(page);
-
+		pfn = page_private(page);
 		set_page_private(page, 0);
+		migratetype = get_pcppage_migratetype(page);
 		trace_mm_page_free_batched(page);
-		free_unref_page_commit(page, pfn);
+		free_unref_page_commit(page, pfn, migratetype);
 
 		/*
 		 * Guard against excessive IRQ disabled times when we get
-- 
2.26.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 11/11] mm/page_alloc: Update PGFREE outside the zone lock in __free_pages_ok
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
                   ` (9 preceding siblings ...)
  2021-04-07 20:24 ` [PATCH 10/11] mm/page_alloc: Avoid conflating IRQs disabled with zone->lock Mel Gorman
@ 2021-04-07 20:24 ` Mel Gorman
  2021-04-08 10:56 ` [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Peter Zijlstra
  11 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-07 20:24 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador, Mel Gorman

VM events do not need explicit protection by disabling IRQs so
update the counter with IRQs enabled in __free_pages_ok.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d98d97b6cf5..49951dd841fa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1569,10 +1569,11 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	migratetype = get_pfnblock_migratetype(page, pfn);
 
 	spin_lock_irqsave(&zone->lock, flags);
-	__count_vm_events(PGFREE, 1 << order);
 	migratetype = check_migratetype_isolated(zone, page, pfn, migratetype);
 	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
+
+	__count_vm_events(PGFREE, 1 << order);
 }
 
 void __free_pages_core(struct page *page, unsigned int order)
-- 
2.26.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-07 20:24 ` [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock Mel Gorman
@ 2021-04-08 10:52   ` Peter Zijlstra
  2021-04-08 17:42     ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2021-04-08 10:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Ingo Molnar, Michal Hocko, Oscar Salvador

On Wed, Apr 07, 2021 at 09:24:14PM +0100, Mel Gorman wrote:
> There is a lack of clarity of what exactly local_irq_save/local_irq_restore
> protects in page_alloc.c. It conflates the protection of per-cpu page
> allocation structures with per-cpu vmstat deltas.
> 
> This patch protects the PCP structure using local_lock which for most
> configurations is identical to IRQ enabling/disabling.  The scope of the
> lock is still wider than it should be but this is decreased later.

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index a4393ac27336..106da8fbc72a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h

> @@ -337,6 +338,7 @@ enum zone_watermarks {
>  #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
>  #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
>  
> +/* Fields and list protected by pagesets local_lock in page_alloc.c */
>  struct per_cpu_pages {
>  	int count;		/* number of pages in the list */
>  	int high;		/* high watermark, emptying needed */

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a68bacddcae0..e9e60d1a85d4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -112,6 +112,13 @@ typedef int __bitwise fpi_t;
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_FRACTION	(8)
>  
> +struct pagesets {
> +	local_lock_t lock;
> +};
> +static DEFINE_PER_CPU(struct pagesets, pagesets) = {
> +	.lock = INIT_LOCAL_LOCK(lock),
> +};

So why isn't the local_lock_t in struct per_cpu_pages ? That seems to be
the actual object that is protected by it and is already per-cpu.

Is that because you want to avoid the duplication across zones? Is that
worth the effort?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead
  2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
                   ` (10 preceding siblings ...)
  2021-04-07 20:24 ` [PATCH 11/11] mm/page_alloc: Update PGFREE outside the zone lock in __free_pages_ok Mel Gorman
@ 2021-04-08 10:56 ` Peter Zijlstra
  2021-04-08 17:48   ` Mel Gorman
  11 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2021-04-08 10:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Ingo Molnar, Michal Hocko, Oscar Salvador

On Wed, Apr 07, 2021 at 09:24:12PM +0100, Mel Gorman wrote:
> Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
> 
>    local_irq_disable();
>    raw_spin_lock(&lock);

Almost, the above is actually OK on RT. The problematic one is:

	local_irq_disable();
	spin_lock(&lock);

That doesn't work on RT since spin_lock() turns into a PI-mutex which
then obviously explodes if it tries to block with IRQs disabled.

And it so happens, that's exactly the one at hand.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-08 10:52   ` Peter Zijlstra
@ 2021-04-08 17:42     ` Mel Gorman
  2021-04-09  6:39       ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2021-04-08 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Ingo Molnar, Michal Hocko, Oscar Salvador

On Thu, Apr 08, 2021 at 12:52:07PM +0200, Peter Zijlstra wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a68bacddcae0..e9e60d1a85d4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -112,6 +112,13 @@ typedef int __bitwise fpi_t;
> >  static DEFINE_MUTEX(pcp_batch_high_lock);
> >  #define MIN_PERCPU_PAGELIST_FRACTION	(8)
> >  
> > +struct pagesets {
> > +	local_lock_t lock;
> > +};
> > +static DEFINE_PER_CPU(struct pagesets, pagesets) = {
> > +	.lock = INIT_LOCAL_LOCK(lock),
> > +};
> 
> So why isn't the local_lock_t in struct per_cpu_pages ? That seems to be
> the actual object that is protected by it and is already per-cpu.
> 
> Is that because you want to avoid the duplication across zones? Is that
> worth the effort?

When I wrote the patch, the problem was that zone_pcp_reset freed the
per_cpu_pages structure and it was "protected" by local_irq_save(). If
that was converted to local_lock_irq then the structure containing the
lock would be freed before the lock is released, which is obviously bad.

Much later when trying to make the allocator RT-safe in general, I realised
that locking was broken and fixed it in patch 3 of this series. With that,
the local_lock could potentially be embedded within per_cpu_pages safely
at the end of this series.
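
For illustration only (this is not code from the series and it assumes
per_cpu_pages had an embedded local_lock_t member called 'lock'), the
bad ordering would be roughly:

	local_lock_irq(&zone->per_cpu_pageset->lock);
	free_percpu(zone->per_cpu_pageset);		/* frees the structure holding the lock */
	local_unlock_irq(&zone->per_cpu_pageset->lock);	/* releases a lock in freed memory */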

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead
  2021-04-08 10:56 ` [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Peter Zijlstra
@ 2021-04-08 17:48   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-08 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Ingo Molnar, Michal Hocko, Oscar Salvador

On Thu, Apr 08, 2021 at 12:56:01PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 07, 2021 at 09:24:12PM +0100, Mel Gorman wrote:
> > Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
> > as documented in Documentation/locking/locktypes.rst
> > 
> >    local_irq_disable();
> >    raw_spin_lock(&lock);
> 
> Almost, the above is actually OK on RT. The problematic one is:
> 
> 	local_irq_disable();
> 	spin_lock(&lock);
> 
> That doesn't work on RT since spin_lock() turns into a PI-mutex which
> then obviously explodes if it tries to block with IRQs disabled.
> 
> And it so happens, that's exactly the one at hand.

Ok, I completely messed up the leader because it was local_irq_disable()
+ spin_lock() that I was worried about. Once the series is complete,
it is replaced with

  local_lock_irq(&lock_lock)
  spin_lock(&lock);

According to Documentation/locking/locktypes.rst, that should be safe.
I'll rephrase the justification.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-08 17:42     ` Mel Gorman
@ 2021-04-09  6:39       ` Peter Zijlstra
  2021-04-09  7:59         ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2021-04-09  6:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Ingo Molnar, Michal Hocko, Oscar Salvador

On Thu, Apr 08, 2021 at 06:42:44PM +0100, Mel Gorman wrote:
> On Thu, Apr 08, 2021 at 12:52:07PM +0200, Peter Zijlstra wrote:
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index a68bacddcae0..e9e60d1a85d4 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -112,6 +112,13 @@ typedef int __bitwise fpi_t;
> > >  static DEFINE_MUTEX(pcp_batch_high_lock);
> > >  #define MIN_PERCPU_PAGELIST_FRACTION	(8)
> > >  
> > > +struct pagesets {
> > > +	local_lock_t lock;
> > > +};
> > > +static DEFINE_PER_CPU(struct pagesets, pagesets) = {
> > > +	.lock = INIT_LOCAL_LOCK(lock),
> > > +};
> > 
> > So why isn't the local_lock_t in struct per_cpu_pages ? That seems to be
> > the actual object that is protected by it and is already per-cpu.
> > 
> > Is that because you want to avoid the duplication across zones? Is that
> > worth the effort?
> 
> When I wrote the patch, the problem was that zone_pcp_reset freed the
> per_cpu_pages structure and it was "protected" by local_irq_save(). If
> that was converted to local_lock_irq then the structure containing the
> lock is freed before it is released which is obviously bad.
> 
> Much later when trying to make the allocator RT-safe in general, I realised
> that locking was broken and fixed it in patch 3 of this series. With that,
> the local_lock could potentially be embedded within per_cpu_pages safely
> at the end of this series.

Fair enough; I was just wondering why the obvious solution wasn't chosen
and neither changelog nor comment explain, so I had to ask :-)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-09  6:39       ` Peter Zijlstra
@ 2021-04-09  7:59         ` Mel Gorman
  2021-04-09  8:24           ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2021-04-09  7:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Ingo Molnar, Michal Hocko, Oscar Salvador

On Fri, Apr 09, 2021 at 08:39:45AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 08, 2021 at 06:42:44PM +0100, Mel Gorman wrote:
> > On Thu, Apr 08, 2021 at 12:52:07PM +0200, Peter Zijlstra wrote:
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index a68bacddcae0..e9e60d1a85d4 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -112,6 +112,13 @@ typedef int __bitwise fpi_t;
> > > >  static DEFINE_MUTEX(pcp_batch_high_lock);
> > > >  #define MIN_PERCPU_PAGELIST_FRACTION	(8)
> > > >  
> > > > +struct pagesets {
> > > > +	local_lock_t lock;
> > > > +};
> > > > +static DEFINE_PER_CPU(struct pagesets, pagesets) = {
> > > > +	.lock = INIT_LOCAL_LOCK(lock),
> > > > +};
> > > 
> > > So why isn't the local_lock_t in struct per_cpu_pages ? That seems to be
> > > the actual object that is protected by it and is already per-cpu.
> > > 
> > > Is that because you want to avoid the duplication across zones? Is that
> > > worth the effort?
> > 
> > When I wrote the patch, the problem was that zone_pcp_reset freed the
> > per_cpu_pages structure and it was "protected" by local_irq_save(). If
> > that was converted to local_lock_irq then the structure containing the
> > lock is freed before it is released which is obviously bad.
> > 
> > Much later when trying to make the allocator RT-safe in general, I realised
> > that locking was broken and fixed it in patch 3 of this series. With that,
> > the local_lock could potentially be embedded within per_cpu_pages safely
> > at the end of this series.
> 
> Fair enough; I was just wondering why the obvious solution wasn't chosen
> and neither changelog nor comment explain, so I had to ask :-)

It's a fair question and it was my first approach before I hit problems.
Thinking again this morning, I remembered that another problem I hit was
patterns like this

        local_lock_irqsave(&pagesets.lock, flags);
        pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);

turning into

	cpu = get_cpu();
        pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
        local_lock_irqsave(&pcp->lock, flags);

That has its own problems if zone->lock was acquired within the
local_lock_irqsave section (Section "spinlock_t and rwlock_t" in
Documentation/locking/locktypes.rst) so it has to turn into

	migrate_disable();
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
        local_lock_irqsave(&pcp->lock, flags);

I did not want to start adding migrate_disable() in multiple places like
this because I'm guessing that new users of migrate_disable() need strong
justification and adding such code in page_alloc.c might cause cargo-cult
copy&paste in the future. Maybe it could be addressed with a helper like
this_cpu_local_lock or this_cpu_local_lock_irq but that means in some
cases that the PCP structure is looked up twice with patterns like this one

        local_lock_irqsave(&pagesets.lock, flags);
        free_unref_page_commit(page, pfn, migratetype);
        local_unlock_irqrestore(&pagesets.lock, flags);

To get around multiple lookups, the helper becomes something that disables
migration, looks up the PCP structure, locks it and returns it, with
pcp then passed around as appropriate. Not sure what I would call that
helper :P
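
Purely as a sketch of what I mean (the name pcp_local_lock_irqsave is
made up for illustration and it again assumes an embedded local_lock_t
'lock' in per_cpu_pages), it might look something like:

	/*
	 * Hypothetical helper: pin the task, look up the per-cpu pageset
	 * and take its (assumed) embedded local lock. The unlock side
	 * would need local_unlock_irqrestore() plus migrate_enable().
	 */
	#define pcp_local_lock_irqsave(zone, pcp, flags)		\
	do {								\
		migrate_disable();					\
		(pcp) = this_cpu_ptr((zone)->per_cpu_pageset);		\
		local_lock_irqsave(&(pcp)->lock, flags);		\
	} while (0)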

In the end I just gave up and kept it simple as there is no benefit to
!PREEMPT_RT which just disables IRQs. Maybe it'll be worth considering when
PREEMPT_RT is upstream and can be enabled. The series was functionally
tested on the PREEMPT_RT tree by reverting the page_alloc.c patch and
applies this series and all of its prerequisites on top.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-09  7:59         ` Mel Gorman
@ 2021-04-09  8:24           ` Peter Zijlstra
  2021-04-09 13:32             ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2021-04-09  8:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Ingo Molnar, Michal Hocko, Oscar Salvador

On Fri, Apr 09, 2021 at 08:59:39AM +0100, Mel Gorman wrote:
> In the end I just gave up and kept it simple as there is no benefit to
> !PREEMPT_RT which just disables IRQs. Maybe it'll be worth considering when
> PREEMPT_RT is upstream and can be enabled. The series was functionally
> tested on the PREEMPT_RT tree by reverting the page_alloc.c patch and
> applies this series and all of its prerequisites on top.

Right, I see the problem. Fair enough; perhaps amend the changelog to
include some of that so that we can 'remember' in a few months why the
code is 'funneh'.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-09  8:24           ` Peter Zijlstra
@ 2021-04-09 13:32             ` Mel Gorman
  2021-04-09 18:55               ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2021-04-09 13:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Ingo Molnar, Michal Hocko, Oscar Salvador

On Fri, Apr 09, 2021 at 10:24:24AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 08:59:39AM +0100, Mel Gorman wrote:
> > In the end I just gave up and kept it simple as there is no benefit to
> > !PREEMPT_RT which just disables IRQs. Maybe it'll be worth considering when
> > PREEMPT_RT is upstream and can be enabled. The series was functionally
> > tested on the PREEMPT_RT tree by reverting the page_alloc.c patch and
> > applies this series and all of its prerequisites on top.
> 
> Right, I see the problem. Fair enough; perhaps ammend the changelog to
> include some of that so that we can 'remember' in a few months why the
> code is 'funneh'.
> 

I updated the changelog and also added a comment above the
declaration. That said, there are some curious users already.
fs/squashfs/decompressor_multi_percpu.c looks like it always uses the
local_lock in CPU 0's per-cpu structure instead of stabilising a per-cpu
pointer. drivers/block/zram/zcomp.c appears to do the same although for
at least one of the zcomp_stream_get() callers, the CPU is pinned for
other reasons (bit spin lock held). I think it happens to work anyway
but it's weird and I'm not a fan.

Anyway, new version looks like is below.

-- 
[PATCH] mm/page_alloc: Convert per-cpu list protection to local_lock

There is a lack of clarity of what exactly local_irq_save/local_irq_restore
protects in page_alloc.c . It conflates the protection of per-cpu page
allocation structures with per-cpu vmstat deltas.

This patch protects the PCP structure using local_lock which for most
configurations is identical to IRQ enabling/disabling.  The scope of the
lock is still wider than it should be but this is decreased later.

local_lock is deliberately declared statically instead of being placed
within a structure. Placing it in the zone offers limited benefit and
confuses what the lock is protecting -- struct per_cpu_pages. However,
putting it in per_cpu_pages is problematic because the task is not guaranteed
to be pinned to the CPU yet so looking up a per-cpu structure is unsafe.

[lkp@intel.com: Make pagesets static]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 67 +++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 54 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4393ac27336..106da8fbc72a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -20,6 +20,7 @@
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
 #include <linux/page-flags.h>
+#include <linux/local_lock.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -337,6 +338,7 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
+/* Fields and list protected by pagesets local_lock in page_alloc.c */
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bc4da4cbf9c..04644c3dd187 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -112,6 +112,30 @@ typedef int __bitwise fpi_t;
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_FRACTION	(8)
 
+/*
+ * Protects the per_cpu_pages structures.
+ *
+ * This lock is not placed in struct per_cpu_pages because the task acquiring
+ * the lock is not guaranteed to be pinned to the CPU yet due to
+ * preempt/migrate/IRQs disabled or holding a spinlock. The pattern to acquire
+ * the lock would become
+ *
+ *   migrate_disable();
+ *   pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ *   local_lock_irqsave(&pcp->lock, flags);
+ *
+ * While a helper would avoid code duplication, there is no inherent advantage
+ * and migrate_disable itself is undesirable (see include/linux/preempt.h).
+ * Similarly, putting the lock in the zone offers no particular benefit but
+ * confuses what the lock is protecting.
+ */
+struct pagesets {
+	local_lock_t lock;
+};
+static DEFINE_PER_CPU(struct pagesets, pagesets) = {
+	.lock = INIT_LOCAL_LOCK(lock),
+};
+
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1421,6 +1445,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
+	/*
+	 * local_lock_irq held so equivalent to spin_lock_irqsave for
+	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
+	 */
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 
@@ -1541,6 +1569,11 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
+
+	/*
+	 * TODO FIX: Disable IRQs before acquiring IRQ-safe zone->lock
+	 * and protect vmstat updates.
+	 */
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, pfn, order, migratetype,
@@ -2910,6 +2943,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 {
 	int i, allocated = 0;
 
+	/*
+	 * local_lock_irq held so equivalent to spin_lock_irqsave for
+	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
+	 */
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
@@ -2962,12 +2999,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	unsigned long flags;
 	int to_drain, batch;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
 	if (to_drain > 0)
 		free_pcppages_bulk(zone, to_drain, pcp);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 #endif
 
@@ -2983,13 +3020,13 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	unsigned long flags;
 	struct per_cpu_pages *pcp;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 	if (pcp->count)
 		free_pcppages_bulk(zone, pcp->count, pcp);
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3252,9 +3289,9 @@ void free_unref_page(struct page *page)
 	if (!free_unref_page_prepare(page, pfn))
 		return;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	free_unref_page_commit(page, pfn);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3274,7 +3311,7 @@ void free_unref_page_list(struct list_head *list)
 		set_page_private(page, pfn);
 	}
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_private(page);
 
@@ -3287,12 +3324,12 @@ void free_unref_page_list(struct list_head *list)
 		 * a large list of pages to free.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
-			local_irq_restore(flags);
+			local_unlock_irqrestore(&pagesets.lock, flags);
 			batch_count = 0;
-			local_irq_save(flags);
+			local_lock_irqsave(&pagesets.lock, flags);
 		}
 	}
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3449,7 +3486,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct page *page;
 	unsigned long flags;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
@@ -3457,7 +3494,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
 		zone_statistics(preferred_zone, zone);
 	}
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 	return page;
 }
 
@@ -5052,7 +5089,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		goto failed;
 
 	/* Attempt the batch allocation */
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp_list = &pcp->lists[ac.migratetype];
 
@@ -5090,12 +5127,12 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 
 	return nr_populated;
 
 failed_irq:
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 
 failed:
 	page = __alloc_pages(gfp, 0, preferred_nid, nodemask);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-09 13:32             ` Mel Gorman
@ 2021-04-09 18:55               ` Peter Zijlstra
  2021-04-12 11:56                 ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2021-04-09 18:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Ingo Molnar, Michal Hocko, Oscar Salvador

On Fri, Apr 09, 2021 at 02:32:56PM +0100, Mel Gorman wrote:
> That said, there are some curious users already.
> fs/squashfs/decompressor_multi_percpu.c looks like it always uses the
> local_lock in CPU 0's per-cpu structure instead of stabilising a per-cpu
> pointer. 

I'm not sure how you read that.

You're talking about this:

  local_lock(&msblk->stream->lock);

right? Note that msblk->stream is a per-cpu pointer, so
&msblk->stream->lock is that same per-cpu pointer with an offset on.

The whole thing relies on:

	&per_cpu_ptr(msblk->stream, cpu)->lock == per_cpu_ptr(&msblk->stream->lock, cpu)

Which is true because the lhs:

	(local_lock_t *)((msblk->stream + per_cpu_offset(cpu)) + offsetof(struct squashfs_stream, lock))

and the rhs:

	(local_lock_t *)((msblk->stream + offsetof(struct squashfs_stream, lock)) + per_cpu_offset(cpu))

are identical, because addition is associative.
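
Or, restated as a purely illustrative snippet (ignoring the __percpu
annotations sparse would want), both pointers below name the same lock
for a given CPU:

	local_lock_t *a = &per_cpu_ptr(msblk->stream, cpu)->lock;
	local_lock_t *b = per_cpu_ptr(&msblk->stream->lock, cpu);

	WARN_ON(a != b);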

> drivers/block/zram/zcomp.c appears to do the same although for
> at least one of the zcomp_stream_get() callers, the CPU is pinned for
> other reasons (bit spin lock held). I think it happens to work anyway
> but it's weird and I'm not a fan.

Same thing.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-09 18:55               ` Peter Zijlstra
@ 2021-04-12 11:56                 ` Mel Gorman
  2021-04-12 21:47                   ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2021-04-12 11:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Ingo Molnar, Michal Hocko, Oscar Salvador

On Fri, Apr 09, 2021 at 08:55:39PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 02:32:56PM +0100, Mel Gorman wrote:
> > That said, there are some curious users already.
> > fs/squashfs/decompressor_multi_percpu.c looks like it always uses the
> > local_lock in CPU 0's per-cpu structure instead of stabilising a per-cpu
> > pointer. 
> 
> I'm not sure how you read that.
> 
> You're talking about this:
> 
>   local_lock(&msblk->stream->lock);
> 
> right? Note that msblk->stream is a per-cpu pointer, so
> &msblk->stream->lock is that same per-cpu pointer with an offset on.
> 
> The whole thing relies on:
> 
> 	&per_cpu_ptr(msblk->stream, cpu)->lock == per_cpu_ptr(&msblk->stream->lock, cpu)
> 
> Which is true because the lhs:
> 
> 	(local_lock_t *)((msblk->stream + per_cpu_offset(cpu)) + offsetof(struct squashfs_stream, lock))
> 
> and the rhs:
> 
> 	(local_lock_t *)((msblk->stream + offsetof(struct squashfs_stream, lock)) + per_cpu_offset(cpu))
> 
> are identical, because addition is associative.
> 

Ok, I think I see and understand now. I didn't follow far enough down
into the macro magic and missed this observation, so thanks for your
patience. The page allocator still incurs a double lookup of the per
cpu offsets but it should work for both the current local_lock_irq
implementation and the one in preempt-rt because the task will be pinned
to the CPU by either preempt_disable, migrate_disable or IRQ disable
depending on the local_lock implementation and kernel configuration.

I'll update the changelog and comment accordingly. I'll decide later
whether to leave it or move the location of the lock at the end of the
series. If the patch is added, it'll either incur the double lookup (not
that expensive, might be optimised by the compiler) or come up with a
helper that takes the lock and returns the per-cpu structure. The double
lookup probably makes more sense initially because there are multiple
potential users of a helper that says "pin to CPU, lookup, lock and return
a per-cpu structure" for both IRQ-safe and IRQ-unsafe variants with the
associated expansion of the local_lock API. It might be better to introduce
such a helper with multiple users converted at the same time and there are
other local_lock users in preempt-rt that could do with upstreaming first.

> > drivers/block/zram/zcomp.c appears to do the same although for
> > at least one of the zcomp_stream_get() callers, the CPU is pinned for
> > other reasons (bit spin lock held). I think it happens to work anyway
> > but it's weird and I'm not a fan.
> 
> Same thing.

Yep.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 01/11] mm/page_alloc: Split per cpu page lists and zone stats
  2021-04-07 20:24 ` [PATCH 01/11] mm/page_alloc: Split per cpu page lists and zone stats Mel Gorman
@ 2021-04-12 17:43   ` Vlastimil Babka
  2021-04-13 13:27     ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2021-04-12 17:43 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador

On 4/7/21 10:24 PM, Mel Gorman wrote:
> @@ -6691,7 +6697,7 @@ static __meminit void zone_pcp_init(struct zone *zone)
>  	 * relies on the ability of the linker to provide the
>  	 * offset of a (static) per cpu variable into the per cpu area.
>  	 */
> -	zone->pageset = &boot_pageset;
> +	zone->per_cpu_pageset = &boot_pageset;

I don't see any &boot_zonestats assignment here in zone_pcp_init() or its
caller(s), which seems strange, as zone_pcp_reset() does it.

>  	zone->pageset_high = BOOT_PAGESET_HIGH;
>  	zone->pageset_batch = BOOT_PAGESET_BATCH;
>  
> @@ -8954,17 +8960,19 @@ void zone_pcp_reset(struct zone *zone)
>  {
>  	unsigned long flags;
>  	int cpu;
> -	struct per_cpu_pageset *pset;
> +	struct per_cpu_zonestat *pzstats;
>  
>  	/* avoid races with drain_pages()  */
>  	local_irq_save(flags);
> -	if (zone->pageset != &boot_pageset) {
> +	if (zone->per_cpu_pageset != &boot_pageset) {
>  		for_each_online_cpu(cpu) {
> -			pset = per_cpu_ptr(zone->pageset, cpu);
> -			drain_zonestat(zone, pset);
> +			pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
> +			drain_zonestat(zone, pzstats);
>  		}
> -		free_percpu(zone->pageset);
> -		zone->pageset = &boot_pageset;
> +		free_percpu(zone->per_cpu_pageset);
> +		free_percpu(zone->per_cpu_zonestats);
> +		zone->per_cpu_pageset = &boot_pageset;
> +		zone->per_cpu_zonestats = &boot_zonestats;

^ here

>  	}
>  	local_irq_restore(flags);
>  }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-12 11:56                 ` Mel Gorman
@ 2021-04-12 21:47                   ` Thomas Gleixner
  2021-04-13 16:52                     ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2021-04-12 21:47 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Ingo Molnar,
	Michal Hocko, Oscar Salvador

On Mon, Apr 12 2021 at 12:56, Mel Gorman wrote:
> On Fri, Apr 09, 2021 at 08:55:39PM +0200, Peter Zijlstra wrote:
> I'll update the changelog and comment accordingly. I'll decide later
> whether to leave it or move the location of the lock at the end of the
> series. If the patch is added, it'll either incur the double lookup (not
> that expensive, might be optimised by the compiler) or come up with a
> helper that takes the lock and returns the per-cpu structure. The double
> lookup probably makes more sense initially because there are multiple
> potential users of a helper that says "pin to CPU, lookup, lock and return
> a per-cpu structure" for both IRQ-safe and IRQ-unsafe variants with the
> associated expansion of the local_lock API. It might be better to introduce
> such a helper with multiple users converted at the same time and there are
> other local_lock users in preempt-rt that could do with upstreaming first.

We had such helpers in RT a while ago but it turned into a helper
explosion pretty fast. But that was one of the early versions of local
locks which could not be embedded into a per CPU data structure due to
raisins (my stupidity).

But with the more thought out approach of today we can have (+/- the
obligatory naming bikeshedding):

--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,4 +51,35 @@
 #define local_unlock_irqrestore(lock, flags)			\
 	__local_unlock_irqrestore(lock, flags)
 
+/**
+ * local_lock_get_cpu_ptr - Acquire a per CPU local lock and return
+ *			    a pointer to the per CPU data which
+ *			    contains the local lock.
+ * @pcp:	Per CPU data structure
+ * @lock:	The local lock member of @pcp
+ */
+#define local_lock_get_cpu_ptr(pcp, lock)			\
+	__local_lock_get_cpu_ptr(pcp, typeof(*(pcp)), lock)
+
+/**
+ * local_lock_irq_get_cpu_ptr - Acquire a per CPU local lock, disable
+ *				interrupts and return a pointer to the
+ *				per CPU data which contains the local lock.
+ * @pcp:	Per CPU data structure
+ * @lock:	The local lock member of @pcp
+ */
+#define local_lock_irq_get_cpu_ptr(pcp, lock)			\
+	__local_lock_irq_get_cpu_ptr(pcp, typeof(*(pcp)), lock)
+
+/**
+ * local_lock_irqsave_get_cpu_ptr - Acquire a per CPU local lock, save and
+ *				    disable interrupts and return a pointer to
+ *				    the CPU data which contains the local lock.
+ * @pcp:	Per CPU data structure
+ * @lock:	The local lock member of @pcp
+ * @flags:	Storage for interrupt flags
+ */
+#define local_lock_irqsave_get_cpu_ptr(pcp, lock, flags)	\
+	__local_lock_irqsave_get_cpu_ptr(pcp, typeof(*(pcp)), lock, flags)
+
 #endif
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -91,3 +91,33 @@ static inline void local_lock_release(lo
 		local_lock_release(this_cpu_ptr(lock));		\
 		local_irq_restore(flags);			\
 	} while (0)
+
+#define __local_lock_get_cpu_ptr(pcp, type, lock)		\
+	({							\
+		type *__pcp;					\
+								\
+		preempt_disable();				\
+		__pcp = this_cpu_ptr(pcp);			\
+		local_lock_acquire(&__pcp->lock);		\
+		__pcp;						\
+	})
+
+#define __local_lock_irq_get_cpu_ptr(pcp, type, lock)		\
+	({							\
+		type *__pcp;					\
+								\
+		local_irq_disable();				\
+		__pcp = this_cpu_ptr(pcp);			\
+		local_lock_acquire(&__pcp->lock);		\
+		__pcp;						\
+	})
+
+#define __local_lock_irqsave_get_cpu_ptr(pcp, type, lock, flags)\
+	({							\
+		type *__pcp;					\
+								\
+		local_irq_save(flags);				\
+		__pcp = this_cpu_ptr(pcp);			\
+		local_lock_acquire(&__pcp->lock);		\
+		__pcp;						\
+	})


and RT will then change that to:

--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -96,7 +96,7 @@ static inline void local_lock_release(lo
 	({							\
 		type *__pcp;					\
 								\
-		preempt_disable();				\
+		ll_preempt_disable();				\
 		__pcp = this_cpu_ptr(pcp);			\
 		local_lock_acquire(&__pcp->lock);		\
 		__pcp;						\
@@ -106,7 +106,7 @@ static inline void local_lock_release(lo
 	({							\
 		type *__pcp;					\
 								\
-		local_irq_disable();				\
+		ll_local_irq_disable();				\
 		__pcp = this_cpu_ptr(pcp);			\
 		local_lock_acquire(&__pcp->lock);		\
 		__pcp;						\
@@ -116,7 +116,7 @@ static inline void local_lock_release(lo
 	({							\
 		type *__pcp;					\
 								\
-		local_irq_save(flags);				\
+		ll_local_irq_save(flags);			\
 		__pcp = this_cpu_ptr(pcp);			\
 		local_lock_acquire(&__pcp->lock);		\
 		__pcp;						\


where ll_xxx is defined as xxx for non-RT and on RT all of them
get mapped to migrate_disable().

Thoughts?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 01/11] mm/page_alloc: Split per cpu page lists and zone stats
  2021-04-12 17:43   ` Vlastimil Babka
@ 2021-04-13 13:27     ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-13 13:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Michal Hocko, Oscar Salvador

On Mon, Apr 12, 2021 at 07:43:18PM +0200, Vlastimil Babka wrote:
> On 4/7/21 10:24 PM, Mel Gorman wrote:
> > @@ -6691,7 +6697,7 @@ static __meminit void zone_pcp_init(struct zone *zone)
> >  	 * relies on the ability of the linker to provide the
> >  	 * offset of a (static) per cpu variable into the per cpu area.
> >  	 */
> > -	zone->pageset = &boot_pageset;
> > +	zone->per_cpu_pageset = &boot_pageset;
> 
> I don't see any &boot_zonestats assignment here in zone_pcp_init() or its
> caller(s), which seems strange, as zone_pcp_reset() does it.
> 

Yes, it's required, well spotted!
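
Presumably the fix just mirrors the boot_pageset assignment in
zone_pcp_init(); a sketch of the likely hunk (not posted in this thread):

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ static __meminit void zone_pcp_init(struct zone *zone)
 	zone->per_cpu_pageset = &boot_pageset;
+	zone->per_cpu_zonestats = &boot_zonestats;
 	zone->pageset_high = BOOT_PAGESET_HIGH;
 	zone->pageset_batch = BOOT_PAGESET_BATCH;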

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-12 21:47                   ` Thomas Gleixner
@ 2021-04-13 16:52                     ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-13 16:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Ingo Molnar,
	Michal Hocko, Oscar Salvador

On Mon, Apr 12, 2021 at 11:47:00PM +0200, Thomas Gleixner wrote:
> On Mon, Apr 12 2021 at 12:56, Mel Gorman wrote:
> > On Fri, Apr 09, 2021 at 08:55:39PM +0200, Peter Zijlstra wrote:
> > I'll update the changelog and comment accordingly. I'll decide later
> > whether to leave it or move the location of the lock at the end of the
> > series. If the patch is added, it'll either incur the double lookup (not
> > that expensive, might be optimised by the compiler) or come up with a
> > helper that takes the lock and returns the per-cpu structure. The double
> > lookup probably makes more sense initially because there are multiple
> > potential users of a helper that says "pin to CPU, lookup, lock and return
> > a per-cpu structure" for both IRQ-safe and IRQ-unsafe variants with the
> > associated expansion of the local_lock API. It might be better to introduce
> > such a helper with multiple users converted at the same time and there are
> > other local_lock users in preempt-rt that could do with upstreaming first.
> 
> We had such helpers in RT a while ago but it turned into a helper
> explosion pretty fast. But that was one of the early versions of local
> locks which could not be embedded into a per CPU data structure due to
> raisins (my stupidity).
> 
> But with the more thought out approach of today we can have (+/- the
> obligatory naming bikeshedding):
> 
> <SNIP>

I don't have strong opinions on the name -- it's long but it's clear.
The overhead of local_lock_get_cpu_ptr has similar weight to get_cpu_ptr
in terms of the cost of preempt_disable. The helper also means that new
users of a local_lock embedded within a per-cpu structure do not have to
figure out if it's safe from scratch.

If the page allocator embeds local_lock within struct per_cpu_pages then
the conversion to the helper is at the end of the mail. The messiest part
is free_unref_page_commit, because free_unref_page_list has to check
whether a new lock is required in case a list of pages spans different
zones.

> <SNIP>
>
> and RT will then change that to:
> 
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -96,7 +96,7 @@ static inline void local_lock_release(lo
>  	({							\
>  		type *__pcp;					\
>  								\
> -		preempt_disable();				\
> +		ll_preempt_disable();				\
>  		__pcp = this_cpu_ptr(pcp);			\
>  		local_lock_acquire(&__pcp->lock);		\
>  		__pcp;						\
> @@ -106,7 +106,7 @@ static inline void local_lock_release(lo
>  	({							\
>  		type *__pcp;					\
>  								\
> -		local_irq_disable();				\
> +		ll_local_irq_disable();				\
>  		__pcp = this_cpu_ptr(pcp);			\
>  		local_lock_acquire(&__pcp->lock);		\
>  		__pcp;						\
> @@ -116,7 +116,7 @@ static inline void local_lock_release(lo
>  	({							\
>  		type *__pcp;					\
>  								\
> -		local_irq_save(flags);				\
> +		ll_local_irq_save(flags);			\
>  		__pcp = this_cpu_ptr(pcp);			\
>  		local_lock_acquire(&__pcp->lock);		\
>  		__pcp;						\
> 
> 
> where ll_xxx is defined as xxx for non-RT and on RT all of them
> get mapped to migrate_disable().
> 
> Thoughts?
> 

I think that works. I created the obvious definitions of ll_* and rebased
on top of preempt-rt. I'll see if it boots :P
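
For reference, the obvious definitions are just the underlying operations
for non-RT, with the RT side collapsing to migrate_disable() as you
describe; roughly (sketch only, the RT flags handling and the matching
unlock-side helpers are glossed over):

#ifdef CONFIG_PREEMPT_RT
# define ll_preempt_disable()		migrate_disable()
# define ll_local_irq_disable()		migrate_disable()
# define ll_local_irq_save(flags)	do { (void)(flags); migrate_disable(); } while (0)
#else
# define ll_preempt_disable()		preempt_disable()
# define ll_local_irq_disable()		local_irq_disable()
# define ll_local_irq_save(flags)	local_irq_save(flags)
#endif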

Page allocator conversion to helper looks like

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d9d7f6d68243..2948a5502589 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3008,9 +3008,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	unsigned long flags;
 	struct per_cpu_pages *pcp;
 
-	local_lock_irqsave(&zone->per_cpu_pageset->lock, flags);
-
-	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+	pcp = local_lock_irqsave_get_cpu_ptr(zone->per_cpu_pageset, lock, flags);
 	if (pcp->count)
 		free_pcppages_bulk(zone, pcp->count, pcp);
 
@@ -3235,12 +3233,10 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn)
 }
 
 static void free_unref_page_commit(struct page *page, struct zone *zone,
-				   unsigned long pfn, int migratetype)
+				   struct per_cpu_pages *pcp, unsigned long pfn,
+				   int migratetype)
 {
-	struct per_cpu_pages *pcp;
-
 	__count_vm_event(PGFREE);
-	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list_add(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
 	if (pcp->count >= READ_ONCE(pcp->high))
@@ -3252,6 +3248,7 @@ static void free_unref_page_commit(struct page *page, struct zone *zone,
  */
 void free_unref_page(struct page *page)
 {
+	struct per_cpu_pages *pcp;
 	struct zone *zone;
 	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
@@ -3277,8 +3274,8 @@ void free_unref_page(struct page *page)
 	}
 
 	zone = page_zone(page);
-	local_lock_irqsave(&zone->per_cpu_pageset->lock, flags);
-	free_unref_page_commit(page, zone, pfn, migratetype);
+	pcp = local_lock_irqsave_get_cpu_ptr(zone->per_cpu_pageset, lock, flags);
+	free_unref_page_commit(page, zone, pcp, pfn, migratetype);
 	local_unlock_irqrestore(&zone->per_cpu_pageset->lock, flags);
 }
 
@@ -3287,6 +3284,7 @@ void free_unref_page(struct page *page)
  */
 void free_unref_page_list(struct list_head *list)
 {
+	struct per_cpu_pages *pcp;
 	struct zone *locked_zone;
 	struct page *page, *next;
 	unsigned long flags, pfn;
@@ -3320,7 +3318,7 @@ void free_unref_page_list(struct list_head *list)
 	/* Acquire the lock required for the first page. */
 	page = list_first_entry(list, struct page, lru);
 	locked_zone = page_zone(page);
-	local_lock_irqsave(&locked_zone->per_cpu_pageset->lock, flags);
+	pcp = local_lock_irqsave_get_cpu_ptr(locked_zone->per_cpu_pageset, lock, flags);
 
 	list_for_each_entry_safe(page, next, list, lru) {
 		struct zone *zone = page_zone(page);
@@ -3342,12 +3340,12 @@ void free_unref_page_list(struct list_head *list)
 #if defined(CONFIG_PREEMPT_RT) || defined(CONFIG_DEBUG_LOCK_ALLOC)
 		if (locked_zone != zone) {
 			local_unlock_irqrestore(&locked_zone->per_cpu_pageset->lock, flags);
-			local_lock_irqsave(&zone->per_cpu_pageset->lock, flags);
+			pcp = local_lock_irqsave_get_cpu_ptr(zone->per_cpu_pageset, lock, flags);
 			locked_zone = zone;
 		}
 #endif
 
-		free_unref_page_commit(page, zone, pfn, migratetype);
+		free_unref_page_commit(page, zone, pcp, pfn, migratetype);
 
 		/*
 		 * Guard against excessive IRQ disabled times when we get
@@ -3517,8 +3515,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct page *page;
 	unsigned long flags;
 
-	local_lock_irqsave(&zone->per_cpu_pageset->lock, flags);
-	pcp = this_cpu_ptr(zone->per_cpu_pageset);
+	pcp = local_lock_irqsave_get_cpu_ptr(zone->per_cpu_pageset, lock, flags);
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
 	local_unlock_irqrestore(&zone->per_cpu_pageset->lock, flags);



-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters
  2021-04-07 20:24 ` [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters Mel Gorman
@ 2021-04-14 12:56   ` Vlastimil Babka
  2021-04-14 15:18     ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2021-04-14 12:56 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Matthew Wilcox,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	Oscar Salvador

On 4/7/21 10:24 PM, Mel Gorman wrote:
> NUMA statistics are maintained on the zone level for hits, misses, foreign
> etc but nothing relies on them being perfectly accurate for functional
> correctness. The counters are used by userspace to get a general overview
> of a workload's NUMA behaviour but the page allocator incurs a high cost to
> maintain perfect accuracy similar to what is required for a vmstat like
> NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to
> turn off the collection of NUMA statistics like NUMA_HIT.
> 
> This patch converts NUMA_HIT and friends to be NUMA events with similar
> accuracy to VM events. There is a possibility that slight errors will be
> introduced but the overall trend as seen by userspace will be similar.
> Note that while these counters could be maintained at the node level,
> doing so would have a user-visible impact.

I guess this kind of inaccuracy is fine. I just don't much like
fold_vm_zone_numa_events() which seems to calculate sums of percpu counters and
then assign the result to zone counters for immediate consumption, which differs
from other kinds of folds in vmstat that reset the percpu counters to 0 as they
are treated as diffs to the global counters.

So it seems that this intermediate assignment to zone counters (using
atomic_long_set() even) is unnecessary and this could mimic sum_vm_events() that
just does the summation on a local array?

And probably a bit more serious is that vm_events have vm_events_fold_cpu() to
deal with a cpu going away, but after your patch the stats counted on a cpu just
disappear from the sums as it goes offline as there's no such thing for the numa
counters.

Thanks,
Vlastimil

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters
  2021-04-14 12:56   ` Vlastimil Babka
@ 2021-04-14 15:18     ` Mel Gorman
  2021-04-14 15:56       ` Vlastimil Babka
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2021-04-14 15:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Michal Hocko, Oscar Salvador

On Wed, Apr 14, 2021 at 02:56:45PM +0200, Vlastimil Babka wrote:
> On 4/7/21 10:24 PM, Mel Gorman wrote:
> > NUMA statistics are maintained on the zone level for hits, misses, foreign
> > etc but nothing relies on them being perfectly accurate for functional
> > correctness. The counters are used by userspace to get a general overview
> > of a workload's NUMA behaviour but the page allocator incurs a high cost to
> > maintain perfect accuracy similar to what is required for a vmstat like
> > NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to
> > turn off the collection of NUMA statistics like NUMA_HIT.
> > 
> > This patch converts NUMA_HIT and friends to be NUMA events with similar
> > accuracy to VM events. There is a possibility that slight errors will be
> > introduced but the overall trend as seen by userspace will be similar.
> > Note that while these counters could be maintained at the node level,
> > doing so would have a user-visible impact.
> 
> I guess this kind of inaccuracy is fine. I just don't much like
> fold_vm_zone_numa_events() which seems to calculate sums of percpu counters and
> then assign the result to zone counters for immediate consumption, which differs
> from other kinds of folds in vmstat that reset the percpu counters to 0 as they
> are treated as diffs to the global counters.
> 

The counters that are diffs fit inside an s8 and they are kept limited
because their "true" value is sometimes critical -- e.g. NR_FREE_PAGES
for watermark checking. So the level of drift has to be controlled and
the drift should not exist potentially forever so it gets updated
periodically.
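
For illustration, the diff scheme is roughly the following (sketch only,
not the actual vmstat code and with made-up names): each CPU accumulates
a small signed delta and folds it into the global counter once a
threshold is crossed, which is what keeps the per-CPU drift bounded.

	struct pcp_counter {
		s8 diff;		/* per-CPU delta, bounded by threshold */
		s8 threshold;
	};

	/* Caller is expected to have the per-CPU structure stabilised. */
	static void mod_counter(atomic_long_t *global, struct pcp_counter *pc, int delta)
	{
		int d = pc->diff + delta;

		if (d > pc->threshold || d < -pc->threshold) {
			atomic_long_add(d, global);	/* fold the bounded drift */
			d = 0;
		}
		pc->diff = d;
	}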

The inaccurate counters are only exported to userspace. There is no need
to update them every few seconds so fold_vm_zone_numa_events() is only
called when a user cares, but you raise a valid point below.

> So it seems that this intermediate assignment to zone counters (using
> atomic_long_set() even) is unnecessary and this could mimic sum_vm_events() that
> just does the summation on a local array?
> 

The atomic is unnecessary for sure but using a local array is
problematic because of your next point.

> And probably a bit more serious is that vm_events have vm_events_fold_cpu() to
> deal with a cpu going away, but after your patch the stats counted on a cpu just
> disappear from the sums as it goes offline as there's no such thing for the numa
> counters.
> 

That is a problem I missed. Even if zonestats was preserved on
hot-remove, fold_vm_zone_numa_events would not be reading the CPU so
hotplug events jump all over the place.

So some periodic folding is necessary. I would still prefer not to do it
by time but it could be done only on overflow or when a file like
/proc/vmstat is read. I'll think about it a bit more and see what I come
up with.

Thanks!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters
  2021-04-14 15:18     ` Mel Gorman
@ 2021-04-14 15:56       ` Vlastimil Babka
  2021-04-15 10:06         ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2021-04-14 15:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Michal Hocko, Oscar Salvador

On 4/14/21 5:18 PM, Mel Gorman wrote:
> On Wed, Apr 14, 2021 at 02:56:45PM +0200, Vlastimil Babka wrote:
>> So it seems that this intermediate assignment to zone counters (using
>> atomic_long_set() even) is unnecessary and this could mimic sum_vm_events() that
>> just does the summation on a local array?
>> 
> 
> The atomic is unnecessary for sure but using a local array is
> problematic because of your next point.

IIUC vm_events seems to do fine without a centralized array and handling CPU hot
remove at the same time ...

>> And probably a bit more serious is that vm_events have vm_events_fold_cpu() to
>> deal with a cpu going away, but after your patch the stats counted on a cpu just
>> disappear from the sums as it goes offline as there's no such thing for the numa
>> counters.
>> 
> 
> That is a problem I missed. Even if zonestats was preserved on
> hot-remove, fold_vm_zone_numa_events would not be reading the CPU so
> hotplug events jump all over the place.
> 
> So some periodic folding is necessary. I would still prefer not to do it
> by time but it could be done only on overflow or when a file like
> /proc/vmstat is read. I'll think about it a bit more and see what I come
> up with.

... because vm_events_fold_cpu() seems to simply move the stats from the CPU
being offlined to the current one. So the same approach should be enough for
NUMA stats?

> Thanks!
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters
  2021-04-14 15:56       ` Vlastimil Babka
@ 2021-04-15 10:06         ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-15 10:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linux-MM, Linux-RT-Users, LKML, Chuck Lever,
	Jesper Dangaard Brouer, Matthew Wilcox, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Michal Hocko, Oscar Salvador

On Wed, Apr 14, 2021 at 05:56:53PM +0200, Vlastimil Babka wrote:
> On 4/14/21 5:18 PM, Mel Gorman wrote:
> > On Wed, Apr 14, 2021 at 02:56:45PM +0200, Vlastimil Babka wrote:
> >> So it seems that this intermediate assignment to zone counters (using
> >> atomic_long_set() even) is unnecessary and this could mimic sum_vm_events() that
> >> just does the summation on a local array?
> >> 
> > 
> > The atomic is unnecessary for sure but using a local array is
> > problematic because of your next point.
> 
> IIUC vm_events seems to do fine without a centralized array and handling CPU hot
> remove at the same time ...
> 

The vm_events are more global in nature. They are not reported
to userspace on a per-zone (/proc/zoneinfo) basis or per-node
(/sys/devices/system/node/node*/numastat) basis so they are not equivalent.

> >> And probably a bit more serious is that vm_events have vm_events_fold_cpu() to
> >> deal with a cpu going away, but after your patch the stats counted on a cpu just
> >> disappear from the sums as it goes offline as there's no such thing for the numa
> >> counters.
> >> 
> > 
> > That is a problem I missed. Even if zonestats was preserved on
> > hot-remove, fold_vm_zone_numa_events would not be reading the CPU so
> > hotplug events jump all over the place.
> > 
> > So some periodic folding is necessary. I would still prefer not to do it
> > by time but it could be done only on overflow or when a file like
> > /proc/vmstat is read. I'll think about it a bit more and see what I come
> > up with.
> 
> ... because vm_events_fold_cpu() seems to simply move the stats from the CPU
> being offlined to the current one. So the same approach should be enough for
> NUMA stats?
> 

Yes, or at least very similar.
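
Something along those lines would presumably work; a rough sketch below,
assuming it runs from a hotplug callback with the current CPU pinned. The
NUMA event array name and size are made up for illustration; only struct
per_cpu_zonestat and zone->per_cpu_zonestats are from the series as posted:

	/* Move a dying CPU's NUMA event counts over to the current CPU. */
	static void numa_events_fold_cpu(int cpu)
	{
		struct zone *zone;
		int i;

		for_each_populated_zone(zone) {
			struct per_cpu_zonestat *from = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
			struct per_cpu_zonestat *to = this_cpu_ptr(zone->per_cpu_zonestats);

			for (i = 0; i < NR_NUMA_EVENT_ITEMS; i++) {
				to->numa_event[i] += from->numa_event[i];
				from->numa_event[i] = 0;
			}
		}
	}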

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock
  2021-04-14 13:39 [PATCH 0/11 v3] " Mel Gorman
@ 2021-04-14 13:39 ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2021-04-14 13:39 UTC (permalink / raw)
  To: Linux-MM, Linux-RT-Users
  Cc: LKML, Chuck Lever, Jesper Dangaard Brouer, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Michal Hocko, Vlastimil Babka,
	Mel Gorman

There is a lack of clarity about what exactly local_irq_save/local_irq_restore
protects in page_alloc.c. It conflates the protection of per-cpu page
allocation structures with per-cpu vmstat deltas.

This patch protects the PCP structure using local_lock which for most
configurations is identical to IRQ enabling/disabling. The scope of the
lock is still wider than it should be but this is decreased later.

It is possible for the local_lock to be embedded safely within struct
per_cpu_pages but it adds complexity to free_unref_page_list so it is
implemented as a separate patch later in the series.

[lkp@intel.com: Make pagesets static]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 50 +++++++++++++++++++++++++++++-------------
 2 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4393ac27336..106da8fbc72a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -20,6 +20,7 @@
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
 #include <linux/page-flags.h>
+#include <linux/local_lock.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -337,6 +338,7 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
+/* Fields and list protected by pagesets local_lock in page_alloc.c */
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d6283cab22d..4e92d43c25f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -112,6 +112,13 @@ typedef int __bitwise fpi_t;
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_FRACTION	(8)
 
+struct pagesets {
+	local_lock_t lock;
+};
+static DEFINE_PER_CPU(struct pagesets, pagesets) = {
+	.lock = INIT_LOCAL_LOCK(lock),
+};
+
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1421,6 +1428,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
+	/*
+	 * local_lock_irq held so equivalent to spin_lock_irqsave for
+	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
+	 */
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 
@@ -1541,6 +1552,11 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
+
+	/*
+	 * TODO FIX: Disable IRQs before acquiring IRQ-safe zone->lock
+	 * and protect vmstat updates.
+	 */
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, pfn, order, migratetype,
@@ -2910,6 +2926,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 {
 	int i, allocated = 0;
 
+	/*
+	 * local_lock_irq held so equivalent to spin_lock_irqsave for
+	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
+	 */
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
@@ -2962,12 +2982,12 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	unsigned long flags;
 	int to_drain, batch;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
 	if (to_drain > 0)
 		free_pcppages_bulk(zone, to_drain, pcp);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 #endif
 
@@ -2983,13 +3003,13 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	unsigned long flags;
 	struct per_cpu_pages *pcp;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 	if (pcp->count)
 		free_pcppages_bulk(zone, pcp->count, pcp);
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3252,9 +3272,9 @@ void free_unref_page(struct page *page)
 	if (!free_unref_page_prepare(page, pfn))
 		return;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	free_unref_page_commit(page, pfn);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3274,7 +3294,7 @@ void free_unref_page_list(struct list_head *list)
 		set_page_private(page, pfn);
 	}
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_private(page);
 
@@ -3287,12 +3307,12 @@ void free_unref_page_list(struct list_head *list)
 		 * a large list of pages to free.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
-			local_irq_restore(flags);
+			local_unlock_irqrestore(&pagesets.lock, flags);
 			batch_count = 0;
-			local_irq_save(flags);
+			local_lock_irqsave(&pagesets.lock, flags);
 		}
 	}
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3449,7 +3469,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct page *page;
 	unsigned long flags;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
@@ -3457,7 +3477,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
 		zone_statistics(preferred_zone, zone);
 	}
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 	return page;
 }
 
@@ -5052,7 +5072,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		goto failed;
 
 	/* Attempt the batch allocation */
-	local_irq_save(flags);
+	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp_list = &pcp->lists[ac.migratetype];
 
@@ -5090,12 +5110,12 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 
 	return nr_populated;
 
 failed_irq:
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&pagesets.lock, flags);
 
 failed:
 	page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, back to index

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-07 20:24 [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Mel Gorman
2021-04-07 20:24 ` [PATCH 01/11] mm/page_alloc: Split per cpu page lists and zone stats Mel Gorman
2021-04-12 17:43   ` Vlastimil Babka
2021-04-13 13:27     ` Mel Gorman
2021-04-07 20:24 ` [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock Mel Gorman
2021-04-08 10:52   ` Peter Zijlstra
2021-04-08 17:42     ` Mel Gorman
2021-04-09  6:39       ` Peter Zijlstra
2021-04-09  7:59         ` Mel Gorman
2021-04-09  8:24           ` Peter Zijlstra
2021-04-09 13:32             ` Mel Gorman
2021-04-09 18:55               ` Peter Zijlstra
2021-04-12 11:56                 ` Mel Gorman
2021-04-12 21:47                   ` Thomas Gleixner
2021-04-13 16:52                     ` Mel Gorman
2021-04-07 20:24 ` [PATCH 03/11] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove Mel Gorman
2021-04-07 20:24 ` [PATCH 04/11] mm/vmstat: Convert NUMA statistics to basic NUMA counters Mel Gorman
2021-04-14 12:56   ` Vlastimil Babka
2021-04-14 15:18     ` Mel Gorman
2021-04-14 15:56       ` Vlastimil Babka
2021-04-15 10:06         ` Mel Gorman
2021-04-07 20:24 ` [PATCH 05/11] mm/vmstat: Inline NUMA event counter updates Mel Gorman
2021-04-07 20:24 ` [PATCH 06/11] mm/page_alloc: Batch the accounting updates in the bulk allocator Mel Gorman
2021-04-07 20:24 ` [PATCH 07/11] mm/page_alloc: Reduce duration that IRQs are disabled for VM counters Mel Gorman
2021-04-07 20:24 ` [PATCH 08/11] mm/page_alloc: Remove duplicate checks if migratetype should be isolated Mel Gorman
2021-04-07 20:24 ` [PATCH 09/11] mm/page_alloc: Explicitly acquire the zone lock in __free_pages_ok Mel Gorman
2021-04-07 20:24 ` [PATCH 10/11] mm/page_alloc: Avoid conflating IRQs disabled with zone->lock Mel Gorman
2021-04-07 20:24 ` [PATCH 11/11] mm/page_alloc: Update PGFREE outside the zone lock in __free_pages_ok Mel Gorman
2021-04-08 10:56 ` [PATCH 0/11 v2] Use local_lock for pcp protection and reduce stat overhead Peter Zijlstra
2021-04-08 17:48   ` Mel Gorman
2021-04-14 13:39 [PATCH 0/11 v3] " Mel Gorman
2021-04-14 13:39 ` [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock Mel Gorman
