linux-mm.kvack.org archive mirror
* [PATCH 0/9] disable pcplists during memory offline
@ 2020-09-22 14:37 Vlastimil Babka
  2020-09-22 14:37 ` [PATCH 1/9] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
                   ` (9 more replies)
  0 siblings, 10 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-22 14:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

As per the discussions [1] [2] this is an attempt to implement David's
suggestion that page isolation should disable pcplists to avoid races with page
freeing in progress. This is done without extra checks in fast paths, as
explained in Patch 9. The repeated draining done by [2] is then no longer
needed. Previous version (RFC) is at [3].

The RFC tried to hide the pcplists disabling/enabling inside page isolation,
but it wasn't completely possible, as memory offline does not perform
unisolation. Michal suggested an explicit API in [4], so that's the current
implementation, and it indeed seems nicer.

Once we accept that page isolation users need to do explicit actions around it
depending on the needed guarantees, we can also IMHO accept that the current
pcplist draining can also be done by the callers, which is more effective.
After all, there are only two users of page isolation. So patch 7 does
effectively the same thing as Pavel proposed in [5], and patches 8-9 implement
stronger guarantees only for memory offline. If CMA decides to opt in to the
stronger guarantee, it's easy to do so.
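
For illustration, a minimal sketch of the caller pattern that patches 8-9 enable
for memory offline, using the helpers introduced in patch 9 (illustrative only,
condensed from the offline_pages() changes in that patch):

  zone_pcplist_disable(zone);	/* flush pcplists once, keep them disabled */

  ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
				 MEMORY_OFFLINE | REPORT_FAILURE);
  if (!ret) {
	/* ... migrate away and offline the isolated range ... */
  }

  zone_pcplist_enable(zone);	/* restore high/batch, re-enable pcplists */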

Patches 1-6 are preparatory cleanups for pcplist disabling.

Patchset was briefly tested in QEMU so that memory online/offline works, but
I haven't done a stress test that would prove the race fixed by [2] is
eliminated.

Note that patch 9 could be avoided if we instead adjusted page freeing as shown
in [6], but I believe the current implementation of disabling pcplists is not
overly complex, so I would prefer it over adding new checks and a longer
irq-disabled section to the page freeing hotpaths.

[1] https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com/
[2] https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatashin@soleen.com/
[3] https://lore.kernel.org/linux-mm/20200907163628.26495-1-vbabka@suse.cz/
[4] https://lore.kernel.org/linux-mm/20200909113647.GG7348@dhcp22.suse.cz/
[5] https://lore.kernel.org/linux-mm/20200904151448.100489-3-pasha.tatashin@soleen.com/
[6] https://lore.kernel.org/linux-mm/3d3b53db-aeaa-ff24-260b-36427fac9b1c@suse.cz/

Vlastimil Babka (9):
  mm, page_alloc: clean up pageset high and batch update
  mm, page_alloc: calculate pageset high and batch once per zone
  mm, page_alloc: remove setup_pageset()
  mm, page_alloc: simplify pageset_update()
  mm, page_alloc: make per_cpu_pageset accessible only after init
  mm, page_alloc: cache pageset high and batch in struct zone
  mm, page_alloc: move draining pcplists to page isolation users
  mm, page_alloc: drain all pcplists during memory offline
  mm, page_alloc: optionally disable pcplists during page isolation

 include/linux/gfp.h            |   1 +
 include/linux/mmzone.h         |   8 ++
 include/linux/page-isolation.h |   2 +
 mm/internal.h                  |   4 +
 mm/memory_hotplug.c            |  27 +++--
 mm/page_alloc.c                | 190 ++++++++++++++++++---------------
 mm/page_isolation.c            |  26 ++++-
 7 files changed, 152 insertions(+), 106 deletions(-)

-- 
2.28.0




* [PATCH 1/9] mm, page_alloc: clean up pageset high and batch update
  2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
@ 2020-09-22 14:37 ` Vlastimil Babka
  2020-09-25 10:18   ` David Hildenbrand
  2020-10-05 12:03   ` Michal Hocko
  2020-09-22 14:37 ` [PATCH 2/9] mm, page_alloc: calculate pageset high and batch once per zone Vlastimil Babka
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-22 14:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

The updates to pcplists' high and batch values are handled by multiple
functions that make the calculations hard to follow. Consolidate everything
to pageset_set_high_and_batch() and remove pageset_set_batch() and
pageset_set_high() wrappers.

The only special case using one of the removed wrappers was:
build_all_zonelists_init()
  setup_pageset()
    pageset_set_batch()
which was hardcoding batch as 0, so we can just open-code a call to
pageset_update() with constant parameters instead.

No functional change.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
---
 mm/page_alloc.c | 49 ++++++++++++++++++++-----------------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 60a0e94645a6..a163c5e561f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5823,7 +5823,7 @@ static void build_zonelists(pg_data_t *pgdat)
  * not check if the processor is online before following the pageset pointer.
  * Other parts of the kernel may not check if the zone is available.
  */
-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
+static void setup_pageset(struct per_cpu_pageset *p);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
@@ -5891,7 +5891,7 @@ build_all_zonelists_init(void)
 	 * (a chicken-egg dilemma).
 	 */
 	for_each_possible_cpu(cpu)
-		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
+		setup_pageset(&per_cpu(boot_pageset, cpu));
 
 	mminit_verify_zonelist();
 	cpuset_init_current_mems_allowed();
@@ -6200,12 +6200,6 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
 	pcp->batch = batch;
 }
 
-/* a companion to pageset_set_high() */
-static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
-{
-	pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
-}
-
 static void pageset_init(struct per_cpu_pageset *p)
 {
 	struct per_cpu_pages *pcp;
@@ -6218,35 +6212,32 @@ static void pageset_init(struct per_cpu_pageset *p)
 		INIT_LIST_HEAD(&pcp->lists[migratetype]);
 }
 
-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+static void setup_pageset(struct per_cpu_pageset *p)
 {
 	pageset_init(p);
-	pageset_set_batch(p, batch);
+	pageset_update(&p->pcp, 0, 1);
 }
 
 /*
- * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
- * to the value high for the pageset p.
+ * Calculate and set new high and batch values for given per-cpu pageset of a
+ * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
  */
-static void pageset_set_high(struct per_cpu_pageset *p,
-				unsigned long high)
-{
-	unsigned long batch = max(1UL, high / 4);
-	if ((high / 4) > (PAGE_SHIFT * 8))
-		batch = PAGE_SHIFT * 8;
-
-	pageset_update(&p->pcp, high, batch);
-}
-
 static void pageset_set_high_and_batch(struct zone *zone,
-				       struct per_cpu_pageset *pcp)
+				       struct per_cpu_pageset *p)
 {
-	if (percpu_pagelist_fraction)
-		pageset_set_high(pcp,
-			(zone_managed_pages(zone) /
-				percpu_pagelist_fraction));
-	else
-		pageset_set_batch(pcp, zone_batchsize(zone));
+	unsigned long new_high, new_batch;
+
+	if (percpu_pagelist_fraction) {
+		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
+		new_batch = max(1UL, new_high / 4);
+		if ((new_high / 4) > (PAGE_SHIFT * 8))
+			new_batch = PAGE_SHIFT * 8;
+	} else {
+		new_batch = zone_batchsize(zone);
+		new_high = 6 * new_batch;
+		new_batch = max(1UL, 1 * new_batch);
+	}
+	pageset_update(&p->pcp, new_high, new_batch);
 }
 
 static void __meminit zone_pageset_init(struct zone *zone, int cpu)
-- 
2.28.0




* [PATCH 2/9] mm, page_alloc: calculate pageset high and batch once per zone
  2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
  2020-09-22 14:37 ` [PATCH 1/9] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
@ 2020-09-22 14:37 ` Vlastimil Babka
  2020-10-05 12:52   ` Michal Hocko
  2020-09-22 14:37 ` [PATCH 3/9] mm, page_alloc: remove setup_pageset() Vlastimil Babka
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-22 14:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

We currently call pageset_set_high_and_batch() for each possible cpu, which
repeats the same calculations of high and batch values.

Instead call the function just once per zone, and make it apply the calculated
values to all per-cpu pagesets of the zone.

This also allows removing the zone_pageset_init() and __zone_pcp_update()
wrappers.

No functional change.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 42 ++++++++++++++++++------------------------
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a163c5e561f2..26069c8d1b19 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6219,13 +6219,14 @@ static void setup_pageset(struct per_cpu_pageset *p)
 }
 
 /*
- * Calculate and set new high and batch values for given per-cpu pageset of a
+ * Calculate and set new high and batch values for all per-cpu pagesets of a
  * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
  */
-static void pageset_set_high_and_batch(struct zone *zone,
-				       struct per_cpu_pageset *p)
+static void zone_set_pageset_high_and_batch(struct zone *zone)
 {
 	unsigned long new_high, new_batch;
+	struct per_cpu_pageset *p;
+	int cpu;
 
 	if (percpu_pagelist_fraction) {
 		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
@@ -6237,23 +6238,25 @@ static void pageset_set_high_and_batch(struct zone *zone,
 		new_high = 6 * new_batch;
 		new_batch = max(1UL, 1 * new_batch);
 	}
-	pageset_update(&p->pcp, new_high, new_batch);
-}
-
-static void __meminit zone_pageset_init(struct zone *zone, int cpu)
-{
-	struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
 
-	pageset_init(pcp);
-	pageset_set_high_and_batch(zone, pcp);
+	for_each_possible_cpu(cpu) {
+		p = per_cpu_ptr(zone->pageset, cpu);
+		pageset_update(&p->pcp, new_high, new_batch);
+	}
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
 {
+	struct per_cpu_pageset *p;
 	int cpu;
+
 	zone->pageset = alloc_percpu(struct per_cpu_pageset);
-	for_each_possible_cpu(cpu)
-		zone_pageset_init(zone, cpu);
+	for_each_possible_cpu(cpu) {
+		p = per_cpu_ptr(zone->pageset, cpu);
+		pageset_init(p);
+	}
+
+	zone_set_pageset_high_and_batch(zone);
 }
 
 /*
@@ -7985,15 +7988,6 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
-static void __zone_pcp_update(struct zone *zone)
-{
-	unsigned int cpu;
-
-	for_each_possible_cpu(cpu)
-		pageset_set_high_and_batch(zone,
-				per_cpu_ptr(zone->pageset, cpu));
-}
-
 /*
  * percpu_pagelist_fraction - changes the pcp->high for each zone on each
  * cpu.  It is the fraction of total pages in each zone that a hot per cpu
@@ -8026,7 +8020,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 		goto out;
 
 	for_each_populated_zone(zone)
-		__zone_pcp_update(zone);
+		zone_set_pageset_high_and_batch(zone);
 out:
 	mutex_unlock(&pcp_batch_high_lock);
 	return ret;
@@ -8633,7 +8627,7 @@ EXPORT_SYMBOL(free_contig_range);
 void __meminit zone_pcp_update(struct zone *zone)
 {
 	mutex_lock(&pcp_batch_high_lock);
-	__zone_pcp_update(zone);
+	zone_set_pageset_high_and_batch(zone);
 	mutex_unlock(&pcp_batch_high_lock);
 }
 
-- 
2.28.0




* [PATCH 3/9] mm, page_alloc: remove setup_pageset()
  2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
  2020-09-22 14:37 ` [PATCH 1/9] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
  2020-09-22 14:37 ` [PATCH 2/9] mm, page_alloc: calculate pageset high and batch once per zone Vlastimil Babka
@ 2020-09-22 14:37 ` Vlastimil Babka
  2020-09-25 10:19   ` David Hildenbrand
  2020-10-05 12:59   ` Michal Hocko
  2020-09-22 14:37 ` [PATCH 4/9] mm, page_alloc: simplify pageset_update() Vlastimil Babka
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-22 14:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

We initialize boot-time pagesets with setup_pageset(), which sets high and
batch values that effectively disable pcplists.

We can remove this wrapper if we just set these values for all pagesets in
pageset_init(). Non-boot pagesets then subsequently update them to the proper
values.

No functional change.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 26069c8d1b19..76c2b4578723 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5823,7 +5823,7 @@ static void build_zonelists(pg_data_t *pgdat)
  * not check if the processor is online before following the pageset pointer.
  * Other parts of the kernel may not check if the zone is available.
  */
-static void setup_pageset(struct per_cpu_pageset *p);
+static void pageset_init(struct per_cpu_pageset *p);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
@@ -5891,7 +5891,7 @@ build_all_zonelists_init(void)
 	 * (a chicken-egg dilemma).
 	 */
 	for_each_possible_cpu(cpu)
-		setup_pageset(&per_cpu(boot_pageset, cpu));
+		pageset_init(&per_cpu(boot_pageset, cpu));
 
 	mminit_verify_zonelist();
 	cpuset_init_current_mems_allowed();
@@ -6210,12 +6210,15 @@ static void pageset_init(struct per_cpu_pageset *p)
 	pcp = &p->pcp;
 	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
 		INIT_LIST_HEAD(&pcp->lists[migratetype]);
-}
 
-static void setup_pageset(struct per_cpu_pageset *p)
-{
-	pageset_init(p);
-	pageset_update(&p->pcp, 0, 1);
+	/*
+	 * Set batch and high values safe for a boot pageset. A true percpu
+	 * pageset's initialization will update them subsequently. Here we don't
+	 * need to be as careful as pageset_update() as nobody can access the
+	 * pageset yet.
+	 */
+	pcp->high = 0;
+	pcp->batch = 1;
 }
 
 /*
-- 
2.28.0




* [PATCH 4/9] mm, page_alloc: simplify pageset_update()
  2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
                   ` (2 preceding siblings ...)
  2020-09-22 14:37 ` [PATCH 3/9] mm, page_alloc: remove setup_pageset() Vlastimil Babka
@ 2020-09-22 14:37 ` Vlastimil Babka
  2020-09-25 10:23   ` David Hildenbrand
  2020-10-05 13:20   ` Michal Hocko
  2020-09-22 14:37 ` [PATCH 5/9] mm, page_alloc: make per_cpu_pageset accessible only after init Vlastimil Babka
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-22 14:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

pageset_update() attempts to update pcplist's high and batch values in a way
that readers don't observe batch > high. It uses smp_wmb() to order the updates
to achieve this. However, without properly paired read barriers in the readers
this guarantee doesn't hold, and there are no such barriers in
e.g. free_unref_page_commit().

Commit 88e8ac11d2ea ("mm, page_alloc: fix core hung in free_pcppages_bulk()")
already showed this is problematic, and solved it by ultimately trusting only
pcp->count of the current cpu with interrupts disabled.

The update dance with unpaired write barriers thus makes no sense. Replace
them with plain WRITE_ONCE to prevent store tearing, and document that the
values can change asynchronously and should not be trusted for correctness.

All current readers appear to be OK after 88e8ac11d2ea. Convert them to
READ_ONCE to prevent unnecessary read tearing, but mainly to alert anybody
making future changes to the code that special care is needed.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 40 ++++++++++++++++++----------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c2b4578723..99b74c1c2b0a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1297,7 +1297,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
-	int prefetch_nr = 0;
+	int prefetch_nr = READ_ONCE(pcp->batch);
 	bool isolated_pageblocks;
 	struct page *page, *tmp;
 	LIST_HEAD(head);
@@ -1348,8 +1348,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			 * avoid excessive prefetching due to large count, only
 			 * prefetch buddy for the first pcp->batch nr of pages.
 			 */
-			if (prefetch_nr++ < pcp->batch)
+			if (prefetch_nr) {
 				prefetch_buddy(page);
+				prefetch_nr--;
+			}
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
@@ -3131,10 +3133,8 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list_add(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
-	if (pcp->count >= pcp->high) {
-		unsigned long batch = READ_ONCE(pcp->batch);
-		free_pcppages_bulk(zone, batch, pcp);
-	}
+	if (pcp->count >= READ_ONCE(pcp->high))
+		free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
 }
 
 /*
@@ -3318,7 +3318,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 	do {
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
-					pcp->batch, list,
+					READ_ONCE(pcp->batch), list,
 					migratetype, alloc_flags);
 			if (unlikely(list_empty(list)))
 				return NULL;
@@ -6174,13 +6174,16 @@ static int zone_batchsize(struct zone *zone)
 }
 
 /*
- * pcp->high and pcp->batch values are related and dependent on one another:
- * ->batch must never be higher then ->high.
- * The following function updates them in a safe manner without read side
- * locking.
+ * pcp->high and pcp->batch values are related and generally batch is lower
+ * than high. They are also related to pcp->count such that count is lower
+ * than high, and as soon as it reaches high, the pcplist is flushed.
  *
- * Any new users of pcp->batch and pcp->high should ensure they can cope with
- * those fields changing asynchronously (acording to the above rule).
+ * However, guaranteeing these relations at all times would require e.g. write
+ * barriers here but also careful usage of read barriers at the read side, and
+ * thus be prone to error and bad for performance. Thus the update only prevents
+ * store tearing. Any new users of pcp->batch and pcp->high should ensure they
+ * can cope with those fields changing asynchronously, and fully trust only the
+ * pcp->count field on the local CPU with interrupts disabled.
  *
  * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
  * outside of boot time (or some other assurance that no concurrent updaters
@@ -6189,15 +6192,8 @@ static int zone_batchsize(struct zone *zone)
 static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
 		unsigned long batch)
 {
-       /* start with a fail safe value for batch */
-	pcp->batch = 1;
-	smp_wmb();
-
-       /* Update high, then batch, in order */
-	pcp->high = high;
-	smp_wmb();
-
-	pcp->batch = batch;
+	WRITE_ONCE(pcp->batch, batch);
+	WRITE_ONCE(pcp->high, high);
 }
 
 static void pageset_init(struct per_cpu_pageset *p)
-- 
2.28.0




* [PATCH 5/9] mm, page_alloc: make per_cpu_pageset accessible only after init
  2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
                   ` (3 preceding siblings ...)
  2020-09-22 14:37 ` [PATCH 4/9] mm, page_alloc: simplify pageset_update() Vlastimil Babka
@ 2020-09-22 14:37 ` Vlastimil Babka
  2020-09-25 10:25   ` David Hildenbrand
  2020-10-05 13:24   ` Michal Hocko
  2020-09-22 14:37 ` [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone Vlastimil Babka
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-22 14:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

setup_zone_pageset() replaces the boot_pageset by allocating and initializing a
proper percpu one. Currently it assigns zone->pageset with the newly allocated
one before initializing it. That's currently not an issue, because the zone
should not be in any zonelist, thus not visible to allocators at this point.

Memory ordering between the pcplist contents and its visibility is also not
guaranteed here, but that also shouldn't be an issue because online_pages()
does a spin_unlock(pgdat->node_size_lock) before building the zonelists.

However it's best that we don't silently rely on operations that can be changed
in the future. Make sure only properly initialized pcplists are visible, using
smp_store_release(). The read side has a data dependency via the zone->pageset
pointer instead of an explicit read barrier.
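
For reference, a minimal sketch of the publish/consume pairing this relies on
(illustrative only, not taken verbatim from the kernel sources):

  /* writer: publish the pageset only once it is fully initialized */
  smp_store_release(&zone->pageset, new_pageset);

  /*
   * reader: the per-cpu access depends on the loaded pointer value, so it
   * is ordered after the pointer load without an explicit read barrier
   */
  pcp = &this_cpu_ptr(zone->pageset)->pcp;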

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 99b74c1c2b0a..de3b48bda45c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6246,15 +6246,17 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
 
 void __meminit setup_zone_pageset(struct zone *zone)
 {
+	struct per_cpu_pageset __percpu * new_pageset;
 	struct per_cpu_pageset *p;
 	int cpu;
 
-	zone->pageset = alloc_percpu(struct per_cpu_pageset);
+	new_pageset = alloc_percpu(struct per_cpu_pageset);
 	for_each_possible_cpu(cpu) {
-		p = per_cpu_ptr(zone->pageset, cpu);
+		p = per_cpu_ptr(new_pageset, cpu);
 		pageset_init(p);
 	}
 
+	smp_store_release(&zone->pageset, new_pageset);
 	zone_set_pageset_high_and_batch(zone);
 }
 
-- 
2.28.0




* [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone
  2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
                   ` (4 preceding siblings ...)
  2020-09-22 14:37 ` [PATCH 5/9] mm, page_alloc: make per_cpu_pageset accessible only after init Vlastimil Babka
@ 2020-09-22 14:37 ` Vlastimil Babka
  2020-09-25 10:34   ` David Hildenbrand
  2020-10-05 13:28   ` Michal Hocko
  2020-09-22 14:37 ` [PATCH 7/9] mm, page_alloc: move draining pcplists to page isolation users Vlastimil Babka
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-22 14:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

All per-cpu pagesets for a zone use the same high and batch values, which are
duplicated there just for performance (locality) reasons. This patch adds the
same variables also to struct zone as a shared copy.

This will be useful later for making it possible to disable pcplists
temporarily by setting the high value to 0, while remembering the values for
restoring them later. But we can also immediately benefit from not updating
the pagesets of all possible cpus in case the newly recalculated values (after
a sysctl change or memory online/offline) are actually unchanged from the
previous ones.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  6 ++++++
 mm/page_alloc.c        | 16 ++++++++++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 90721f3156bc..7ad3f14dbe88 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -470,6 +470,12 @@ struct zone {
 #endif
 	struct pglist_data	*zone_pgdat;
 	struct per_cpu_pageset __percpu *pageset;
+	/*
+	 * the high and batch values are copied to individual pagesets for
+	 * faster access
+	 */
+	int pageset_high;
+	int pageset_batch;
 
 #ifndef CONFIG_SPARSEMEM
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index de3b48bda45c..901907799bdc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5824,6 +5824,8 @@ static void build_zonelists(pg_data_t *pgdat)
  * Other parts of the kernel may not check if the zone is available.
  */
 static void pageset_init(struct per_cpu_pageset *p);
+#define BOOT_PAGESET_HIGH	0
+#define BOOT_PAGESET_BATCH	1
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
@@ -6213,8 +6215,8 @@ static void pageset_init(struct per_cpu_pageset *p)
 	 * need to be as careful as pageset_update() as nobody can access the
 	 * pageset yet.
 	 */
-	pcp->high = 0;
-	pcp->batch = 1;
+	pcp->high = BOOT_PAGESET_HIGH;
+	pcp->batch = BOOT_PAGESET_BATCH;
 }
 
 /*
@@ -6238,6 +6240,14 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
 		new_batch = max(1UL, 1 * new_batch);
 	}
 
+	if (zone->pageset_high != new_high ||
+	    zone->pageset_batch != new_batch) {
+		zone->pageset_high = new_high;
+		zone->pageset_batch = new_batch;
+	} else {
+		return;
+	}
+
 	for_each_possible_cpu(cpu) {
 		p = per_cpu_ptr(zone->pageset, cpu);
 		pageset_update(&p->pcp, new_high, new_batch);
@@ -6300,6 +6310,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
 	 * offset of a (static) per cpu variable into the per cpu area.
 	 */
 	zone->pageset = &boot_pageset;
+	zone->pageset_high = BOOT_PAGESET_HIGH;
+	zone->pageset_batch = BOOT_PAGESET_BATCH;
 
 	if (populated_zone(zone))
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
-- 
2.28.0




* [PATCH 7/9] mm, page_alloc: move draining pcplists to page isolation users
  2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
                   ` (5 preceding siblings ...)
  2020-09-22 14:37 ` [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone Vlastimil Babka
@ 2020-09-22 14:37 ` Vlastimil Babka
  2020-09-25 10:39   ` David Hildenbrand
  2020-10-05 13:57   ` Michal Hocko
  2020-09-22 14:37 ` [PATCH 8/9] mm, page_alloc: drain all pcplists during memory offline Vlastimil Babka
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-22 14:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

Currently, pcplists are drained during set_migratetype_isolate(), which means
once per pageblock processed by start_isolate_page_range(). This is somewhat
wasteful. Moreover, the callers might need different guarantees, and the
draining is currently prone to races and does not guarantee that no page
from an isolated pageblock will end up on the pcplist after the drain.

Better guarantees are added by later patches and require explicit actions
by the page isolation users that need them. Thus it makes sense to also move
the current imperfect draining to the callers, as a preparation step.

Suggested-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memory_hotplug.c | 11 ++++++-----
 mm/page_alloc.c     |  2 ++
 mm/page_isolation.c | 10 +++++-----
 3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9db80ee29caa..08f729922e18 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1524,6 +1524,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		goto failed_removal;
 	}
 
+	drain_all_pages(zone);
+
 	arg.start_pfn = start_pfn;
 	arg.nr_pages = nr_pages;
 	node_states_check_changes_offline(nr_pages, zone, &arg);
@@ -1574,11 +1576,10 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		}
 
 		/*
-		 * per-cpu pages are drained in start_isolate_page_range, but if
-		 * there are still pages that are not free, make sure that we
-		 * drain again, because when we isolated range we might
-		 * have raced with another thread that was adding pages to pcp
-		 * list.
+		 * per-cpu pages are drained after start_isolate_page_range, but
+		 * if there are still pages that are not free, make sure that we
+		 * drain again, because when we isolated range we might have
+		 * raced with another thread that was adding pages to pcp list.
 		 *
 		 * Forward progress should be still guaranteed because
 		 * pages on the pcp list can only belong to MOVABLE_ZONE
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 901907799bdc..4e37bc3f6077 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8432,6 +8432,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	if (ret)
 		return ret;
 
+	drain_all_pages(cc.zone);
+
 	/*
 	 * In case of -EBUSY, we'd like to know which page causes problem.
 	 * So, just fall through. test_pages_isolated() has a tracepoint
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index abfe26ad59fd..57d8bc1e7760 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -49,7 +49,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 
 		__mod_zone_freepage_state(zone, -nr_pages, mt);
 		spin_unlock_irqrestore(&zone->lock, flags);
-		drain_all_pages(zone);
 		return 0;
 	}
 
@@ -167,11 +166,12 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  *
  * Please note that there is no strong synchronization with the page allocator
  * either. Pages might be freed while their page blocks are marked ISOLATED.
- * In some cases pages might still end up on pcp lists and that would allow
+ * A call to drain_all_pages() after isolation can flush most of them. However
+ * in some cases pages might still end up on pcp lists and that would allow
  * for their allocation even when they are in fact isolated already. Depending
- * on how strong of a guarantee the caller needs drain_all_pages might be needed
- * (e.g. __offline_pages will need to call it after check for isolated range for
- * a next retry).
+ * on how strong of a guarantee the caller needs, further drain_all_pages()
+ * might be needed (e.g. __offline_pages will need to call it after check for
+ * isolated range for a next retry).
  *
  * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
-- 
2.28.0




* [PATCH 8/9] mm, page_alloc: drain all pcplists during memory offline
  2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
                   ` (6 preceding siblings ...)
  2020-09-22 14:37 ` [PATCH 7/9] mm, page_alloc: move draining pcplists to page isolation users Vlastimil Babka
@ 2020-09-22 14:37 ` Vlastimil Babka
  2020-09-25 10:46   ` David Hildenbrand
  2020-09-22 14:37 ` [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation Vlastimil Babka
  2020-09-22 17:15 ` [PATCH 0/9] disable pcplists during memory offline David Hildenbrand
  9 siblings, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-22 14:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

drain_all_pages() is optimized to only execute on cpus where pcplists are not
empty. The check can however race with a free to a pcplist that has not yet
increased pcp->count from 0 to 1. Make the drain optionally skip the racy
check and drain on all cpus, and use this variant in the memory offline
context, where we want to make sure no isolated pages are left behind on
pcplists.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/gfp.h |  1 +
 mm/memory_hotplug.c |  4 ++--
 mm/page_alloc.c     | 29 ++++++++++++++++++++---------
 3 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..cc52c5cc9fab 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -592,6 +592,7 @@ extern void page_frag_free(void *addr);
 
 void page_alloc_init(void);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
+void __drain_all_pages(struct zone *zone, bool page_isolation);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 08f729922e18..bbde415b558b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1524,7 +1524,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		goto failed_removal;
 	}
 
-	drain_all_pages(zone);
+	__drain_all_pages(zone, true);
 
 	arg.start_pfn = start_pfn;
 	arg.nr_pages = nr_pages;
@@ -1588,7 +1588,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		 */
 		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
 		if (ret)
-			drain_all_pages(zone);
+			__drain_all_pages(zone, true);
 	} while (ret);
 
 	/* Mark all sections offline and remove free pages from the buddy. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e37bc3f6077..33cc35d152b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2960,14 +2960,7 @@ static void drain_local_pages_wq(struct work_struct *work)
 	preempt_enable();
 }
 
-/*
- * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
- *
- * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
- */
-void drain_all_pages(struct zone *zone)
+void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 {
 	int cpu;
 
@@ -3006,7 +2999,13 @@ void drain_all_pages(struct zone *zone)
 		struct zone *z;
 		bool has_pcps = false;
 
-		if (zone) {
+		if (force_all_cpus) {
+			/*
+			 * The pcp.count check is racy, some callers need a
+			 * guarantee that no cpu is missed.
+			 */
+			has_pcps = true;
+		} else if (zone) {
 			pcp = per_cpu_ptr(zone->pageset, cpu);
 			if (pcp->pcp.count)
 				has_pcps = true;
@@ -3039,6 +3038,18 @@ void drain_all_pages(struct zone *zone)
 	mutex_unlock(&pcpu_drain_mutex);
 }
 
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
+ *
+ * When zone parameter is non-NULL, spill just the single zone's pages.
+ *
+ * Note that this can be extremely slow as the draining happens in a workqueue.
+ */
+void drain_all_pages(struct zone *zone)
+{
+	__drain_all_pages(zone, false);
+}
+
 #ifdef CONFIG_HIBERNATION
 
 /*
-- 
2.28.0




* [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
                   ` (7 preceding siblings ...)
  2020-09-22 14:37 ` [PATCH 8/9] mm, page_alloc: drain all pcplists during memory offline Vlastimil Babka
@ 2020-09-22 14:37 ` Vlastimil Babka
  2020-09-25 10:53   ` David Hildenbrand
  2020-10-06  8:34   ` Michal Hocko
  2020-09-22 17:15 ` [PATCH 0/9] disable pcplists during memory offline David Hildenbrand
  9 siblings, 2 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-22 14:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka, Michal Hocko

Page isolation can race with processes freeing pages to pcplists in a way that
a page from an isolated pageblock can end up on a pcplist. This can be fixed by
repeated draining of pcplists, as done by the patch "mm/memory_hotplug: drain
per-cpu pages again during memory offline" in [1].

David and Michal would prefer that this race was closed in a way that callers
of page isolation who need stronger guarantees don't need to repeatedly drain.
David suggested disabling pcplists usage completely during page isolation,
instead of repeatedly draining them.

To achieve this without adding special cases in alloc/free fastpath, we can use
the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
will be immediately flushed.

The race can thus be closed by setting pcp->high to 0 and draining pcplists
once, before calling start_isolate_page_range(). The draining will serialize
after processes that have already disabled interrupts and read the old value
of pcp->high in free_unref_page_commit(); processes that have not yet disabled
interrupts will observe pcp->high == 0 when they are rescheduled, and will
skip pcplists. This guarantees that no stray pages remain on pcplists in zones
where isolation happens.
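
A minimal sketch of why pcp->high == 0 keeps pages off the pcplist, condensed
from free_unref_page_commit() as modified by patch 4 (illustrative only):

  list_add(&page->lru, &pcp->lists[migratetype]);
  pcp->count++;
  /* with pcp->high == 0 this is always true, so the list is flushed at once */
  if (pcp->count >= READ_ONCE(pcp->high))
	free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);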

This patch thus adds zone_pcplist_disable() and zone_pcplist_enable() functions
that page isolation users can call before start_isolate_page_range() and after
unisolating (or offlining) the isolated pages. A new zone->pcplist_disabled
atomic variable makes sure we disable only pcplists once and don't enable
them prematurely in case there are multiple users in parallel.

We however have to avoid external updates to high and batch by taking
pcp_batch_high_lock. To allow multiple isolations in parallel, change this lock
from a mutex to an rwsem.

Currently the only user of this functionality is offline_pages().

[1] https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatashin@soleen.com/

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h         |  2 ++
 include/linux/page-isolation.h |  2 ++
 mm/internal.h                  |  4 ++++
 mm/memory_hotplug.c            | 28 ++++++++++++----------------
 mm/page_alloc.c                | 29 ++++++++++++++++++-----------
 mm/page_isolation.c            | 22 +++++++++++++++++++---
 6 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7ad3f14dbe88..3c653d6348b1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -536,6 +536,8 @@ struct zone {
 	 * of pageblock. Protected by zone->lock.
 	 */
 	unsigned long		nr_isolate_pageblock;
+	/* Track requests for disabling pcplists */
+	atomic_t		pcplist_disabled;
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 572458016331..1dec5d0f62a6 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
 void set_pageblock_migratetype(struct page *page, int migratetype);
 int move_freepages_block(struct zone *zone, struct page *page,
 				int migratetype, int *num_movable);
+void zone_pcplist_disable(struct zone *zone);
+void zone_pcplist_enable(struct zone *zone);
 
 /*
  * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
diff --git a/mm/internal.h b/mm/internal.h
index ea66ef45da6c..154e38e702dd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -7,6 +7,7 @@
 #ifndef __MM_INTERNAL_H
 #define __MM_INTERNAL_H
 
+#include <linux/rwsem.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
@@ -199,8 +200,11 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
 					gfp_t gfp_flags);
 extern int user_min_free_kbytes;
 
+extern struct rw_semaphore pcp_batch_high_lock;
 extern void zone_pcp_update(struct zone *zone);
 extern void zone_pcp_reset(struct zone *zone);
+extern void zone_update_pageset_high_and_batch(struct zone *zone,
+		unsigned long high, unsigned long batch);
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index bbde415b558b..6fe9be550160 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1515,17 +1515,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	}
 	node = zone_to_nid(zone);
 
+	/*
+	 * Disable pcplists so that page isolation cannot race with freeing
+	 * in a way that pages from isolated pageblock are left on pcplists.
+	 */
+	zone_pcplist_disable(zone);
+
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
 				       MEMORY_OFFLINE | REPORT_FAILURE);
 	if (ret) {
 		reason = "failure to isolate range";
-		goto failed_removal;
+		goto failed_removal_pcplists_disabled;
 	}
 
-	__drain_all_pages(zone, true);
-
 	arg.start_pfn = start_pfn;
 	arg.nr_pages = nr_pages;
 	node_states_check_changes_offline(nr_pages, zone, &arg);
@@ -1575,20 +1579,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 			goto failed_removal_isolated;
 		}
 
-		/*
-		 * per-cpu pages are drained after start_isolate_page_range, but
-		 * if there are still pages that are not free, make sure that we
-		 * drain again, because when we isolated range we might have
-		 * raced with another thread that was adding pages to pcp list.
-		 *
-		 * Forward progress should be still guaranteed because
-		 * pages on the pcp list can only belong to MOVABLE_ZONE
-		 * because has_unmovable_pages explicitly checks for
-		 * PageBuddy on freed pages on other zones.
-		 */
 		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
-		if (ret)
-			__drain_all_pages(zone, true);
+
 	} while (ret);
 
 	/* Mark all sections offline and remove free pages from the buddy. */
@@ -1604,6 +1596,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
+	zone_pcplist_enable(zone);
+
 	/* removal success */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
 	zone->present_pages -= nr_pages;
@@ -1637,6 +1631,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 failed_removal_isolated:
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
+failed_removal_pcplists_disabled:
+	zone_pcplist_enable(zone);
 failed_removal:
 	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
 		 (unsigned long long) start_pfn << PAGE_SHIFT,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 33cc35d152b1..673d353f9311 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -78,7 +78,7 @@
 #include "page_reporting.h"
 
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
-static DEFINE_MUTEX(pcp_batch_high_lock);
+DECLARE_RWSEM(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_FRACTION	(8)
 
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
@@ -6230,6 +6230,18 @@ static void pageset_init(struct per_cpu_pageset *p)
 	pcp->batch = BOOT_PAGESET_BATCH;
 }
 
+void zone_update_pageset_high_and_batch(struct zone *zone, unsigned long high,
+		unsigned long batch)
+{
+	struct per_cpu_pageset *p;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		p = per_cpu_ptr(zone->pageset, cpu);
+		pageset_update(&p->pcp, high, batch);
+	}
+}
+
 /*
  * Calculate and set new high and batch values for all per-cpu pagesets of a
  * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
@@ -6237,8 +6249,6 @@ static void pageset_init(struct per_cpu_pageset *p)
 static void zone_set_pageset_high_and_batch(struct zone *zone)
 {
 	unsigned long new_high, new_batch;
-	struct per_cpu_pageset *p;
-	int cpu;
 
 	if (percpu_pagelist_fraction) {
 		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
@@ -6259,10 +6269,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
 		return;
 	}
 
-	for_each_possible_cpu(cpu) {
-		p = per_cpu_ptr(zone->pageset, cpu);
-		pageset_update(&p->pcp, new_high, new_batch);
-	}
+	zone_update_pageset_high_and_batch(zone, new_high, new_batch);
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
@@ -8024,7 +8031,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 	int old_percpu_pagelist_fraction;
 	int ret;
 
-	mutex_lock(&pcp_batch_high_lock);
+	down_write(&pcp_batch_high_lock);
 	old_percpu_pagelist_fraction = percpu_pagelist_fraction;
 
 	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
@@ -8046,7 +8053,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 	for_each_populated_zone(zone)
 		zone_set_pageset_high_and_batch(zone);
 out:
-	mutex_unlock(&pcp_batch_high_lock);
+	up_write(&pcp_batch_high_lock);
 	return ret;
 }
 
@@ -8652,9 +8659,9 @@ EXPORT_SYMBOL(free_contig_range);
  */
 void __meminit zone_pcp_update(struct zone *zone)
 {
-	mutex_lock(&pcp_batch_high_lock);
+	down_write(&pcp_batch_high_lock);
 	zone_set_pageset_high_and_batch(zone);
-	mutex_unlock(&pcp_batch_high_lock);
+	up_write(&pcp_batch_high_lock);
 }
 
 void zone_pcp_reset(struct zone *zone)
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 57d8bc1e7760..c0e895e8f322 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -15,6 +15,22 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/page_isolation.h>
 
+void zone_pcplist_disable(struct zone *zone)
+{
+	down_read(&pcp_batch_high_lock);
+	if (atomic_inc_return(&zone->pcplist_disabled) == 1) {
+		zone_update_pageset_high_and_batch(zone, 0, 1);
+		__drain_all_pages(zone, true);
+	}
+}
+
+void zone_pcplist_enable(struct zone *zone)
+{
+	if (atomic_dec_return(&zone->pcplist_disabled) == 0)
+		zone_update_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+	up_read(&pcp_batch_high_lock);
+}
+
 static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags)
 {
 	struct zone *zone = page_zone(page);
@@ -169,9 +185,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * A call to drain_all_pages() after isolation can flush most of them. However
  * in some cases pages might still end up on pcp lists and that would allow
  * for their allocation even when they are in fact isolated already. Depending
- * on how strong of a guarantee the caller needs, further drain_all_pages()
- * might be needed (e.g. __offline_pages will need to call it after check for
- * isolated range for a next retry).
+ * on how strong of a guarantee the caller needs, zone_pcplist_disable/enable()
+ * might be used to flush and disable pcplist before isolation and after
+ * unisolation.
  *
  * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
-- 
2.28.0




* Re: [PATCH 0/9] disable pcplists during memory offline
  2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
                   ` (8 preceding siblings ...)
  2020-09-22 14:37 ` [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation Vlastimil Babka
@ 2020-09-22 17:15 ` David Hildenbrand
  9 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2020-09-22 17:15 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador, Joonsoo Kim

On 22.09.20 16:37, Vlastimil Babka wrote:
> As per the discussions [1] [2] this is an attempt to implement David's
> suggestion that page isolation should disable pcplists to avoid races with page
> freeing in progress. This is done without extra checks in fast paths, as
> explained in Patch 9. The repeated draining done by [2] is then no longer
> needed. Previous version (RFC) is at [3].
> 
> The RFC tried to hide the pcplists disabling/enabling inside page isolation,
> but it wasn't completely possible, as memory offline does not perform
> unisolation. Michal suggested an explicit API in [4], so that's the current
> implementation, and it indeed seems nicer.
> 
> Once we accept that page isolation users need to do explicit actions around it
> depending on the needed guarantees, we can also IMHO accept that the current
> pcplist draining can also be done by the callers, which is more effective.
> After all, there are only two users of page isolation. So patch 7 does
> effectively the same thing as Pavel proposed in [5], and patches 8-9 implement
> stronger guarantees only for memory offline. If CMA decides to opt in to the
> stronger guarantee, it's easy to do so.
> 
> Patches 1-6 are preparatory cleanups for pcplist disabling.
> 
> Patchset was briefly tested in QEMU so that memory online/offline works, but
> I haven't done a stress test that would prove the race fixed by [2] is
> eliminated.
> 
> Note that patch 9 could be avoided if we instead adjusted page freeing as shown
> in [6], but I believe the current implementation of disabling pcplists is not
> overly complex, so I would prefer it over adding new checks and a longer
> irq-disabled section to the page freeing hotpaths.

Haven't looked into the details (yet), but I assume we can add some flag
to alloc_contig_range(), to also disable+flush+enable. (or let the
caller do it, for example on a bunch of bulk allocations - TBD).

Result of patch #9 looks quite clean.

-- 
Thanks,

David / dhildenb




* Re: [PATCH 1/9] mm, page_alloc: clean up pageset high and batch update
  2020-09-22 14:37 ` [PATCH 1/9] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
@ 2020-09-25 10:18   ` David Hildenbrand
  2020-10-05 12:03   ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2020-09-25 10:18 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador, Joonsoo Kim

On 22.09.20 16:37, Vlastimil Babka wrote:
> The updates to pcplists' high and batch values are handled by multiple
> functions that make the calculations hard to follow. Consolidate everything
> to pageset_set_high_and_batch() and remove pageset_set_batch() and
> pageset_set_high() wrappers.
> 
> The only special case using one of the removed wrappers was:
> build_all_zonelists_init()
>   setup_pageset()
>     pageset_set_batch()
> which was hardcoding batch as 0, so we can just open-code a call to
> pageset_update() with constant parameters instead.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/page_alloc.c | 49 ++++++++++++++++++++-----------------------------
>  1 file changed, 20 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 60a0e94645a6..a163c5e561f2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5823,7 +5823,7 @@ static void build_zonelists(pg_data_t *pgdat)
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> +static void setup_pageset(struct per_cpu_pageset *p);
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -5891,7 +5891,7 @@ build_all_zonelists_init(void)
>  	 * (a chicken-egg dilemma).
>  	 */
>  	for_each_possible_cpu(cpu)
> -		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> +		setup_pageset(&per_cpu(boot_pageset, cpu));
>  
>  	mminit_verify_zonelist();
>  	cpuset_init_current_mems_allowed();
> @@ -6200,12 +6200,6 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
>  	pcp->batch = batch;
>  }
>  
> -/* a companion to pageset_set_high() */
> -static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
> -{
> -	pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
> -}
> -
>  static void pageset_init(struct per_cpu_pageset *p)
>  {
>  	struct per_cpu_pages *pcp;
> @@ -6218,35 +6212,32 @@ static void pageset_init(struct per_cpu_pageset *p)
>  		INIT_LIST_HEAD(&pcp->lists[migratetype]);
>  }
>  
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
> +static void setup_pageset(struct per_cpu_pageset *p)
>  {
>  	pageset_init(p);
> -	pageset_set_batch(p, batch);
> +	pageset_update(&p->pcp, 0, 1);
>  }
>  
>  /*
> - * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
> - * to the value high for the pageset p.
> + * Calculate and set new high and batch values for given per-cpu pageset of a
> + * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
>   */
> -static void pageset_set_high(struct per_cpu_pageset *p,
> -				unsigned long high)
> -{
> -	unsigned long batch = max(1UL, high / 4);
> -	if ((high / 4) > (PAGE_SHIFT * 8))
> -		batch = PAGE_SHIFT * 8;
> -
> -	pageset_update(&p->pcp, high, batch);
> -}
> -
>  static void pageset_set_high_and_batch(struct zone *zone,
> -				       struct per_cpu_pageset *pcp)
> +				       struct per_cpu_pageset *p)
>  {
> -	if (percpu_pagelist_fraction)
> -		pageset_set_high(pcp,
> -			(zone_managed_pages(zone) /
> -				percpu_pagelist_fraction));
> -	else
> -		pageset_set_batch(pcp, zone_batchsize(zone));
> +	unsigned long new_high, new_batch;
> +
> +	if (percpu_pagelist_fraction) {
> +		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> +		new_batch = max(1UL, new_high / 4);
> +		if ((new_high / 4) > (PAGE_SHIFT * 8))
> +			new_batch = PAGE_SHIFT * 8;
> +	} else {
> +		new_batch = zone_batchsize(zone);
> +		new_high = 6 * new_batch;
> +		new_batch = max(1UL, 1 * new_batch);
> +	}
> +	pageset_update(&p->pcp, new_high, new_batch);
>  }
>  
>  static void __meminit zone_pageset_init(struct zone *zone, int cpu)
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb




* Re: [PATCH 3/9] mm, page_alloc: remove setup_pageset()
  2020-09-22 14:37 ` [PATCH 3/9] mm, page_alloc: remove setup_pageset() Vlastimil Babka
@ 2020-09-25 10:19   ` David Hildenbrand
  2020-10-05 12:59   ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2020-09-25 10:19 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador, Joonsoo Kim

On 22.09.20 16:37, Vlastimil Babka wrote:
> We initialize boot-time pagesets with setup_pageset(), which sets high and
> batch values that effectively disable pcplists.
> 
> We can remove this wrapper if we just set these values for all pagesets in
> pageset_init(). Non-boot pagesets then subsequently update them to the proper
> values.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 26069c8d1b19..76c2b4578723 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5823,7 +5823,7 @@ static void build_zonelists(pg_data_t *pgdat)
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static void setup_pageset(struct per_cpu_pageset *p);
> +static void pageset_init(struct per_cpu_pageset *p);
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -5891,7 +5891,7 @@ build_all_zonelists_init(void)
>  	 * (a chicken-egg dilemma).
>  	 */
>  	for_each_possible_cpu(cpu)
> -		setup_pageset(&per_cpu(boot_pageset, cpu));
> +		pageset_init(&per_cpu(boot_pageset, cpu));
>  
>  	mminit_verify_zonelist();
>  	cpuset_init_current_mems_allowed();
> @@ -6210,12 +6210,15 @@ static void pageset_init(struct per_cpu_pageset *p)
>  	pcp = &p->pcp;
>  	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
>  		INIT_LIST_HEAD(&pcp->lists[migratetype]);
> -}
>  
> -static void setup_pageset(struct per_cpu_pageset *p)
> -{
> -	pageset_init(p);
> -	pageset_update(&p->pcp, 0, 1);
> +	/*
> +	 * Set batch and high values safe for a boot pageset. A true percpu
> +	 * pageset's initialization will update them subsequently. Here we don't
> +	 * need to be as careful as pageset_update() as nobody can access the
> +	 * pageset yet.
> +	 */
> +	pcp->high = 0;
> +	pcp->batch = 1;
>  }
>  
>  /*
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb




* Re: [PATCH 4/9] mm, page_alloc: simplify pageset_update()
  2020-09-22 14:37 ` [PATCH 4/9] mm, page_alloc: simplify pageset_update() Vlastimil Babka
@ 2020-09-25 10:23   ` David Hildenbrand
  2020-10-05 13:20   ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2020-09-25 10:23 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador, Joonsoo Kim

On 22.09.20 16:37, Vlastimil Babka wrote:
> pageset_update() attempts to update pcplist's high and batch values in a way
> that readers don't observe batch > high. It uses smp_wmb() to order the updates
> in a way to achieve this. However, without proper pairing read barriers in
> readers this guarantee doesn't hold, and there are no such barriers in
> e.g. free_unref_page_commit().
> 
> Commit 88e8ac11d2ea ("mm, page_alloc: fix core hung in free_pcppages_bulk()")
> already showed this is problematic, and solved this by ultimately only trusting
> pcp->count of the current cpu with interrupts disabled.
> 
> The update dance with unpaired write barriers thus makes no sense. Replace
> them with plain WRITE_ONCE to prevent store tearing, and document that the
> values can change asynchronously and should not be trusted for correctness.
> 
> All current readers appear to be OK after 88e8ac11d2ea. Convert them to
> READ_ONCE to prevent unnecessary read tearing, but mainly to alert anybody
> making future changes to the code that special care is needed.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 40 ++++++++++++++++++----------------------
>  1 file changed, 18 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 76c2b4578723..99b74c1c2b0a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1297,7 +1297,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  {
>  	int migratetype = 0;
>  	int batch_free = 0;
> -	int prefetch_nr = 0;
> +	int prefetch_nr = READ_ONCE(pcp->batch);
>  	bool isolated_pageblocks;
>  	struct page *page, *tmp;
>  	LIST_HEAD(head);
> @@ -1348,8 +1348,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  			 * avoid excessive prefetching due to large count, only
>  			 * prefetch buddy for the first pcp->batch nr of pages.
>  			 */
> -			if (prefetch_nr++ < pcp->batch)
> +			if (prefetch_nr) {
>  				prefetch_buddy(page);
> +				prefetch_nr--;
> +			}
>  		} while (--count && --batch_free && !list_empty(list));
>  	}
>  
> @@ -3131,10 +3133,8 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	list_add(&page->lru, &pcp->lists[migratetype]);
>  	pcp->count++;
> -	if (pcp->count >= pcp->high) {
> -		unsigned long batch = READ_ONCE(pcp->batch);
> -		free_pcppages_bulk(zone, batch, pcp);
> -	}
> +	if (pcp->count >= READ_ONCE(pcp->high))
> +		free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
>  }
>  
>  /*
> @@ -3318,7 +3318,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
>  	do {
>  		if (list_empty(list)) {
>  			pcp->count += rmqueue_bulk(zone, 0,
> -					pcp->batch, list,
> +					READ_ONCE(pcp->batch), list,
>  					migratetype, alloc_flags);
>  			if (unlikely(list_empty(list)))
>  				return NULL;
> @@ -6174,13 +6174,16 @@ static int zone_batchsize(struct zone *zone)
>  }
>  
>  /*
> - * pcp->high and pcp->batch values are related and dependent on one another:
> - * ->batch must never be higher then ->high.
> - * The following function updates them in a safe manner without read side
> - * locking.
> + * pcp->high and pcp->batch values are related and generally batch is lower
> + * than high. They are also related to pcp->count such that count is lower
> + * than high, and as soon as it reaches high, the pcplist is flushed.
>   *
> - * Any new users of pcp->batch and pcp->high should ensure they can cope with
> - * those fields changing asynchronously (acording to the above rule).
> + * However, guaranteeing these relations at all times would require e.g. write
> + * barriers here but also careful usage of read barriers at the read side, and
> + * thus be prone to error and bad for performance. Thus the update only prevents
> + * store tearing. Any new users of pcp->batch and pcp->high should ensure they
> + * can cope with those fields changing asynchronously, and fully trust only the
> + * pcp->count field on the local CPU with interrupts disabled.
>   *
>   * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
>   * outside of boot time (or some other assurance that no concurrent updaters
> @@ -6189,15 +6192,8 @@ static int zone_batchsize(struct zone *zone)
>  static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
>  		unsigned long batch)
>  {
> -       /* start with a fail safe value for batch */
> -	pcp->batch = 1;
> -	smp_wmb();
> -
> -       /* Update high, then batch, in order */
> -	pcp->high = high;
> -	smp_wmb();
> -
> -	pcp->batch = batch;
> +	WRITE_ONCE(pcp->batch, batch);
> +	WRITE_ONCE(pcp->high, high);
>  }
>  
>  static void pageset_init(struct per_cpu_pageset *p)
> 

I *think* this is okay and obviously simplifies things. But to be 100%
sure, I'll have to rely on your judgment :)

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb
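
For readers less familiar with WRITE_ONCE()/READ_ONCE(), here is a userspace
C11 analogue of the simplified pageset_update() above, with relaxed atomics
standing in for the kernel primitives. The struct and the example values are
made up; the point is that tearing of the individual stores is prevented while
no ordering between high and batch is implied, which is exactly the relaxed
contract the commit message documents.

#include <stdatomic.h>
#include <stdio.h>

struct pcp_demo {
	_Atomic unsigned long high;
	_Atomic unsigned long batch;
};

/* analogue of the simplified pageset_update(): no ordering implied */
static void pageset_update_demo(struct pcp_demo *pcp,
				unsigned long high, unsigned long batch)
{
	atomic_store_explicit(&pcp->batch, batch, memory_order_relaxed);
	atomic_store_explicit(&pcp->high, high, memory_order_relaxed);
}

int main(void)
{
	struct pcp_demo pcp = { 0, 1 };	/* boot-pageset-like values */

	pageset_update_demo(&pcp, 378, 63);

	/*
	 * A concurrent reader could observe any mix of old and new values
	 * here; per the comment added by the patch, only pcp->count on the
	 * local CPU with interrupts disabled is fully trusted.
	 */
	printf("high=%lu batch=%lu\n",
	       atomic_load_explicit(&pcp.high, memory_order_relaxed),
	       atomic_load_explicit(&pcp.batch, memory_order_relaxed));
	return 0;
}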




* Re: [PATCH 5/9] mm, page_alloc: make per_cpu_pageset accessible only after init
  2020-09-22 14:37 ` [PATCH 5/9] mm, page_alloc: make per_cpu_pageset accessible only after init Vlastimil Babka
@ 2020-09-25 10:25   ` David Hildenbrand
  2020-10-05 13:24   ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2020-09-25 10:25 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador, Joonsoo Kim

On 22.09.20 16:37, Vlastimil Babka wrote:
> setup_zone_pageset() replaces the boot_pageset by allocating and initializing a
> proper percpu one. Currently it assigns zone->pageset with the newly allocated
> one before initializing it. That's currently not an issue, because the zone
> should not be in any zonelist, thus not visible to allocators at this point.
> 
> Memory ordering between the pcplist contents and its visibility is also not
> guaranteed here, but that also shouldn't be an issue because online_pages()
> does a spin_unlock(pgdat->node_size_lock) before building the zonelists.
> 
> However it's best that we don't silently rely on operations that can be changed
> in the future. Make sure only properly initialized pcplists are visible, using
> smp_store_release(). The read side has a data dependency via the zone->pageset
> pointer instead of an explicit read barrier.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 99b74c1c2b0a..de3b48bda45c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6246,15 +6246,17 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>  
>  void __meminit setup_zone_pageset(struct zone *zone)
>  {
> +	struct per_cpu_pageset __percpu * new_pageset;
>  	struct per_cpu_pageset *p;
>  	int cpu;
>  
> -	zone->pageset = alloc_percpu(struct per_cpu_pageset);
> +	new_pageset = alloc_percpu(struct per_cpu_pageset);
>  	for_each_possible_cpu(cpu) {
> -		p = per_cpu_ptr(zone->pageset, cpu);
> +		p = per_cpu_ptr(new_pageset, cpu);
>  		pageset_init(p);
>  	}
>  
> +	smp_store_release(&zone->pageset, new_pageset);
>  	zone_set_pageset_high_and_batch(zone);
>  }
>  
> 

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb
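
The publish-after-init pattern in the patch above, sketched as a runnable
userspace analogue with C11 atomics standing in for smp_store_release()
(types and names below are illustrative, not taken from the kernel):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct pageset_demo { int high; int batch; };

static _Atomic(struct pageset_demo *) zone_pageset;

static void setup_zone_pageset_demo(void)
{
	struct pageset_demo *p = malloc(sizeof(*p));

	if (!p)
		return;
	p->high = 0;		/* fully initialize first ... */
	p->batch = 1;
	/* ... then publish, so readers cannot see a half-initialized pageset */
	atomic_store_explicit(&zone_pageset, p, memory_order_release);
}

static void reader_demo(void)
{
	struct pageset_demo *p = atomic_load_explicit(&zone_pageset,
						      memory_order_acquire);

	/*
	 * The kernel side relies on the address dependency through the
	 * pointer instead of an acquire load, but the reader-side guarantee
	 * is the same: if the pointer is seen, the contents are initialized.
	 */
	if (p)
		printf("high=%d batch=%d\n", p->high, p->batch);
}

int main(void)
{
	setup_zone_pageset_demo();
	reader_demo();
	return 0;
}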




* Re: [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone
  2020-09-22 14:37 ` [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone Vlastimil Babka
@ 2020-09-25 10:34   ` David Hildenbrand
  2020-10-06 22:31     ` Vlastimil Babka
  2020-10-05 13:28   ` Michal Hocko
  1 sibling, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2020-09-25 10:34 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador, Joonsoo Kim

On 22.09.20 16:37, Vlastimil Babka wrote:
> All per-cpu pagesets for a zone use the same high and batch values, which are
> duplicated there just for performance (locality) reasons. This patch adds the
> same variables also to struct zone as a shared copy.
> 
> This will be useful later for making it possible to disable pcplists temporarily
> by setting the high value to 0, while remembering the values for restoring them
> later. But we can also immediately benefit from not updating pagesets of all
> possible cpus in case the newly recalculated values (after sysctl change or
> memory online/offline) are actually unchanged from the previous ones.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/mmzone.h |  6 ++++++
>  mm/page_alloc.c        | 16 ++++++++++++++--
>  2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 90721f3156bc..7ad3f14dbe88 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -470,6 +470,12 @@ struct zone {
>  #endif
>  	struct pglist_data	*zone_pgdat;
>  	struct per_cpu_pageset __percpu *pageset;
> +	/*
> +	 * the high and batch values are copied to individual pagesets for
> +	 * faster access
> +	 */
> +	int pageset_high;
> +	int pageset_batch;
>  
>  #ifndef CONFIG_SPARSEMEM
>  	/*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index de3b48bda45c..901907799bdc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5824,6 +5824,8 @@ static void build_zonelists(pg_data_t *pgdat)
>   * Other parts of the kernel may not check if the zone is available.
>   */
>  static void pageset_init(struct per_cpu_pageset *p);
> +#define BOOT_PAGESET_HIGH	0
> +#define BOOT_PAGESET_BATCH	1

Much better. A comment would have been nice ("this disables the pcp via
the boot pageset completely.") :) (I'm pretty sure I'd forget at one
point what these values mean)

>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -6213,8 +6215,8 @@ static void pageset_init(struct per_cpu_pageset *p)
>  	 * need to be as careful as pageset_update() as nobody can access the
>  	 * pageset yet.
>  	 */
> -	pcp->high = 0;
> -	pcp->batch = 1;
> +	pcp->high = BOOT_PAGESET_HIGH;
> +	pcp->batch = BOOT_PAGESET_BATCH;
>  }
>  
>  /*
> @@ -6238,6 +6240,14 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>  		new_batch = max(1UL, 1 * new_batch);
>  	}
>  
> +	if (zone->pageset_high != new_high ||
> +	    zone->pageset_batch != new_batch) {
> +		zone->pageset_high = new_high;
> +		zone->pageset_batch = new_batch;
> +	} else {
> +		return;
> +	}
> +
>  	for_each_possible_cpu(cpu) {
>  		p = per_cpu_ptr(zone->pageset, cpu);
>  		pageset_update(&p->pcp, new_high, new_batch);
> @@ -6300,6 +6310,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
>  	 * offset of a (static) per cpu variable into the per cpu area.
>  	 */
>  	zone->pageset = &boot_pageset;
> +	zone->pageset_high = BOOT_PAGESET_HIGH;
> +	zone->pageset_batch = BOOT_PAGESET_BATCH;

I do wonder if copying from any cpuvar inside boot_pageset is cleaner.

zone->pageset_high = this_cpu_ptr(zone->pageset)->pcp.high;
...


-- 
Thanks,

David / dhildenb




* Re: [PATCH 7/9] mm, page_alloc: move draining pcplists to page isolation users
  2020-09-22 14:37 ` [PATCH 7/9] mm, page_alloc: move draining pcplists to page isolation users Vlastimil Babka
@ 2020-09-25 10:39   ` David Hildenbrand
  2020-10-05 13:57   ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2020-09-25 10:39 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador, Joonsoo Kim

On 22.09.20 16:37, Vlastimil Babka wrote:
> Currently, pcplists are drained during set_migratetype_isolate() which means
> once per pageblock processed by start_isolate_page_range(). This is somewhat
> wasteful. Moreover, the callers might need different guarantees, and the
> draining is currently prone to races and does not guarantee that no page
> from isolated pageblock will end up on the pcplist after the drain.
> 
> Better guarantees are added by later patches and require explicit actions
> by page isolation users that need them. Thus it makes sense to move the
> current imperfect draining to the callers also as a preparation step.
> 
> Suggested-by: Pavel Tatashin <pasha.tatashin@soleen.com>

I would have guessed I suggested that one first as well :)

https://lkml.kernel.org/r/6ec66eb9-eeba-5076-af97-cef59ed5cbaa@redhat.com

(But honestly, I don't care :), just stumbled over it)

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/memory_hotplug.c | 11 ++++++-----
>  mm/page_alloc.c     |  2 ++
>  mm/page_isolation.c | 10 +++++-----
>  3 files changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9db80ee29caa..08f729922e18 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1524,6 +1524,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  		goto failed_removal;
>  	}
>  
> +	drain_all_pages(zone);
> +
>  	arg.start_pfn = start_pfn;
>  	arg.nr_pages = nr_pages;
>  	node_states_check_changes_offline(nr_pages, zone, &arg);
> @@ -1574,11 +1576,10 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  		}
>  
>  		/*
> -		 * per-cpu pages are drained in start_isolate_page_range, but if
> -		 * there are still pages that are not free, make sure that we
> -		 * drain again, because when we isolated range we might
> -		 * have raced with another thread that was adding pages to pcp
> -		 * list.
> +		 * per-cpu pages are drained after start_isolate_page_range, but
> +		 * if there are still pages that are not free, make sure that we
> +		 * drain again, because when we isolated range we might have
> +		 * raced with another thread that was adding pages to pcp list.
>  		 *
>  		 * Forward progress should be still guaranteed because
>  		 * pages on the pcp list can only belong to MOVABLE_ZONE
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 901907799bdc..4e37bc3f6077 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8432,6 +8432,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	if (ret)
>  		return ret;
>  
> +	drain_all_pages(cc.zone);
> +
>  	/*
>  	 * In case of -EBUSY, we'd like to know which page causes problem.
>  	 * So, just fall through. test_pages_isolated() has a tracepoint
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index abfe26ad59fd..57d8bc1e7760 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -49,7 +49,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>  
>  		__mod_zone_freepage_state(zone, -nr_pages, mt);
>  		spin_unlock_irqrestore(&zone->lock, flags);
> -		drain_all_pages(zone);
>  		return 0;
>  	}
>  
> @@ -167,11 +166,12 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   *
>   * Please note that there is no strong synchronization with the page allocator
>   * either. Pages might be freed while their page blocks are marked ISOLATED.
> - * In some cases pages might still end up on pcp lists and that would allow
> + * A call to drain_all_pages() after isolation can flush most of them. However
> + * in some cases pages might still end up on pcp lists and that would allow
>   * for their allocation even when they are in fact isolated already. Depending
> - * on how strong of a guarantee the caller needs drain_all_pages might be needed
> - * (e.g. __offline_pages will need to call it after check for isolated range for
> - * a next retry).
> + * on how strong of a guarantee the caller needs, further drain_all_pages()
> + * might be needed (e.g. __offline_pages will need to call it after check for
> + * isolated range for a next retry).
>   *
>   * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
>   */
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb




* Re: [PATCH 8/9] mm, page_alloc: drain all pcplists during memory offline
  2020-09-22 14:37 ` [PATCH 8/9] mm, page_alloc: drain all pcplists during memory offline Vlastimil Babka
@ 2020-09-25 10:46   ` David Hildenbrand
  2020-10-05 14:03     ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2020-09-25 10:46 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador, Joonsoo Kim

On 22.09.20 16:37, Vlastimil Babka wrote:
> drain_all_pages() is optimized to only execute on cpus where pcplists are not
> empty. The check can however race with a free to pcplist that has not yet
> increased the pcp->count from 0 to 1. Make the drain optionally skip the racy
> check and drain on all cpus, and use it in memory offline context, where we
> want to make sure no isolated pages are left behind on pcplists.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/gfp.h |  1 +
>  mm/memory_hotplug.c |  4 ++--
>  mm/page_alloc.c     | 29 ++++++++++++++++++++---------
>  3 files changed, 23 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 67a0774e080b..cc52c5cc9fab 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -592,6 +592,7 @@ extern void page_frag_free(void *addr);
>  
>  void page_alloc_init(void);
>  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
> +void __drain_all_pages(struct zone *zone, bool page_isolation);
>  void drain_all_pages(struct zone *zone);
>  void drain_local_pages(struct zone *zone);
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 08f729922e18..bbde415b558b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1524,7 +1524,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  		goto failed_removal;
>  	}
>  
> -	drain_all_pages(zone);
> +	__drain_all_pages(zone, true);
>  
>  	arg.start_pfn = start_pfn;
>  	arg.nr_pages = nr_pages;
> @@ -1588,7 +1588,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  		 */
>  		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>  		if (ret)
> -			drain_all_pages(zone);
> +			__drain_all_pages(zone, true);
>  	} while (ret);
>  
>  	/* Mark all sections offline and remove free pages from the buddy. */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4e37bc3f6077..33cc35d152b1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2960,14 +2960,7 @@ static void drain_local_pages_wq(struct work_struct *work)
>  	preempt_enable();
>  }
>  
> -/*
> - * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> - *
> - * When zone parameter is non-NULL, spill just the single zone's pages.
> - *
> - * Note that this can be extremely slow as the draining happens in a workqueue.
> - */
> -void drain_all_pages(struct zone *zone)
> +void __drain_all_pages(struct zone *zone, bool force_all_cpus)
>  {
>  	int cpu;
>  
> @@ -3006,7 +2999,13 @@ void drain_all_pages(struct zone *zone)
>  		struct zone *z;
>  		bool has_pcps = false;
>  
> -		if (zone) {
> +		if (force_all_cpus) {
> +			/*
> +			 * The pcp.count check is racy, some callers need a
> +			 * guarantee that no cpu is missed.
> +			 */
> +			has_pcps = true;
> +		} else if (zone) {
>  			pcp = per_cpu_ptr(zone->pageset, cpu);
>  			if (pcp->pcp.count)
>  				has_pcps = true;
> @@ -3039,6 +3038,18 @@ void drain_all_pages(struct zone *zone)
>  	mutex_unlock(&pcpu_drain_mutex);
>  }
>  
> +/*
> + * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> + *
> + * When zone parameter is non-NULL, spill just the single zone's pages.
> + *
> + * Note that this can be extremely slow as the draining happens in a workqueue.
> + */
> +void drain_all_pages(struct zone *zone)
> +{
> +	__drain_all_pages(zone, false);
> +}
> +
>  #ifdef CONFIG_HIBERNATION
>  
>  /*
> 

Interesting race. Instead of this ugly __drain_all_pages() with a
boolean parameter, can we have two properly named functions to be used
in !page_alloc.c code without scratching your head what the difference is?

(yeah, coming up with a proper name is difficult. The one gives more
guarantees than the other, which cannot really be deduced from
"force_all_cpus" - maybe we can encode the actual semantics in the name)


-- 
Thanks,

David / dhildenb
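
One possible shape for the naming suggestion above, as a sketch on top of the
patch (the second function name is made up here for illustration; the posted
patch itself only exposes __drain_all_pages() plus the existing
drain_all_pages() wrapper):

/* cheap variant: may skip CPUs whose pcp->count races to 0 */
void drain_all_pages(struct zone *zone)
{
	__drain_all_pages(zone, false);
}

/*
 * guaranteed variant for memory-offline-style callers: schedule the drain
 * on every CPU, ignoring the racy pcp->count check
 */
void drain_all_pages_all_cpus(struct zone *zone)
{
	__drain_all_pages(zone, true);
}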




* Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-09-22 14:37 ` [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation Vlastimil Babka
@ 2020-09-25 10:53   ` David Hildenbrand
  2020-09-25 10:54     ` David Hildenbrand
  2020-10-06  8:34   ` Michal Hocko
  1 sibling, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2020-09-25 10:53 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador,
	Joonsoo Kim, Michal Hocko

On 22.09.20 16:37, Vlastimil Babka wrote:
> Page isolation can race with process freeing pages to pcplists in a way that
> a page from isolated pageblock can end up on pcplist. This can be fixed by
> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
> per-cpu pages again during memory offline" in [1].
> 
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation who need stronger guarantees don't need to repeatedly drain.
> David suggested disabling pcplists usage completely during page isolation,
> instead of repeatedly draining them.
> 
> To achieve this without adding special cases in alloc/free fastpath, we can use
> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
> 
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once, before calling start_isolate_page_range(). The draining will serialize
> after processes that already disabled interrupts and read the old value of
> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> happens.
> 
> This patch thus adds zone_pcplist_disable() and zone_pcplist_enable() functions
> that page isolation users can call before start_isolate_page_range() and after
> unisolating (or offlining) the isolated pages. A new zone->pcplist_disabled
> atomic variable makes sure we disable only pcplists once and don't enable
> them prematurely in case there are multiple users in parallel.
> 
> We however have to avoid external updates to high and batch by taking
> pcp_batch_high_lock. To allow multiple isolations in parallel, change this lock
> from mutex to rwsem.
> 
> Currently the only user of this functionality is offline_pages().
> 
> [1] https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatashin@soleen.com/
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/mmzone.h         |  2 ++
>  include/linux/page-isolation.h |  2 ++
>  mm/internal.h                  |  4 ++++
>  mm/memory_hotplug.c            | 28 ++++++++++++----------------
>  mm/page_alloc.c                | 29 ++++++++++++++++++-----------
>  mm/page_isolation.c            | 22 +++++++++++++++++++---
>  6 files changed, 57 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7ad3f14dbe88..3c653d6348b1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -536,6 +536,8 @@ struct zone {
>  	 * of pageblock. Protected by zone->lock.
>  	 */
>  	unsigned long		nr_isolate_pageblock;
> +	/* Track requests for disabling pcplists */
> +	atomic_t		pcplist_disabled;
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 572458016331..1dec5d0f62a6 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
>  void set_pageblock_migratetype(struct page *page, int migratetype);
>  int move_freepages_block(struct zone *zone, struct page *page,
>  				int migratetype, int *num_movable);
> +void zone_pcplist_disable(struct zone *zone);
> +void zone_pcplist_enable(struct zone *zone);
>  
>  /*
>   * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
> diff --git a/mm/internal.h b/mm/internal.h
> index ea66ef45da6c..154e38e702dd 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -7,6 +7,7 @@
>  #ifndef __MM_INTERNAL_H
>  #define __MM_INTERNAL_H
>  
> +#include <linux/rwsem.h>
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
> @@ -199,8 +200,11 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
>  					gfp_t gfp_flags);
>  extern int user_min_free_kbytes;
>  
> +extern struct rw_semaphore pcp_batch_high_lock;
>  extern void zone_pcp_update(struct zone *zone);
>  extern void zone_pcp_reset(struct zone *zone);
> +extern void zone_update_pageset_high_and_batch(struct zone *zone,
> +		unsigned long high, unsigned long batch);
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index bbde415b558b..6fe9be550160 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1515,17 +1515,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	}
>  	node = zone_to_nid(zone);
>  
> +	/*
> +	 * Disable pcplists so that page isolation cannot race with freeing
> +	 * in a way that pages from isolated pageblock are left on pcplists.
> +	 */
> +	zone_pcplist_disable(zone);
> +
>  	/* set above range as isolated */
>  	ret = start_isolate_page_range(start_pfn, end_pfn,
>  				       MIGRATE_MOVABLE,
>  				       MEMORY_OFFLINE | REPORT_FAILURE);
>  	if (ret) {
>  		reason = "failure to isolate range";
> -		goto failed_removal;
> +		goto failed_removal_pcplists_disabled;
>  	}
>  
> -	__drain_all_pages(zone, true);
> -
>  	arg.start_pfn = start_pfn;
>  	arg.nr_pages = nr_pages;
>  	node_states_check_changes_offline(nr_pages, zone, &arg);
> @@ -1575,20 +1579,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  			goto failed_removal_isolated;
>  		}
>  
> -		/*
> -		 * per-cpu pages are drained after start_isolate_page_range, but
> -		 * if there are still pages that are not free, make sure that we
> -		 * drain again, because when we isolated range we might have
> -		 * raced with another thread that was adding pages to pcp list.
> -		 *
> -		 * Forward progress should be still guaranteed because
> -		 * pages on the pcp list can only belong to MOVABLE_ZONE
> -		 * because has_unmovable_pages explicitly checks for
> -		 * PageBuddy on freed pages on other zones.
> -		 */
>  		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> -		if (ret)
> -			__drain_all_pages(zone, true);
> +
>  	} while (ret);
>  
>  	/* Mark all sections offline and remove free pages from the buddy. */
> @@ -1604,6 +1596,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
> +	zone_pcplist_enable(zone);
> +
>  	/* removal success */
>  	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
>  	zone->present_pages -= nr_pages;
> @@ -1637,6 +1631,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  failed_removal_isolated:
>  	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>  	memory_notify(MEM_CANCEL_OFFLINE, &arg);
> +failed_removal_pcplists_disabled:
> +	zone_pcplist_enable(zone);
>  failed_removal:
>  	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
>  		 (unsigned long long) start_pfn << PAGE_SHIFT,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 33cc35d152b1..673d353f9311 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -78,7 +78,7 @@
>  #include "page_reporting.h"
>  
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> -static DEFINE_MUTEX(pcp_batch_high_lock);
> +DECLARE_RWSEM(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_FRACTION	(8)
>  
>  #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
> @@ -6230,6 +6230,18 @@ static void pageset_init(struct per_cpu_pageset *p)
>  	pcp->batch = BOOT_PAGESET_BATCH;
>  }
>  
> +void zone_update_pageset_high_and_batch(struct zone *zone, unsigned long high,
> +		unsigned long batch)
> +{
> +	struct per_cpu_pageset *p;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		p = per_cpu_ptr(zone->pageset, cpu);
> +		pageset_update(&p->pcp, high, batch);
> +	}
> +}
> +
>  /*
>   * Calculate and set new high and batch values for all per-cpu pagesets of a
>   * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
> @@ -6237,8 +6249,6 @@ static void pageset_init(struct per_cpu_pageset *p)
>  static void zone_set_pageset_high_and_batch(struct zone *zone)
>  {
>  	unsigned long new_high, new_batch;
> -	struct per_cpu_pageset *p;
> -	int cpu;
>  
>  	if (percpu_pagelist_fraction) {
>  		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> @@ -6259,10 +6269,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>  		return;
>  	}
>  
> -	for_each_possible_cpu(cpu) {
> -		p = per_cpu_ptr(zone->pageset, cpu);
> -		pageset_update(&p->pcp, new_high, new_batch);
> -	}
> +	zone_update_pageset_high_and_batch(zone, new_high, new_batch);
>  }
>  
>  void __meminit setup_zone_pageset(struct zone *zone)
> @@ -8024,7 +8031,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
>  	int old_percpu_pagelist_fraction;
>  	int ret;
>  
> -	mutex_lock(&pcp_batch_high_lock);
> +	down_write(&pcp_batch_high_lock);
>  	old_percpu_pagelist_fraction = percpu_pagelist_fraction;
>  
>  	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> @@ -8046,7 +8053,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
>  	for_each_populated_zone(zone)
>  		zone_set_pageset_high_and_batch(zone);
>  out:
> -	mutex_unlock(&pcp_batch_high_lock);
> +	up_write(&pcp_batch_high_lock);
>  	return ret;
>  }
>  
> @@ -8652,9 +8659,9 @@ EXPORT_SYMBOL(free_contig_range);
>   */
>  void __meminit zone_pcp_update(struct zone *zone)
>  {
> -	mutex_lock(&pcp_batch_high_lock);
> +	down_write(&pcp_batch_high_lock);
>  	zone_set_pageset_high_and_batch(zone);
> -	mutex_unlock(&pcp_batch_high_lock);
> +	up_write(&pcp_batch_high_lock);
>  }
>  
>  void zone_pcp_reset(struct zone *zone)
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 57d8bc1e7760..c0e895e8f322 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -15,6 +15,22 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/page_isolation.h>
>  
> +void zone_pcplist_disable(struct zone *zone)
> +{
> +	down_read(&pcp_batch_high_lock);
> +	if (atomic_inc_return(&zone->pcplist_disabled) == 1) {
> +		zone_update_pageset_high_and_batch(zone, 0, 1);
> +		__drain_all_pages(zone, true);
> +	}

Hm, if one CPU is still inside the if-clause, the other one would
continue, however pcp would not be disabled and zones not drained when
returning.

(while we only allow a single Offline_pages() call, it will be different
when we use the function in other context - especially,
alloc_contig_range() for some users)

Can't we use down_write() here? So it's serialized and everybody has to
properly wait. (and we would not have to rely on an atomic_t)

> +}
> +
> +void zone_pcplist_enable(struct zone *zone)
> +{
> +	if (atomic_dec_return(&zone->pcplist_disabled) == 0)
> +		zone_update_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
> +	up_read(&pcp_batch_high_lock);
> +}
> +
>  static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags)
>  {
>  	struct zone *zone = page_zone(page);

In general, I like that approach. But I do wonder if we can tweak the
rwsem + atomic_t handling.

-- 
Thanks,

David / dhildenb




* Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-09-25 10:53   ` David Hildenbrand
@ 2020-09-25 10:54     ` David Hildenbrand
  2020-09-25 11:10       ` Vlastimil Babka
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2020-09-25 10:54 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador,
	Joonsoo Kim, Michal Hocko

On 25.09.20 12:53, David Hildenbrand wrote:
> On 22.09.20 16:37, Vlastimil Babka wrote:
>> Page isolation can race with process freeing pages to pcplists in a way that
>> a page from isolated pageblock can end up on pcplist. This can be fixed by
>> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
>> per-cpu pages again during memory offline" in [1].
>>
>> David and Michal would prefer that this race was closed in a way that callers
>> of page isolation who need stronger guarantees don't need to repeatedly drain.
>> David suggested disabling pcplists usage completely during page isolation,
>> instead of repeatedly draining them.
>>
>> To achieve this without adding special cases in alloc/free fastpath, we can use
>> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
>> will be immediately flushed.
>>
>> The race can thus be closed by setting pcp->high to 0 and draining pcplists
>> once, before calling start_isolate_page_range(). The draining will serialize
>> after processes that already disabled interrupts and read the old value of
>> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
>> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
>> pcplists. This guarantees no stray pages on pcplists in zones where isolation
>> happens.
>>
>> This patch thus adds zone_pcplist_disable() and zone_pcplist_enable() functions
>> that page isolation users can call before start_isolate_page_range() and after
>> unisolating (or offlining) the isolated pages. A new zone->pcplist_disabled
>> atomic variable makes sure we disable only pcplists once and don't enable
>> them prematurely in case there are multiple users in parallel.
>>
>> We however have to avoid external updates to high and batch by taking
>> pcp_batch_high_lock. To allow multiple isolations in parallel, change this lock
>> from mutex to rwsem.
>>
>> Currently the only user of this functionality is offline_pages().
>>
>> [1] https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatashin@soleen.com/
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Suggested-by: Michal Hocko <mhocko@suse.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  include/linux/mmzone.h         |  2 ++
>>  include/linux/page-isolation.h |  2 ++
>>  mm/internal.h                  |  4 ++++
>>  mm/memory_hotplug.c            | 28 ++++++++++++----------------
>>  mm/page_alloc.c                | 29 ++++++++++++++++++-----------
>>  mm/page_isolation.c            | 22 +++++++++++++++++++---
>>  6 files changed, 57 insertions(+), 30 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 7ad3f14dbe88..3c653d6348b1 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -536,6 +536,8 @@ struct zone {
>>  	 * of pageblock. Protected by zone->lock.
>>  	 */
>>  	unsigned long		nr_isolate_pageblock;
>> +	/* Track requests for disabling pcplists */
>> +	atomic_t		pcplist_disabled;
>>  #endif
>>  
>>  #ifdef CONFIG_MEMORY_HOTPLUG
>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>> index 572458016331..1dec5d0f62a6 100644
>> --- a/include/linux/page-isolation.h
>> +++ b/include/linux/page-isolation.h
>> @@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
>>  void set_pageblock_migratetype(struct page *page, int migratetype);
>>  int move_freepages_block(struct zone *zone, struct page *page,
>>  				int migratetype, int *num_movable);
>> +void zone_pcplist_disable(struct zone *zone);
>> +void zone_pcplist_enable(struct zone *zone);
>>  
>>  /*
>>   * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
>> diff --git a/mm/internal.h b/mm/internal.h
>> index ea66ef45da6c..154e38e702dd 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -7,6 +7,7 @@
>>  #ifndef __MM_INTERNAL_H
>>  #define __MM_INTERNAL_H
>>  
>> +#include <linux/rwsem.h>
>>  #include <linux/fs.h>
>>  #include <linux/mm.h>
>>  #include <linux/pagemap.h>
>> @@ -199,8 +200,11 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
>>  					gfp_t gfp_flags);
>>  extern int user_min_free_kbytes;
>>  
>> +extern struct rw_semaphore pcp_batch_high_lock;
>>  extern void zone_pcp_update(struct zone *zone);
>>  extern void zone_pcp_reset(struct zone *zone);
>> +extern void zone_update_pageset_high_and_batch(struct zone *zone,
>> +		unsigned long high, unsigned long batch);
>>  
>>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>  
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index bbde415b558b..6fe9be550160 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1515,17 +1515,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>>  	}
>>  	node = zone_to_nid(zone);
>>  
>> +	/*
>> +	 * Disable pcplists so that page isolation cannot race with freeing
>> +	 * in a way that pages from isolated pageblock are left on pcplists.
>> +	 */
>> +	zone_pcplist_disable(zone);
>> +
>>  	/* set above range as isolated */
>>  	ret = start_isolate_page_range(start_pfn, end_pfn,
>>  				       MIGRATE_MOVABLE,
>>  				       MEMORY_OFFLINE | REPORT_FAILURE);
>>  	if (ret) {
>>  		reason = "failure to isolate range";
>> -		goto failed_removal;
>> +		goto failed_removal_pcplists_disabled;
>>  	}
>>  
>> -	__drain_all_pages(zone, true);
>> -
>>  	arg.start_pfn = start_pfn;
>>  	arg.nr_pages = nr_pages;
>>  	node_states_check_changes_offline(nr_pages, zone, &arg);
>> @@ -1575,20 +1579,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>>  			goto failed_removal_isolated;
>>  		}
>>  
>> -		/*
>> -		 * per-cpu pages are drained after start_isolate_page_range, but
>> -		 * if there are still pages that are not free, make sure that we
>> -		 * drain again, because when we isolated range we might have
>> -		 * raced with another thread that was adding pages to pcp list.
>> -		 *
>> -		 * Forward progress should be still guaranteed because
>> -		 * pages on the pcp list can only belong to MOVABLE_ZONE
>> -		 * because has_unmovable_pages explicitly checks for
>> -		 * PageBuddy on freed pages on other zones.
>> -		 */
>>  		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>> -		if (ret)
>> -			__drain_all_pages(zone, true);
>> +
>>  	} while (ret);
>>  
>>  	/* Mark all sections offline and remove free pages from the buddy. */
>> @@ -1604,6 +1596,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>>  	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
>>  	spin_unlock_irqrestore(&zone->lock, flags);
>>  
>> +	zone_pcplist_enable(zone);
>> +
>>  	/* removal success */
>>  	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
>>  	zone->present_pages -= nr_pages;
>> @@ -1637,6 +1631,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>>  failed_removal_isolated:
>>  	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>>  	memory_notify(MEM_CANCEL_OFFLINE, &arg);
>> +failed_removal_pcplists_disabled:
>> +	zone_pcplist_enable(zone);
>>  failed_removal:
>>  	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
>>  		 (unsigned long long) start_pfn << PAGE_SHIFT,
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 33cc35d152b1..673d353f9311 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -78,7 +78,7 @@
>>  #include "page_reporting.h"
>>  
>>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>> -static DEFINE_MUTEX(pcp_batch_high_lock);
>> +DECLARE_RWSEM(pcp_batch_high_lock);
>>  #define MIN_PERCPU_PAGELIST_FRACTION	(8)
>>  
>>  #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
>> @@ -6230,6 +6230,18 @@ static void pageset_init(struct per_cpu_pageset *p)
>>  	pcp->batch = BOOT_PAGESET_BATCH;
>>  }
>>  
>> +void zone_update_pageset_high_and_batch(struct zone *zone, unsigned long high,
>> +		unsigned long batch)
>> +{
>> +	struct per_cpu_pageset *p;
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		p = per_cpu_ptr(zone->pageset, cpu);
>> +		pageset_update(&p->pcp, high, batch);
>> +	}
>> +}
>> +
>>  /*
>>   * Calculate and set new high and batch values for all per-cpu pagesets of a
>>   * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
>> @@ -6237,8 +6249,6 @@ static void pageset_init(struct per_cpu_pageset *p)
>>  static void zone_set_pageset_high_and_batch(struct zone *zone)
>>  {
>>  	unsigned long new_high, new_batch;
>> -	struct per_cpu_pageset *p;
>> -	int cpu;
>>  
>>  	if (percpu_pagelist_fraction) {
>>  		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
>> @@ -6259,10 +6269,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>>  		return;
>>  	}
>>  
>> -	for_each_possible_cpu(cpu) {
>> -		p = per_cpu_ptr(zone->pageset, cpu);
>> -		pageset_update(&p->pcp, new_high, new_batch);
>> -	}
>> +	zone_update_pageset_high_and_batch(zone, new_high, new_batch);
>>  }
>>  
>>  void __meminit setup_zone_pageset(struct zone *zone)
>> @@ -8024,7 +8031,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
>>  	int old_percpu_pagelist_fraction;
>>  	int ret;
>>  
>> -	mutex_lock(&pcp_batch_high_lock);
>> +	down_write(&pcp_batch_high_lock);
>>  	old_percpu_pagelist_fraction = percpu_pagelist_fraction;
>>  
>>  	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
>> @@ -8046,7 +8053,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
>>  	for_each_populated_zone(zone)
>>  		zone_set_pageset_high_and_batch(zone);
>>  out:
>> -	mutex_unlock(&pcp_batch_high_lock);
>> +	up_write(&pcp_batch_high_lock);
>>  	return ret;
>>  }
>>  
>> @@ -8652,9 +8659,9 @@ EXPORT_SYMBOL(free_contig_range);
>>   */
>>  void __meminit zone_pcp_update(struct zone *zone)
>>  {
>> -	mutex_lock(&pcp_batch_high_lock);
>> +	down_write(&pcp_batch_high_lock);
>>  	zone_set_pageset_high_and_batch(zone);
>> -	mutex_unlock(&pcp_batch_high_lock);
>> +	up_write(&pcp_batch_high_lock);
>>  }
>>  
>>  void zone_pcp_reset(struct zone *zone)
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index 57d8bc1e7760..c0e895e8f322 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -15,6 +15,22 @@
>>  #define CREATE_TRACE_POINTS
>>  #include <trace/events/page_isolation.h>
>>  
>> +void zone_pcplist_disable(struct zone *zone)
>> +{
>> +	down_read(&pcp_batch_high_lock);
>> +	if (atomic_inc_return(&zone->pcplist_disabled) == 1) {
>> +		zone_update_pageset_high_and_batch(zone, 0, 1);
>> +		__drain_all_pages(zone, true);
>> +	}
> 
> Hm, if one CPU is still inside the if-clause, the other one would
> continue, however pcp wpould not be disabled and zones not drained when
> returning.
> 
> (while we only allow a single Offline_pages() call, it will be different
> when we use the function in other context - especially,
> alloc_contig_range() for some users)
> 
> Can't we use down_write() here? So it's serialized and everybody has to
> properly wait. (and we would not have to rely on an atomic_t)

Sorry, I meant down_write only temporarily in this code path. Not
keeping it locked in write when returning (I remember there is a way to
downgrade).


-- 
Thanks,

David / dhildenb




* Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-09-25 10:54     ` David Hildenbrand
@ 2020-09-25 11:10       ` Vlastimil Babka
  2020-10-01  8:47         ` David Hildenbrand
  2020-10-05 14:05         ` Michal Hocko
  0 siblings, 2 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-09-25 11:10 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador,
	Joonsoo Kim, Michal Hocko

On 9/25/20 12:54 PM, David Hildenbrand wrote:
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -15,6 +15,22 @@
>>>  #define CREATE_TRACE_POINTS
>>>  #include <trace/events/page_isolation.h>
>>>  
>>> +void zone_pcplist_disable(struct zone *zone)
>>> +{
>>> +	down_read(&pcp_batch_high_lock);
>>> +	if (atomic_inc_return(&zone->pcplist_disabled) == 1) {
>>> +		zone_update_pageset_high_and_batch(zone, 0, 1);
>>> +		__drain_all_pages(zone, true);
>>> +	}
>> Hm, if one CPU is still inside the if-clause, the other one would
>> continue, however pcp would not be disabled and zones not drained when
>> returning.

Ah, well spotted, thanks!

>> (while we only allow a single Offline_pages() call, it will be different
>> when we use the function in other context - especially,
>> alloc_contig_range() for some users)
>>
>> Can't we use down_write() here? So it's serialized and everybody has to
>> properly wait. (and we would not have to rely on an atomic_t)
> Sorry, I meant down_write only temporarily in this code path. Not
> keeping it locked in write when returning (I remember there is a way to
> downgrade).

Hmm, that temporary write lock would still block new callers until the
previous ones finish with the downgraded-to-read lock.

But I guess something like this would work:

retry:
  if (atomic_read(...) == 0) {
    // zone_update... + drain
    atomic_inc(...);
  else if (atomic_inc_return == 1)
    // atomic_cmpxchg from 0 to 1; if that fails, goto retry

Tricky, but races could only lead to unnecessary duplicated updates + flushing
but nothing worse?

Or add another spinlock to cover this part instead of the temp write lock...
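
A rough sketch of that last variant, built on the functions from the patch
(the lock here is global and its name is made up; note that it would have to
be a mutex rather than a spinlock, because __drain_all_pages() flushes
workqueue items and can sleep):

static DEFINE_MUTEX(pcplist_disabled_lock);

void zone_pcplist_disable(struct zone *zone)
{
	down_read(&pcp_batch_high_lock);
	mutex_lock(&pcplist_disabled_lock);
	if (atomic_inc_return(&zone->pcplist_disabled) == 1) {
		zone_update_pageset_high_and_batch(zone, 0, 1);
		__drain_all_pages(zone, true);
	}
	mutex_unlock(&pcplist_disabled_lock);
}

void zone_pcplist_enable(struct zone *zone)
{
	mutex_lock(&pcplist_disabled_lock);
	if (atomic_dec_return(&zone->pcplist_disabled) == 0)
		zone_update_pageset_high_and_batch(zone, zone->pageset_high,
						   zone->pageset_batch);
	mutex_unlock(&pcplist_disabled_lock);
	up_read(&pcp_batch_high_lock);
}

With the mutex held across both the counter update and the high/batch update
plus drain, a second caller can only return from zone_pcplist_disable() after
the first one has actually disabled the pcplists and drained them.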



* Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-09-25 11:10       ` Vlastimil Babka
@ 2020-10-01  8:47         ` David Hildenbrand
  2020-10-05 14:05         ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2020-10-01  8:47 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador,
	Joonsoo Kim, Michal Hocko

On 25.09.20 13:10, Vlastimil Babka wrote:
> On 9/25/20 12:54 PM, David Hildenbrand wrote:
>>>> --- a/mm/page_isolation.c
>>>> +++ b/mm/page_isolation.c
>>>> @@ -15,6 +15,22 @@
>>>>  #define CREATE_TRACE_POINTS
>>>>  #include <trace/events/page_isolation.h>
>>>>  
>>>> +void zone_pcplist_disable(struct zone *zone)
>>>> +{
>>>> +	down_read(&pcp_batch_high_lock);
>>>> +	if (atomic_inc_return(&zone->pcplist_disabled) == 1) {
>>>> +		zone_update_pageset_high_and_batch(zone, 0, 1);
>>>> +		__drain_all_pages(zone, true);
>>>> +	}
>>> Hm, if one CPU is still inside the if-clause, the other one would
>>> continue, however pcp would not be disabled and zones not drained when
>>> returning.
> 
> Ah, well spotted, thanks!
> 
>>> (while we only allow a single Offline_pages() call, it will be different
>>> when we use the function in other context - especially,
>>> alloc_contig_range() for some users)
>>>
>>> Can't we use down_write() here? So it's serialized and everybody has to
>>> properly wait. (and we would not have to rely on an atomic_t)
>> Sorry, I meant down_write only temporarily in this code path. Not
>> keeping it locked in write when returning (I remember there is a way to
>> downgrade).
> 
> Hmm, that temporary write lock would still block new callers until the
> previous ones finish with the downgraded-to-read lock.
> 
> But I guess something like this would work:
> 
> retry:
>   if (atomic_read(...) == 0) {
>     // zone_update... + drain
>     atomic_inc(...);
>   else if (atomic_inc_return == 1)
>     // atomic_cmpxchg from 0 to 1; if that fails, goto retry
> 
> Tricky, but races could only lead to unnecessary duplicated updates + flushing
> but nothing worse?
> 
> Or add another spinlock to cover this part instead of the temp write lock...

My gut feeling is that that would be the cleanest approach.


-- 
Thanks,

David / dhildenb




* Re: [PATCH 1/9] mm, page_alloc: clean up pageset high and batch update
  2020-09-22 14:37 ` [PATCH 1/9] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
  2020-09-25 10:18   ` David Hildenbrand
@ 2020-10-05 12:03   ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-10-05 12:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Tue 22-09-20 16:37:04, Vlastimil Babka wrote:
> The updates to pcplists' high and batch values are handled by multiple
> functions that make the calculations hard to follow. Consolidate everything
> to pageset_set_high_and_batch() and remove pageset_set_batch() and
> pageset_set_high() wrappers.
> 
> The only special case using one of the removed wrappers was:
> build_all_zonelists_init()
>   setup_pageset()
>     pageset_set_batch()
> which was hardcoding batch as 0, so we can just open-code a call to
> pageset_update() with constant parameters instead.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>

yes this looks better, the original code was really hard to follow.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 49 ++++++++++++++++++++-----------------------------
>  1 file changed, 20 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 60a0e94645a6..a163c5e561f2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5823,7 +5823,7 @@ static void build_zonelists(pg_data_t *pgdat)
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> +static void setup_pageset(struct per_cpu_pageset *p);
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -5891,7 +5891,7 @@ build_all_zonelists_init(void)
>  	 * (a chicken-egg dilemma).
>  	 */
>  	for_each_possible_cpu(cpu)
> -		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> +		setup_pageset(&per_cpu(boot_pageset, cpu));
>  
>  	mminit_verify_zonelist();
>  	cpuset_init_current_mems_allowed();
> @@ -6200,12 +6200,6 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
>  	pcp->batch = batch;
>  }
>  
> -/* a companion to pageset_set_high() */
> -static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
> -{
> -	pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
> -}
> -
>  static void pageset_init(struct per_cpu_pageset *p)
>  {
>  	struct per_cpu_pages *pcp;
> @@ -6218,35 +6212,32 @@ static void pageset_init(struct per_cpu_pageset *p)
>  		INIT_LIST_HEAD(&pcp->lists[migratetype]);
>  }
>  
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
> +static void setup_pageset(struct per_cpu_pageset *p)
>  {
>  	pageset_init(p);
> -	pageset_set_batch(p, batch);
> +	pageset_update(&p->pcp, 0, 1);
>  }
>  
>  /*
> - * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
> - * to the value high for the pageset p.
> + * Calculate and set new high and batch values for given per-cpu pageset of a
> + * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
>   */
> -static void pageset_set_high(struct per_cpu_pageset *p,
> -				unsigned long high)
> -{
> -	unsigned long batch = max(1UL, high / 4);
> -	if ((high / 4) > (PAGE_SHIFT * 8))
> -		batch = PAGE_SHIFT * 8;
> -
> -	pageset_update(&p->pcp, high, batch);
> -}
> -
>  static void pageset_set_high_and_batch(struct zone *zone,
> -				       struct per_cpu_pageset *pcp)
> +				       struct per_cpu_pageset *p)
>  {
> -	if (percpu_pagelist_fraction)
> -		pageset_set_high(pcp,
> -			(zone_managed_pages(zone) /
> -				percpu_pagelist_fraction));
> -	else
> -		pageset_set_batch(pcp, zone_batchsize(zone));
> +	unsigned long new_high, new_batch;
> +
> +	if (percpu_pagelist_fraction) {
> +		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> +		new_batch = max(1UL, new_high / 4);
> +		if ((new_high / 4) > (PAGE_SHIFT * 8))
> +			new_batch = PAGE_SHIFT * 8;
> +	} else {
> +		new_batch = zone_batchsize(zone);
> +		new_high = 6 * new_batch;
> +		new_batch = max(1UL, 1 * new_batch);
> +	}
> +	pageset_update(&p->pcp, new_high, new_batch);
>  }
>  
>  static void __meminit zone_pageset_init(struct zone *zone, int cpu)
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 2/9] mm, page_alloc: calculate pageset high and batch once per zone
  2020-09-22 14:37 ` [PATCH 2/9] mm, page_alloc: calculate pageset high and batch once per zone Vlastimil Babka
@ 2020-10-05 12:52   ` Michal Hocko
  2020-10-06 22:04     ` Vlastimil Babka
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-10-05 12:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Tue 22-09-20 16:37:05, Vlastimil Babka wrote:
> We currently call pageset_set_high_and_batch() for each possible cpu, which
> repeats the same calculations of high and batch values.
> 
> Instead call the function just once per zone, and make it apply the calculated
> values to all per-cpu pagesets of the zone.
> 
> This also allows removing the zone_pageset_init() and __zone_pcp_update()
> wrappers.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: David Hildenbrand <david@redhat.com>

I like this. One question below
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 42 ++++++++++++++++++------------------------
>  1 file changed, 18 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a163c5e561f2..26069c8d1b19 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6219,13 +6219,14 @@ static void setup_pageset(struct per_cpu_pageset *p)
>  }
>  
>  /*
> - * Calculate and set new high and batch values for given per-cpu pageset of a
> + * Calculate and set new high and batch values for all per-cpu pagesets of a
>   * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
>   */
> -static void pageset_set_high_and_batch(struct zone *zone,
> -				       struct per_cpu_pageset *p)
> +static void zone_set_pageset_high_and_batch(struct zone *zone)
>  {
>  	unsigned long new_high, new_batch;
> +	struct per_cpu_pageset *p;
> +	int cpu;
>  
>  	if (percpu_pagelist_fraction) {
>  		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> @@ -6237,23 +6238,25 @@ static void pageset_set_high_and_batch(struct zone *zone,
>  		new_high = 6 * new_batch;
>  		new_batch = max(1UL, 1 * new_batch);
>  	}
> -	pageset_update(&p->pcp, new_high, new_batch);
> -}
> -
> -static void __meminit zone_pageset_init(struct zone *zone, int cpu)
> -{
> -	struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
>  
> -	pageset_init(pcp);
> -	pageset_set_high_and_batch(zone, pcp);
> +	for_each_possible_cpu(cpu) {
> +		p = per_cpu_ptr(zone->pageset, cpu);
> +		pageset_update(&p->pcp, new_high, new_batch);
> +	}
>  }
>  
>  void __meminit setup_zone_pageset(struct zone *zone)
>  {
> +	struct per_cpu_pageset *p;
>  	int cpu;
> +
>  	zone->pageset = alloc_percpu(struct per_cpu_pageset);
> -	for_each_possible_cpu(cpu)
> -		zone_pageset_init(zone, cpu);
> +	for_each_possible_cpu(cpu) {
> +		p = per_cpu_ptr(zone->pageset, cpu);
> +		pageset_init(p);
> +	}
> +
> +	zone_set_pageset_high_and_batch(zone);

I hope I am not misreading the diff but it seems that setup_zone_pageset
is calling pageset_init which is then done again by
zone_set_pageset_high_and_batch as a part of pageset_update

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/9] mm, page_alloc: remove setup_pageset()
  2020-09-22 14:37 ` [PATCH 3/9] mm, page_alloc: remove setup_pageset() Vlastimil Babka
  2020-09-25 10:19   ` David Hildenbrand
@ 2020-10-05 12:59   ` Michal Hocko
  2020-10-06 22:11     ` Vlastimil Babka
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-10-05 12:59 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Tue 22-09-20 16:37:06, Vlastimil Babka wrote:
> We initialize boot-time pagesets with setup_pageset(), which sets high and
> batch values that effectively disable pcplists.
> 
> We can remove this wrapper if we just set these values for all pagesets in
> pageset_init(). Non-boot pagesets then subsequently update them to the proper
> values.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 26069c8d1b19..76c2b4578723 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5823,7 +5823,7 @@ static void build_zonelists(pg_data_t *pgdat)
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static void setup_pageset(struct per_cpu_pageset *p);
> +static void pageset_init(struct per_cpu_pageset *p);
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -5891,7 +5891,7 @@ build_all_zonelists_init(void)
>  	 * (a chicken-egg dilemma).
>  	 */
>  	for_each_possible_cpu(cpu)
> -		setup_pageset(&per_cpu(boot_pageset, cpu));
> +		pageset_init(&per_cpu(boot_pageset, cpu));
>  
>  	mminit_verify_zonelist();
>  	cpuset_init_current_mems_allowed();
> @@ -6210,12 +6210,15 @@ static void pageset_init(struct per_cpu_pageset *p)
>  	pcp = &p->pcp;
>  	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
>  		INIT_LIST_HEAD(&pcp->lists[migratetype]);
> -}
>  
> -static void setup_pageset(struct per_cpu_pageset *p)
> -{
> -	pageset_init(p);
> -	pageset_update(&p->pcp, 0, 1);
> +	/*
> +	 * Set batch and high values safe for a boot pageset. A true percpu
> +	 * pageset's initialization will update them subsequently. Here we don't
> +	 * need to be as careful as pageset_update() as nobody can access the
> +	 * pageset yet.

Isn't this slightly misleading? pageset_init is called from setup_zone_pageset
which is called from the memory hotplug as well. Isn't this more about
early zone initialization rather than boot pagesets? Or am I misreading
the patch?

> +	 */
> +	pcp->high = 0;
> +	pcp->batch = 1;
>  }
>  
>  /*
> -- 
> 2.28.0
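
Why high = 0 together with batch = 1 "effectively disables pcplists" can
be seen from the flush condition in the free path; a toy userspace model
of just that decision (not the kernel code, names only loosely mirror it):

#include <stdio.h>

struct toy_pcp {
	int count;	/* pages currently on the per-cpu list */
	int high;	/* flush threshold */
	int batch;	/* how many pages one flush returns to the buddy */
};

static void flush_to_buddy(struct toy_pcp *pcp, int nr)
{
	if (nr > pcp->count)
		nr = pcp->count;
	pcp->count -= nr;
	printf("flushed %d page(s), %d left on the pcplist\n", nr, pcp->count);
}

static void free_to_pcp(struct toy_pcp *pcp)
{
	pcp->count++;			/* page added to the per-cpu list */
	if (pcp->count >= pcp->high)	/* always true once high == 0 */
		flush_to_buddy(pcp, pcp->batch);
}

int main(void)
{
	struct toy_pcp pcp = { .count = 0, .high = 0, .batch = 1 };
	int i;

	for (i = 0; i < 3; i++)
		free_to_pcp(&pcp);	/* every free drains immediately */
	return 0;
}

With the boot values nothing ever accumulates on the list, which is the
property the boot pagesets rely on.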

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/9] mm, page_alloc: simplify pageset_update()
  2020-09-22 14:37 ` [PATCH 4/9] mm, page_alloc: simplify pageset_update() Vlastimil Babka
  2020-09-25 10:23   ` David Hildenbrand
@ 2020-10-05 13:20   ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-10-05 13:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Tue 22-09-20 16:37:07, Vlastimil Babka wrote:
> pageset_update() attempts to update pcplist's high and batch values in a way
> that readers don't observe batch > high. It uses smp_wmb() to order the updates
> in a way to achieve this. However, without proper pairing read barriers in
> readers this guarantee doesn't hold, and there are no such barriers in
> e.g. free_unref_page_commit().
> 
> Commit 88e8ac11d2ea ("mm, page_alloc: fix core hung in free_pcppages_bulk()")
> already showed this is problematic, and solved this by ultimately only trusting
> pcp->count of the current cpu with interrupts disabled.
> 
> The update dance with unpaired write barriers thus makes no sense. Replace
> them with plain WRITE_ONCE to prevent store tearing, and document that the
> values can change asynchronously and should not be trusted for correctness.
> 
> All current readers appear to be OK after 88e8ac11d2ea. Convert them to
> READ_ONCE to prevent unnecessary read tearing, but mainly to alert anybody
> making future changes to the code that special care is needed.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Yes, this should be safe AFAICS. I believe the original intention was
well meant but didn't go all the way to do the thing properly.

I have to admit I have stumbled over this weirdness a few times and never
found enough motivation to think it through.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 40 ++++++++++++++++++----------------------
>  1 file changed, 18 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 76c2b4578723..99b74c1c2b0a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1297,7 +1297,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  {
>  	int migratetype = 0;
>  	int batch_free = 0;
> -	int prefetch_nr = 0;
> +	int prefetch_nr = READ_ONCE(pcp->batch);
>  	bool isolated_pageblocks;
>  	struct page *page, *tmp;
>  	LIST_HEAD(head);
> @@ -1348,8 +1348,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  			 * avoid excessive prefetching due to large count, only
>  			 * prefetch buddy for the first pcp->batch nr of pages.
>  			 */
> -			if (prefetch_nr++ < pcp->batch)
> +			if (prefetch_nr) {
>  				prefetch_buddy(page);
> +				prefetch_nr--;
> +			}
>  		} while (--count && --batch_free && !list_empty(list));
>  	}
>  
> @@ -3131,10 +3133,8 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>  	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	list_add(&page->lru, &pcp->lists[migratetype]);
>  	pcp->count++;
> -	if (pcp->count >= pcp->high) {
> -		unsigned long batch = READ_ONCE(pcp->batch);
> -		free_pcppages_bulk(zone, batch, pcp);
> -	}
> +	if (pcp->count >= READ_ONCE(pcp->high))
> +		free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
>  }
>  
>  /*
> @@ -3318,7 +3318,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
>  	do {
>  		if (list_empty(list)) {
>  			pcp->count += rmqueue_bulk(zone, 0,
> -					pcp->batch, list,
> +					READ_ONCE(pcp->batch), list,
>  					migratetype, alloc_flags);
>  			if (unlikely(list_empty(list)))
>  				return NULL;
> @@ -6174,13 +6174,16 @@ static int zone_batchsize(struct zone *zone)
>  }
>  
>  /*
> - * pcp->high and pcp->batch values are related and dependent on one another:
> - * ->batch must never be higher then ->high.
> - * The following function updates them in a safe manner without read side
> - * locking.
> + * pcp->high and pcp->batch values are related and generally batch is lower
> + * than high. They are also related to pcp->count such that count is lower
> + * than high, and as soon as it reaches high, the pcplist is flushed.
>   *
> - * Any new users of pcp->batch and pcp->high should ensure they can cope with
> - * those fields changing asynchronously (acording to the above rule).
> + * However, guaranteeing these relations at all times would require e.g. write
> + * barriers here but also careful usage of read barriers at the read side, and
> + * thus be prone to error and bad for performance. Thus the update only prevents
> + * store tearing. Any new users of pcp->batch and pcp->high should ensure they
> + * can cope with those fields changing asynchronously, and fully trust only the
> + * pcp->count field on the local CPU with interrupts disabled.
>   *
>   * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
>   * outside of boot time (or some other assurance that no concurrent updaters
> @@ -6189,15 +6192,8 @@ static int zone_batchsize(struct zone *zone)
>  static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
>  		unsigned long batch)
>  {
> -       /* start with a fail safe value for batch */
> -	pcp->batch = 1;
> -	smp_wmb();
> -
> -       /* Update high, then batch, in order */
> -	pcp->high = high;
> -	smp_wmb();
> -
> -	pcp->batch = batch;
> +	WRITE_ONCE(pcp->batch, batch);
> +	WRITE_ONCE(pcp->high, high);
>  }
>  
>  static void pageset_init(struct per_cpu_pageset *p)
> -- 
> 2.28.0
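
Roughly, the pattern after the patch, translated to userspace with relaxed
C11 atomics as the closest analogue of WRITE_ONCE()/READ_ONCE() (a sketch
for illustration only, not the kernel code):

#include <stdatomic.h>
#include <stdio.h>

/* stand-ins for pcp->high and pcp->batch of one pageset */
static atomic_ulong pcp_high;
static atomic_ulong pcp_batch;

/* analogue of the new pageset_update(): no ordering, just no tearing */
static void pageset_update(unsigned long high, unsigned long batch)
{
	atomic_store_explicit(&pcp_batch, batch, memory_order_relaxed);
	atomic_store_explicit(&pcp_high, high, memory_order_relaxed);
}

/* a reader snapshots each value once and must not assume batch <= high */
static void free_path(unsigned long count)
{
	unsigned long high = atomic_load_explicit(&pcp_high,
						  memory_order_relaxed);
	unsigned long batch = atomic_load_explicit(&pcp_batch,
						   memory_order_relaxed);

	if (count >= high)
		printf("flush %lu page(s)\n", batch);
}

int main(void)
{
	pageset_update(378, 63);	/* made-up values */
	free_path(400);
	return 0;
}

The point is only that each side reads or writes the fields exactly once
and never derives correctness from their mutual ordering.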

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm, page_alloc: make per_cpu_pageset accessible only after init
  2020-09-22 14:37 ` [PATCH 5/9] mm, page_alloc: make per_cpu_pageset accessible only after init Vlastimil Babka
  2020-09-25 10:25   ` David Hildenbrand
@ 2020-10-05 13:24   ` Michal Hocko
  2020-10-06 22:28     ` Vlastimil Babka
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-10-05 13:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Tue 22-09-20 16:37:08, Vlastimil Babka wrote:
> setup_zone_pageset() replaces the boot_pageset by allocating and initializing a
> proper percpu one. Currently it assigns zone->pageset with the newly allocated
> one before initializing it. That's currently not an issue, because the zone
> should not be in any zonelist, thus not visible to allocators at this point.
> 
> Memory ordering between the pcplist contents and its visibility is also not
> guaranteed here, but that also shouldn't be an issue because online_pages()
> does a spin_unlock(pgdat->node_size_lock) before building the zonelists.
> 
> However it's best that we don't silently rely on operations that can be changed
> in the future. Make sure only properly initialized pcplists are visible, using
> smp_store_release(). The read side has a data dependency via the zone->pageset
> pointer instead of an explicit read barrier.

Heh, this looks like inventing a trap similar to the one the previous
patch was removing. But more seriously, considering that we need locking
for the whole setup, wouldn't it be better to simply document the locking
requirements rather than adding scary looking barriers that our future
selves or somebody else will need to scratch their heads about? I am
pretty sure we don't do anything like that when initializing a numa node
or other data structures that might be allocated during memory hotadd.

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 99b74c1c2b0a..de3b48bda45c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6246,15 +6246,17 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>  
>  void __meminit setup_zone_pageset(struct zone *zone)
>  {
> +	struct per_cpu_pageset __percpu * new_pageset;
>  	struct per_cpu_pageset *p;
>  	int cpu;
>  
> -	zone->pageset = alloc_percpu(struct per_cpu_pageset);
> +	new_pageset = alloc_percpu(struct per_cpu_pageset);
>  	for_each_possible_cpu(cpu) {
> -		p = per_cpu_ptr(zone->pageset, cpu);
> +		p = per_cpu_ptr(new_pageset, cpu);
>  		pageset_init(p);
>  	}
>  
> +	smp_store_release(&zone->pageset, new_pageset);
>  	zone_set_pageset_high_and_batch(zone);
>  }
>  
> -- 
> 2.28.0
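
In userspace terms the changelog describes the usual "initialize fully,
then publish with a release store" idiom; a sketch with C11 atomics, where
an acquire load stands in for the address-dependency ordering mentioned
above (illustrative only, not the kernel code):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct pageset {
	int high;
	int batch;
};

static _Atomic(struct pageset *) zone_pageset;

/* writer: fully initialize the pageset, then publish the pointer */
static void setup_zone_pageset(void)
{
	struct pageset *p = malloc(sizeof(*p));

	if (!p)
		return;
	p->high = 0;
	p->batch = 1;
	atomic_store_explicit(&zone_pageset, p, memory_order_release);
}

/* reader: sees either NULL or a fully initialized pageset */
static void allocator(void)
{
	struct pageset *p = atomic_load_explicit(&zone_pageset,
						 memory_order_acquire);

	if (p)
		printf("high=%d batch=%d\n", p->high, p->batch);
}

int main(void)
{
	setup_zone_pageset();
	allocator();
	return 0;
}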

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone
  2020-09-22 14:37 ` [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone Vlastimil Babka
  2020-09-25 10:34   ` David Hildenbrand
@ 2020-10-05 13:28   ` Michal Hocko
  2020-10-06 22:34     ` Vlastimil Babka
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-10-05 13:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Tue 22-09-20 16:37:09, Vlastimil Babka wrote:
> All per-cpu pagesets for a zone use the same high and batch values, that are
> duplicated there just for performance (locality) reasons. This patch adds the
> same variables also to struct zone as a shared copy.
> 
> This will be useful later for making it possible to disable pcplists temporarily
> by setting high value to 0, while remembering the values for restoring them
> later. But we can also immediately benefit from not updating pagesets of all
> possible cpus in case the newly recalculated values (after sysctl change or
> memory online/offline) are actually unchanged from the previous ones.

The advantage of this patch is not really clear in isolation. Maybe
merge it with the patch which uses the duplicated state.

> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/mmzone.h |  6 ++++++
>  mm/page_alloc.c        | 16 ++++++++++++++--
>  2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 90721f3156bc..7ad3f14dbe88 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -470,6 +470,12 @@ struct zone {
>  #endif
>  	struct pglist_data	*zone_pgdat;
>  	struct per_cpu_pageset __percpu *pageset;
> +	/*
> +	 * the high and batch values are copied to individual pagesets for
> +	 * faster access
> +	 */
> +	int pageset_high;
> +	int pageset_batch;
>  
>  #ifndef CONFIG_SPARSEMEM
>  	/*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index de3b48bda45c..901907799bdc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5824,6 +5824,8 @@ static void build_zonelists(pg_data_t *pgdat)
>   * Other parts of the kernel may not check if the zone is available.
>   */
>  static void pageset_init(struct per_cpu_pageset *p);
> +#define BOOT_PAGESET_HIGH	0
> +#define BOOT_PAGESET_BATCH	1
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -6213,8 +6215,8 @@ static void pageset_init(struct per_cpu_pageset *p)
>  	 * need to be as careful as pageset_update() as nobody can access the
>  	 * pageset yet.
>  	 */
> -	pcp->high = 0;
> -	pcp->batch = 1;
> +	pcp->high = BOOT_PAGESET_HIGH;
> +	pcp->batch = BOOT_PAGESET_BATCH;
>  }
>  
>  /*
> @@ -6238,6 +6240,14 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>  		new_batch = max(1UL, 1 * new_batch);
>  	}
>  
> +	if (zone->pageset_high != new_high ||
> +	    zone->pageset_batch != new_batch) {
> +		zone->pageset_high = new_high;
> +		zone->pageset_batch = new_batch;
> +	} else {
> +		return;
> +	}
> +
>  	for_each_possible_cpu(cpu) {
>  		p = per_cpu_ptr(zone->pageset, cpu);
>  		pageset_update(&p->pcp, new_high, new_batch);
> @@ -6300,6 +6310,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
>  	 * offset of a (static) per cpu variable into the per cpu area.
>  	 */
>  	zone->pageset = &boot_pageset;
> +	zone->pageset_high = BOOT_PAGESET_HIGH;
> +	zone->pageset_batch = BOOT_PAGESET_BATCH;
>  
>  	if (populated_zone(zone))
>  		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/9] mm, page_alloc: move draining pcplists to page isolation users
  2020-09-22 14:37 ` [PATCH 7/9] mm, page_alloc: move draining pcplists to page isolation users Vlastimil Babka
  2020-09-25 10:39   ` David Hildenbrand
@ 2020-10-05 13:57   ` Michal Hocko
  1 sibling, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-10-05 13:57 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Tue 22-09-20 16:37:10, Vlastimil Babka wrote:
> Currently, pcplists are drained during set_migratetype_isolate() which means
> once per pageblock processed by start_isolate_page_range(). This is somewhat
> wasteful. Moreover, the callers might need different guarantees, and the
> draining is currently prone to races and does not guarantee that no page
> from isolated pageblock will end up on the pcplist after the drain.
> 
> Better guarantees are added by later patches and require explicit actions
> by page isolation users that need them. Thus it makes sense to move the
> current imperfect draining to the callers also as a preparation step.
> 
> Suggested-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memory_hotplug.c | 11 ++++++-----
>  mm/page_alloc.c     |  2 ++
>  mm/page_isolation.c | 10 +++++-----
>  3 files changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9db80ee29caa..08f729922e18 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1524,6 +1524,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  		goto failed_removal;
>  	}
>  
> +	drain_all_pages(zone);
> +
>  	arg.start_pfn = start_pfn;
>  	arg.nr_pages = nr_pages;
>  	node_states_check_changes_offline(nr_pages, zone, &arg);
> @@ -1574,11 +1576,10 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  		}
>  
>  		/*
> -		 * per-cpu pages are drained in start_isolate_page_range, but if
> -		 * there are still pages that are not free, make sure that we
> -		 * drain again, because when we isolated range we might
> -		 * have raced with another thread that was adding pages to pcp
> -		 * list.
> +		 * per-cpu pages are drained after start_isolate_page_range, but
> +		 * if there are still pages that are not free, make sure that we
> +		 * drain again, because when we isolated range we might have
> +		 * raced with another thread that was adding pages to pcp list.
>  		 *
>  		 * Forward progress should be still guaranteed because
>  		 * pages on the pcp list can only belong to MOVABLE_ZONE
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 901907799bdc..4e37bc3f6077 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8432,6 +8432,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	if (ret)
>  		return ret;
>  
> +	drain_all_pages(cc.zone);
> +
>  	/*
>  	 * In case of -EBUSY, we'd like to know which page causes problem.
>  	 * So, just fall through. test_pages_isolated() has a tracepoint
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index abfe26ad59fd..57d8bc1e7760 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -49,7 +49,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>  
>  		__mod_zone_freepage_state(zone, -nr_pages, mt);
>  		spin_unlock_irqrestore(&zone->lock, flags);
> -		drain_all_pages(zone);
>  		return 0;
>  	}
>  
> @@ -167,11 +166,12 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   *
>   * Please note that there is no strong synchronization with the page allocator
>   * either. Pages might be freed while their page blocks are marked ISOLATED.
> - * In some cases pages might still end up on pcp lists and that would allow
> + * A call to drain_all_pages() after isolation can flush most of them. However
> + * in some cases pages might still end up on pcp lists and that would allow
>   * for their allocation even when they are in fact isolated already. Depending
> - * on how strong of a guarantee the caller needs drain_all_pages might be needed
> - * (e.g. __offline_pages will need to call it after check for isolated range for
> - * a next retry).
> + * on how strong of a guarantee the caller needs, further drain_all_pages()
> + * might be needed (e.g. __offline_pages will need to call it after check for
> + * isolated range for a next retry).
>   *
>   * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
>   */
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/9] mm, page_alloc: drain all pcplists during memory offline
  2020-09-25 10:46   ` David Hildenbrand
@ 2020-10-05 14:03     ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-10-05 14:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, linux-mm, linux-kernel, Pavel Tatashin,
	Oscar Salvador, Joonsoo Kim

On Fri 25-09-20 12:46:27, David Hildenbrand wrote:
> On 22.09.20 16:37, Vlastimil Babka wrote:
[...]
> > +/*
> > + * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> > + *
> > + * When zone parameter is non-NULL, spill just the single zone's pages.
> > + *
> > + * Note that this can be extremely slow as the draining happens in a workqueue.
> > + */
> > +void drain_all_pages(struct zone *zone)
> > +{
> > +	__drain_all_pages(zone, false);
> > +}
> > +
> >  #ifdef CONFIG_HIBERNATION
> >  
> >  /*
> > 
> 
> Interesting race. Instead of this ugly __drain_all_pages() with a
> boolean parameter, can we have two properly named functions to be used
> in !page_alloc.c code without having to scratch your head about what the difference is?

I tend to agree here. I would even fold this into the next patch because
the disable/enable interface is much more manageable.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-09-25 11:10       ` Vlastimil Babka
  2020-10-01  8:47         ` David Hildenbrand
@ 2020-10-05 14:05         ` Michal Hocko
  2020-10-05 14:22           ` Vlastimil Babka
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-10-05 14:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Hildenbrand, linux-mm, linux-kernel, Pavel Tatashin,
	Oscar Salvador, Joonsoo Kim

On Fri 25-09-20 13:10:05, Vlastimil Babka wrote:
> On 9/25/20 12:54 PM, David Hildenbrand wrote:
> >>> --- a/mm/page_isolation.c
> >>> +++ b/mm/page_isolation.c
> >>> @@ -15,6 +15,22 @@
> >>>  #define CREATE_TRACE_POINTS
> >>>  #include <trace/events/page_isolation.h>
> >>>  
> >>> +void zone_pcplist_disable(struct zone *zone)
> >>> +{
> >>> +	down_read(&pcp_batch_high_lock);
> >>> +	if (atomic_inc_return(&zone->pcplist_disabled) == 1) {
> >>> +		zone_update_pageset_high_and_batch(zone, 0, 1);
> >>> +		__drain_all_pages(zone, true);
> >>> +	}
> >> Hm, if one CPU is still inside the if-clause, the other one would
> >>> continue, however pcp would not be disabled and zones not drained when
> >> returning.
> 
> Ah, well spotted, thanks!
> 
>> (while we only allow a single offline_pages() call, it will be different
>> when we use the function in other contexts - especially,
> >> alloc_contig_range() for some users)
> >>
> >> Can't we use down_write() here? So it's serialized and everybody has to
> >> properly wait. (and we would not have to rely on an atomic_t)
> > Sorry, I meant down_write only temporarily in this code path. Not
> > keeping it locked in write when returning (I remember there is a way to
> > downgrade).
> 
> Hmm, that temporary write lock would still block new callers until the
> previous ones finish with the downgraded-to-read lock.
> 
> But I guess something like this would work:
> 
> retry:
>   if (atomic_read(...) == 0) {
>     // zone_update... + drain
>     atomic_inc(...);
>   else if (atomic_inc_return == 1)
>     // atomic_cmpxchg from 0 to 1; if that fails, goto retry
> 
> Tricky, but races could only lead to unnecessary duplicated updates + flushing
> but nothing worse?
> 
> Or add another spinlock to cover this part instead of the temp write lock...

Do you plan to post a new version or should I review this one?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-10-05 14:05         ` Michal Hocko
@ 2020-10-05 14:22           ` Vlastimil Babka
  2020-10-05 16:56             ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2020-10-05 14:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-mm, linux-kernel, Pavel Tatashin,
	Oscar Salvador, Joonsoo Kim

On 10/5/20 4:05 PM, Michal Hocko wrote:
> On Fri 25-09-20 13:10:05, Vlastimil Babka wrote:
>> On 9/25/20 12:54 PM, David Hildenbrand wrote:
>>
>> Hmm, that temporary write lock would still block new callers until the
>> previous ones finish with the downgraded-to-read lock.
>>
>> But I guess something like this would work:
>>
>> retry:
>>   if (atomic_read(...) == 0) {
>>     // zone_update... + drain
>>     atomic_inc(...);
>>   else if (atomic_inc_return == 1)
>>     // atomic_cmpxchg from 0 to 1; if that fails, goto retry
>>
>> Tricky, but races could only lead to unnecessary duplicated updates + flushing
>> but nothing worse?
>>
>> Or add another spinlock to cover this part instead of the temp write lock...
> 
> Do you plan to post a new version or should I review this one?

I will, hopefully this week, but you could comment on other details and
overall approach meanwhile? I don't think it will change significantly.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-10-05 14:22           ` Vlastimil Babka
@ 2020-10-05 16:56             ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-10-05 16:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Hildenbrand, linux-mm, linux-kernel, Pavel Tatashin,
	Oscar Salvador, Joonsoo Kim

On Mon 05-10-20 16:22:46, Vlastimil Babka wrote:
> On 10/5/20 4:05 PM, Michal Hocko wrote:
> > On Fri 25-09-20 13:10:05, Vlastimil Babka wrote:
> >> On 9/25/20 12:54 PM, David Hildenbrand wrote:
> >>
> >> Hmm, that temporary write lock would still block new callers until the
> >> previous ones finish with the downgraded-to-read lock.
> >>
> >> But I guess something like this would work:
> >>
> >> retry:
> >>   if (atomic_read(...) == 0) {
> >>     // zone_update... + drain
> >>     atomic_inc(...);
> >>   else if (atomic_inc_return == 1)
> >>     // atomic_cmpxchg from 0 to 1; if that fails, goto retry
> >>
> >> Tricky, but races could only lead to unnecessary duplicated updates + flushing
> >> but nothing worse?
> >>
> >> Or add another spinlock to cover this part instead of the temp write lock...
> > 
> > Do you plan to post a new version or should I review this one?
> 
> I will, hopefully this week, but you could comment on other details and
> overall approach meanwhile? I don't think it will change significantly.

OK. I will have a look tomorrow.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-09-22 14:37 ` [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation Vlastimil Babka
  2020-09-25 10:53   ` David Hildenbrand
@ 2020-10-06  8:34   ` Michal Hocko
  2020-10-06  8:40     ` David Hildenbrand
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2020-10-06  8:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Tue 22-09-20 16:37:12, Vlastimil Babka wrote:
> Page isolation can race with process freeing pages to pcplists in a way that
> a page from isolated pageblock can end up on pcplist. This can be fixed by
> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
> per-cpu pages again during memory offline" in [1].
> 
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation who need stronger guarantees don't need to repeatedly drain.
> David suggested disabling pcplists usage completely during page isolation,
> instead of repeatedly draining them.
> 
> To achieve this without adding special cases in alloc/free fastpath, we can use
> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
> 
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once, before calling start_isolate_page_range(). The draining will serialize
> after processes that already disabled interrupts and read the old value of
> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> happens.
> 
> This patch thus adds zone_pcplist_disable() and zone_pcplist_enable() functions
> that page isolation users can call before start_isolate_page_range() and after
> unisolating (or offlining) the isolated pages. A new zone->pcplist_disabled
> atomic variable makes sure we disable pcplists only once and don't enable
> them prematurely in case there are multiple users in parallel.
> 
> We however have to avoid external updates to high and batch by taking
> pcp_batch_high_lock. To allow multiple isolations in parallel, change this lock
> from mutex to rwsem.

The overall idea makes sense. I just suspect you are overcomplicating
the implementation a bit. Is there any reason that we cannot start with
a really dumb implementation first? The only user of this functionality
is the memory offlining and that is already strongly synchronized
(mem_hotplug_begin) so a lot of trickery can be dropped here. Should we
find a new user later on we can make the implementation finer grained
but now it will not serve any purpose. So can we simply update pcp->high
and drain all pcp in the given zone and wait for all remote pcp draining
in zone_pcplist_disable and revert all that in zone_pcplist_enable?
We can stick to the existing pcp_batch_high_lock.

What do you think?

Btw. zone_pcplist_{enable,disable} would benefit from documentation
explaining the purpose and guarantees, and the side effect on
performance. I would even stick this interface into internal.h

> Currently the only user of this functionality is offline_pages().
> 
> [1] https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatashin@soleen.com/
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/mmzone.h         |  2 ++
>  include/linux/page-isolation.h |  2 ++
>  mm/internal.h                  |  4 ++++
>  mm/memory_hotplug.c            | 28 ++++++++++++----------------
>  mm/page_alloc.c                | 29 ++++++++++++++++++-----------
>  mm/page_isolation.c            | 22 +++++++++++++++++++---
>  6 files changed, 57 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7ad3f14dbe88..3c653d6348b1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -536,6 +536,8 @@ struct zone {
>  	 * of pageblock. Protected by zone->lock.
>  	 */
>  	unsigned long		nr_isolate_pageblock;
> +	/* Track requests for disabling pcplists */
> +	atomic_t		pcplist_disabled;
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 572458016331..1dec5d0f62a6 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
>  void set_pageblock_migratetype(struct page *page, int migratetype);
>  int move_freepages_block(struct zone *zone, struct page *page,
>  				int migratetype, int *num_movable);
> +void zone_pcplist_disable(struct zone *zone);
> +void zone_pcplist_enable(struct zone *zone);
>  
>  /*
>   * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
> diff --git a/mm/internal.h b/mm/internal.h
> index ea66ef45da6c..154e38e702dd 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -7,6 +7,7 @@
>  #ifndef __MM_INTERNAL_H
>  #define __MM_INTERNAL_H
>  
> +#include <linux/rwsem.h>
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
> @@ -199,8 +200,11 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
>  					gfp_t gfp_flags);
>  extern int user_min_free_kbytes;
>  
> +extern struct rw_semaphore pcp_batch_high_lock;
>  extern void zone_pcp_update(struct zone *zone);
>  extern void zone_pcp_reset(struct zone *zone);
> +extern void zone_update_pageset_high_and_batch(struct zone *zone,
> +		unsigned long high, unsigned long batch);
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index bbde415b558b..6fe9be550160 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1515,17 +1515,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	}
>  	node = zone_to_nid(zone);
>  
> +	/*
> +	 * Disable pcplists so that page isolation cannot race with freeing
> +	 * in a way that pages from isolated pageblock are left on pcplists.
> +	 */
> +	zone_pcplist_disable(zone);
> +
>  	/* set above range as isolated */
>  	ret = start_isolate_page_range(start_pfn, end_pfn,
>  				       MIGRATE_MOVABLE,
>  				       MEMORY_OFFLINE | REPORT_FAILURE);
>  	if (ret) {
>  		reason = "failure to isolate range";
> -		goto failed_removal;
> +		goto failed_removal_pcplists_disabled;
>  	}
>  
> -	__drain_all_pages(zone, true);
> -
>  	arg.start_pfn = start_pfn;
>  	arg.nr_pages = nr_pages;
>  	node_states_check_changes_offline(nr_pages, zone, &arg);
> @@ -1575,20 +1579,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  			goto failed_removal_isolated;
>  		}
>  
> -		/*
> -		 * per-cpu pages are drained after start_isolate_page_range, but
> -		 * if there are still pages that are not free, make sure that we
> -		 * drain again, because when we isolated range we might have
> -		 * raced with another thread that was adding pages to pcp list.
> -		 *
> -		 * Forward progress should be still guaranteed because
> -		 * pages on the pcp list can only belong to MOVABLE_ZONE
> -		 * because has_unmovable_pages explicitly checks for
> -		 * PageBuddy on freed pages on other zones.
> -		 */
>  		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> -		if (ret)
> -			__drain_all_pages(zone, true);
> +
>  	} while (ret);
>  
>  	/* Mark all sections offline and remove free pages from the buddy. */
> @@ -1604,6 +1596,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
> +	zone_pcplist_enable(zone);
> +
>  	/* removal success */
>  	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
>  	zone->present_pages -= nr_pages;
> @@ -1637,6 +1631,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  failed_removal_isolated:
>  	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>  	memory_notify(MEM_CANCEL_OFFLINE, &arg);
> +failed_removal_pcplists_disabled:
> +	zone_pcplist_enable(zone);
>  failed_removal:
>  	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
>  		 (unsigned long long) start_pfn << PAGE_SHIFT,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 33cc35d152b1..673d353f9311 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -78,7 +78,7 @@
>  #include "page_reporting.h"
>  
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> -static DEFINE_MUTEX(pcp_batch_high_lock);
> +DECLARE_RWSEM(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_FRACTION	(8)
>  
>  #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
> @@ -6230,6 +6230,18 @@ static void pageset_init(struct per_cpu_pageset *p)
>  	pcp->batch = BOOT_PAGESET_BATCH;
>  }
>  
> +void zone_update_pageset_high_and_batch(struct zone *zone, unsigned long high,
> +		unsigned long batch)
> +{
> +	struct per_cpu_pageset *p;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		p = per_cpu_ptr(zone->pageset, cpu);
> +		pageset_update(&p->pcp, high, batch);
> +	}
> +}
> +
>  /*
>   * Calculate and set new high and batch values for all per-cpu pagesets of a
>   * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
> @@ -6237,8 +6249,6 @@ static void pageset_init(struct per_cpu_pageset *p)
>  static void zone_set_pageset_high_and_batch(struct zone *zone)
>  {
>  	unsigned long new_high, new_batch;
> -	struct per_cpu_pageset *p;
> -	int cpu;
>  
>  	if (percpu_pagelist_fraction) {
>  		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> @@ -6259,10 +6269,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>  		return;
>  	}
>  
> -	for_each_possible_cpu(cpu) {
> -		p = per_cpu_ptr(zone->pageset, cpu);
> -		pageset_update(&p->pcp, new_high, new_batch);
> -	}
> +	zone_update_pageset_high_and_batch(zone, new_high, new_batch);
>  }
>  
>  void __meminit setup_zone_pageset(struct zone *zone)
> @@ -8024,7 +8031,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
>  	int old_percpu_pagelist_fraction;
>  	int ret;
>  
> -	mutex_lock(&pcp_batch_high_lock);
> +	down_write(&pcp_batch_high_lock);
>  	old_percpu_pagelist_fraction = percpu_pagelist_fraction;
>  
>  	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> @@ -8046,7 +8053,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
>  	for_each_populated_zone(zone)
>  		zone_set_pageset_high_and_batch(zone);
>  out:
> -	mutex_unlock(&pcp_batch_high_lock);
> +	up_write(&pcp_batch_high_lock);
>  	return ret;
>  }
>  
> @@ -8652,9 +8659,9 @@ EXPORT_SYMBOL(free_contig_range);
>   */
>  void __meminit zone_pcp_update(struct zone *zone)
>  {
> -	mutex_lock(&pcp_batch_high_lock);
> +	down_write(&pcp_batch_high_lock);
>  	zone_set_pageset_high_and_batch(zone);
> -	mutex_unlock(&pcp_batch_high_lock);
> +	up_write(&pcp_batch_high_lock);
>  }
>  
>  void zone_pcp_reset(struct zone *zone)
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 57d8bc1e7760..c0e895e8f322 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -15,6 +15,22 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/page_isolation.h>
>  
> +void zone_pcplist_disable(struct zone *zone)
> +{
> +	down_read(&pcp_batch_high_lock);
> +	if (atomic_inc_return(&zone->pcplist_disabled) == 1) {
> +		zone_update_pageset_high_and_batch(zone, 0, 1);
> +		__drain_all_pages(zone, true);
> +	}
> +}
> +
> +void zone_pcplist_enable(struct zone *zone)
> +{
> +	if (atomic_dec_return(&zone->pcplist_disabled) == 0)
> +		zone_update_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
> +	up_read(&pcp_batch_high_lock);
> +}
> +
>  static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags)
>  {
>  	struct zone *zone = page_zone(page);
> @@ -169,9 +185,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   * A call to drain_all_pages() after isolation can flush most of them. However
>   * in some cases pages might still end up on pcp lists and that would allow
>   * for their allocation even when they are in fact isolated already. Depending
> - * on how strong of a guarantee the caller needs, further drain_all_pages()
> - * might be needed (e.g. __offline_pages will need to call it after check for
> - * isolated range for a next retry).
> + * on how strong of a guarantee the caller needs, zone_pcplist_disable/enable()
> + * might be used to flush and disable pcplist before isolation and after
> + * unisolation.
>   *
>   * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
>   */
> -- 
> 2.28.0
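
For completeness, the "dumb" variant suggested above could be as small as
the sketch below, reusing only the helpers visible in the patch and keeping
the existing pcp_batch_high_lock as a plain mutex held for the whole
disabled section (untested, just to show the shape, not a posted patch):

/* sketch only: rely on mem_hotplug_begin() + the existing mutex */
void zone_pcplist_disable(struct zone *zone)
{
	mutex_lock(&pcp_batch_high_lock);
	zone_update_pageset_high_and_batch(zone, 0, 1);
	__drain_all_pages(zone, true);
}

void zone_pcplist_enable(struct zone *zone)
{
	zone_update_pageset_high_and_batch(zone, zone->pageset_high,
					   zone->pageset_batch);
	mutex_unlock(&pcp_batch_high_lock);
}

No refcounting, no rwsem conversion; a second user would simply wait on
the mutex.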

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-10-06  8:34   ` Michal Hocko
@ 2020-10-06  8:40     ` David Hildenbrand
  2020-10-06 10:05       ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2020-10-06  8:40 UTC (permalink / raw)
  To: Michal Hocko, Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, Oscar Salvador, Joonsoo Kim

On 06.10.20 10:34, Michal Hocko wrote:
> On Tue 22-09-20 16:37:12, Vlastimil Babka wrote:
>> Page isolation can race with process freeing pages to pcplists in a way that
>> a page from isolated pageblock can end up on pcplist. This can be fixed by
>> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
>> per-cpu pages again during memory offline" in [1].
>>
>> David and Michal would prefer that this race was closed in a way that callers
>> of page isolation who need stronger guarantees don't need to repeatedly drain.
>> David suggested disabling pcplists usage completely during page isolation,
>> instead of repeatedly draining them.
>>
>> To achieve this without adding special cases in alloc/free fastpath, we can use
>> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
>> will be immediately flushed.
>>
>> The race can thus be closed by setting pcp->high to 0 and draining pcplists
>> once, before calling start_isolate_page_range(). The draining will serialize
>> after processes that already disabled interrupts and read the old value of
>> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
>> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
>> pcplists. This guarantees no stray pages on pcplists in zones where isolation
>> happens.
>>
>> This patch thus adds zone_pcplist_disable() and zone_pcplist_enable() functions
>> that page isolation users can call before start_isolate_page_range() and after
>> unisolating (or offlining) the isolated pages. A new zone->pcplist_disabled
>> atomic variable makes sure we disable pcplists only once and don't enable
>> them prematurely in case there are multiple users in parallel.
>>
>> We however have to avoid external updates to high and batch by taking
>> pcp_batch_high_lock. To allow multiple isolations in parallel, change this lock
>> from mutex to rwsem.
> 
> The overall idea makes sense. I just suspect you are overcomplicating
> the implementation a bit. Is there any reason that we cannot start with
> a really dumb implementation first? The only user of this functionality
> is the memory offlining and that is already strongly synchronized
> (mem_hotplug_begin) so a lot of trickery can be dropped here. Should we
> find a new user later on we can make the implementation finer grained
> but now it will not serve any purpose. So can we simply update pcp->high
> and drain all pcp in the given zone and wait for all remote pcp draining
> in zone_pcplist_disable and revert all that in zone_pcplist_enable?
> We can stick to the existing pcp_batch_high_lock.
> 
> What do you think?
> 

My two cents, we might want to make use of this in some cases of
alloc_contig_range() soon ("try hard mode"). So I'd love to see a
synchronized mechanism. However, that can be factored out into a
separate patch, so this patch gets significantly simpler.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation
  2020-10-06  8:40     ` David Hildenbrand
@ 2020-10-06 10:05       ` Michal Hocko
  0 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2020-10-06 10:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, linux-mm, linux-kernel, Pavel Tatashin,
	Oscar Salvador, Joonsoo Kim

On Tue 06-10-20 10:40:23, David Hildenbrand wrote:
> On 06.10.20 10:34, Michal Hocko wrote:
> > On Tue 22-09-20 16:37:12, Vlastimil Babka wrote:
> >> Page isolation can race with process freeing pages to pcplists in a way that
> >> a page from isolated pageblock can end up on pcplist. This can be fixed by
> >> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
> >> per-cpu pages again during memory offline" in [1].
> >>
> >> David and Michal would prefer that this race was closed in a way that callers
> >> of page isolation who need stronger guarantees don't need to repeatedly drain.
> >> David suggested disabling pcplists usage completely during page isolation,
> >> instead of repeatedly draining them.
> >>
> >> To achieve this without adding special cases in alloc/free fastpath, we can use
> >> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> >> will be immediately flushed.
> >>
> >> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> >> once, before calling start_isolate_page_range(). The draining will serialize
> >> after processes that already disabled interrupts and read the old value of
> >> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
> >> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> >> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> >> happens.
> >>
> >> This patch thus adds zone_pcplist_disable() and zone_pcplist_enable() functions
> >> that page isolation users can call before start_isolate_page_range() and after
> >> unisolating (or offlining) the isolated pages. A new zone->pcplist_disabled
> >> atomic variable makes sure we disable pcplists only once and don't enable
> >> them prematurely in case there are multiple users in parallel.
> >>
> >> We however have to avoid external updates to high and batch by taking
> >> pcp_batch_high_lock. To allow multiple isolations in parallel, change this lock
> >> from mutex to rwsem.
> > 
> > The overall idea makes sense. I just suspect you are overcomplicating
> > the implementation a bit. Is there any reason that we cannot start with
> > a really dumb implementation first? The only user of this functionality
> > is the memory offlining and that is already strongly synchronized
> > (mem_hotplug_begin) so a lot of trickery can be dropped here. Should we
> > find a new user later on we can make the implementation finer grained
> > but now it will not serve any purpose. So can we simply update pcp->high
> > and drain all pcp in the given zone and wait for all remote pcp draining
> > in zone_pcplist_disable and revert all that in zone_pcplist_enable?
> > We can stick to the existing pcp_batch_high_lock.
> > 
> > What do you think?
> > 
> 
> My two cents, we might want to make use of this in some cases of
> alloc_contig_range() soon ("try hard mode"). So I'd love to see a
> synchronized mechanism. However, that can be factored out into a
> separate patch, so this patch gets significantly simpler.

Exactly. And the incremental patch can be added along with the a-c-r try
harder mode.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/9] mm, page_alloc: calculate pageset high and batch once per zone
  2020-10-05 12:52   ` Michal Hocko
@ 2020-10-06 22:04     ` Vlastimil Babka
  0 siblings, 0 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-10-06 22:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On 10/5/20 2:52 PM, Michal Hocko wrote:
> On Tue 22-09-20 16:37:05, Vlastimil Babka wrote:
>> We currently call pageset_set_high_and_batch() for each possible cpu, which
>> repeats the same calculations of high and batch values.
>> 
>> Instead call the function just once per zone, and make it apply the calculated
>> values to all per-cpu pagesets of the zone.
>> 
>> This also allows removing the zone_pageset_init() and __zone_pcp_update()
>> wrappers.
>> 
>> No functional change.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> I like this. One question below
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks.

> I hope I am not misreading the diff but it seems that setup_zone_pageset
> is calling pageset_init which is then done again by
> zone_set_pageset_high_and_batch as a part of pageset_update

No, pageset_init() is not called again from there, so it must be insufficient
diff context giving that impression.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/9] mm, page_alloc: remove setup_pageset()
  2020-10-05 12:59   ` Michal Hocko
@ 2020-10-06 22:11     ` Vlastimil Babka
  0 siblings, 0 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-10-06 22:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On 10/5/20 2:59 PM, Michal Hocko wrote:
> On Tue 22-09-20 16:37:06, Vlastimil Babka wrote:
>> We initialize boot-time pagesets with setup_pageset(), which sets high and
>> batch values that effectively disable pcplists.
>> 
>> We can remove this wrapper if we just set these values for all pagesets in
>> pageset_init(). Non-boot pagesets then subsequently update them to the proper
>> values.
>> 
>> No functional change.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  mm/page_alloc.c | 17 ++++++++++-------
>>  1 file changed, 10 insertions(+), 7 deletions(-)
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 26069c8d1b19..76c2b4578723 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5823,7 +5823,7 @@ static void build_zonelists(pg_data_t *pgdat)
>>   * not check if the processor is online before following the pageset pointer.
>>   * Other parts of the kernel may not check if the zone is available.
>>   */
>> -static void setup_pageset(struct per_cpu_pageset *p);
>> +static void pageset_init(struct per_cpu_pageset *p);
>>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>>  
>> @@ -5891,7 +5891,7 @@ build_all_zonelists_init(void)
>>  	 * (a chicken-egg dilemma).
>>  	 */
>>  	for_each_possible_cpu(cpu)
>> -		setup_pageset(&per_cpu(boot_pageset, cpu));
>> +		pageset_init(&per_cpu(boot_pageset, cpu));
>>  
>>  	mminit_verify_zonelist();
>>  	cpuset_init_current_mems_allowed();
>> @@ -6210,12 +6210,15 @@ static void pageset_init(struct per_cpu_pageset *p)
>>  	pcp = &p->pcp;
>>  	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
>>  		INIT_LIST_HEAD(&pcp->lists[migratetype]);
>> -}
>>  
>> -static void setup_pageset(struct per_cpu_pageset *p)
>> -{
>> -	pageset_init(p);
>> -	pageset_update(&p->pcp, 0, 1);
>> +	/*
>> +	 * Set batch and high values safe for a boot pageset. A true percpu
>> +	 * pageset's initialization will update them subsequently. Here we don't
>> +	 * need to be as careful as pageset_update() as nobody can access the
>> +	 * pageset yet.
> 
> Isn't this slightly misleading? pageset_init is called from setup_zone_pageset
> which is called from the memory hotplug as well. Isn't this more about

It is.

> early zone initialization rather than boot pagesets? Or am I misreading
> the patch?

It's for boot pagesets (called from build_all_zonelists_init()) that we need
these particular values (well, just high = 0). For early zone initialization
during hotplug the values don't matter, as the caller is setup_zone_pageset(),
which immediately replaces the values via zone_set_pageset_high_and_batch().
Which is what the comment tries to say.

>> +	 */
>> +	pcp->high = 0;
>> +	pcp->batch = 1;
>>  }
>>  
>>  /*
>> -- 
>> 2.28.0
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm, page_alloc: make per_cpu_pageset accessible only after init
  2020-10-05 13:24   ` Michal Hocko
@ 2020-10-06 22:28     ` Vlastimil Babka
  0 siblings, 0 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-10-06 22:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On 10/5/20 3:24 PM, Michal Hocko wrote:
> On Tue 22-09-20 16:37:08, Vlastimil Babka wrote:
>> setup_zone_pageset() replaces the boot_pageset by allocating and initializing a
>> proper percpu one. Currently it assigns zone->pageset with the newly allocated
>> one before initializing it. That's currently not an issue, because the zone
>> should not be in any zonelist, thus not visible to allocators at this point.
>> 
>> Memory ordering between the pcplist contents and its visibility is also not
>> guaranteed here, but that also shouldn't be an issue because online_pages()
>> does a spin_unlock(pgdat->node_size_lock) before building the zonelists.
>> 
>> However it's best that we don't silently rely on operations that can be changed
>> in the future. Make sure only properly initialized pcplists are visible, using
>> smp_store_release(). The read side has a data dependency via the zone->pageset
>> pointer instead of an explicit read barrier.
> 
> Heh, this looks like inventing a trap similar to the one the previous
> patch was removing. But more seriously, considering that we need locking
> for the whole setup, wouldn't it be better to simply document the locking
> requirements rather than adding scary looking barriers that our future
> selves or somebody else will need to scratch their heads about? I am
> pretty sure we don't do anything like that when initializing a numa node
> or other data structures that might be allocated during memory hotadd.

Yeah, it will be best to drop this for now. I just looked closely at the
build_zonelists() implementation and got scared by how it just modifies the
list that others might be reading at the same time. zoneref_set_zone() sets
zoneref->zone and zoneref->zone_idx without any sync; what if somebody, e.g.
for_next_zone_zonelist_nodemask(), makes a decision based on zone_idx and then
picks the wrong zone pointer? etc.

>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  mm/page_alloc.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 99b74c1c2b0a..de3b48bda45c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6246,15 +6246,17 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>>  
>>  void __meminit setup_zone_pageset(struct zone *zone)
>>  {
>> +	struct per_cpu_pageset __percpu * new_pageset;
>>  	struct per_cpu_pageset *p;
>>  	int cpu;
>>  
>> -	zone->pageset = alloc_percpu(struct per_cpu_pageset);
>> +	new_pageset = alloc_percpu(struct per_cpu_pageset);
>>  	for_each_possible_cpu(cpu) {
>> -		p = per_cpu_ptr(zone->pageset, cpu);
>> +		p = per_cpu_ptr(new_pageset, cpu);
>>  		pageset_init(p);
>>  	}
>>  
>> +	smp_store_release(&zone->pageset, new_pageset);
>>  	zone_set_pageset_high_and_batch(zone);
>>  }
>>  
>> -- 
>> 2.28.0
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone
  2020-09-25 10:34   ` David Hildenbrand
@ 2020-10-06 22:31     ` Vlastimil Babka
  0 siblings, 0 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-10-06 22:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, Oscar Salvador, Joonsoo Kim

On 9/25/20 12:34 PM, David Hildenbrand wrote:
> On 22.09.20 16:37, Vlastimil Babka wrote:
>> @@ -6300,6 +6310,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
>>  	 * offset of a (static) per cpu variable into the per cpu area.
>>  	 */
>>  	zone->pageset = &boot_pageset;
>> +	zone->pageset_high = BOOT_PAGESET_HIGH;
>> +	zone->pageset_batch = BOOT_PAGESET_BATCH;
> 
> I do wonder if copying from any cpuvar inside boot_pageset is cleaner.
> 
> zone->pageset_high = &this_cpu_ptr(zone->pageset)->pcp.high;

Uh, I don't know. That would be like admitting they can differ from what was 
initialized. But then they could also differ depending on which CPU we happen 
to run it on. That's why I added the #define BOOT_PAGESET_* constants in the 
first place - to ensure the same values are used in both places. Makes sense?
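
For illustration, a rough sketch of the two places the shared constants tie
together. The function bodies are heavily trimmed and the names are suffixed
with _sketch to make clear this is not the series' code; the concrete constant
values and init details are my reading of the series, so treat them as
assumptions:

#include <linux/mmzone.h>
#include <linux/percpu.h>
#include <linux/string.h>

/* Values that keep the boot pageset effectively disabled (assumed here). */
#define BOOT_PAGESET_HIGH	0
#define BOOT_PAGESET_BATCH	1

static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);

/*
 * Place 1 - sketch of pageset_init(): the pcp fields start from the shared
 * constants (the real function also initializes the pcp lists, omitted here).
 */
static void pageset_init_sketch(struct per_cpu_pageset *p)
{
	memset(p, 0, sizeof(*p));
	p->pcp.high = BOOT_PAGESET_HIGH;
	p->pcp.batch = BOOT_PAGESET_BATCH;
}

/*
 * Place 2 - sketch of zone_pcp_init(): the copy cached in struct zone uses
 * the very same constants, so it cannot silently diverge from the boot
 * pagesets, no matter which CPU happens to run this.
 */
static void zone_pcp_init_sketch(struct zone *zone)
{
	zone->pageset = &boot_pageset;
	zone->pageset_high = BOOT_PAGESET_HIGH;
	zone->pageset_batch = BOOT_PAGESET_BATCH;
}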




* Re: [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone
  2020-10-05 13:28   ` Michal Hocko
@ 2020-10-06 22:34     ` Vlastimil Babka
  0 siblings, 0 replies; 41+ messages in thread
From: Vlastimil Babka @ 2020-10-06 22:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On 10/5/20 3:28 PM, Michal Hocko wrote:
> On Tue 22-09-20 16:37:09, Vlastimil Babka wrote:
>> All per-cpu pagesets for a zone use the same high and batch values, which are
>> duplicated there just for performance (locality) reasons. This patch also adds
>> the same variables to struct zone as a shared copy.
>> 
>> This will be useful later for making it possible to disable pcplists
>> temporarily by setting the high value to 0, while remembering the previous
>> values so they can be restored later. But we also benefit immediately from not
>> updating the pagesets of all possible CPUs when the newly recalculated values
>> (after a sysctl change or memory online/offline) are unchanged from the
>> previous ones.
> 
> The advantage of this patch is not really clear from it in isolation. Maybe
> merge it with the patch that uses the duplicated state.

I'm not sure that would help reviewability, as the patch that uses it is the 
last and largest one. And there is already a small advantage right away, as the 
changelog explains.
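
To make that immediate advantage concrete, here is a simplified sketch of the
kind of check the changelog refers to - not the series' exact function, and the
ordering details of the real per-cpu update are omitted:

#include <linux/mmzone.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>

static void zone_set_high_and_batch_sketch(struct zone *zone,
					   unsigned long new_high,
					   unsigned long new_batch)
{
	int cpu;

	/*
	 * The copy cached in struct zone is what makes this check possible:
	 * a recalculation that produced the same values becomes a no-op.
	 */
	if (zone->pageset_high == new_high &&
	    zone->pageset_batch == new_batch)
		return;

	zone->pageset_high = new_high;
	zone->pageset_batch = new_batch;

	/* Only when something changed do we pay for walking all possible CPUs. */
	for_each_possible_cpu(cpu) {
		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);

		WRITE_ONCE(p->pcp.batch, new_batch);
		WRITE_ONCE(p->pcp.high, new_high);
	}
}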



end of thread (newest message: 2020-10-06 22:34 UTC)

Thread overview: 41+ messages
2020-09-22 14:37 [PATCH 0/9] disable pcplists during memory offline Vlastimil Babka
2020-09-22 14:37 ` [PATCH 1/9] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
2020-09-25 10:18   ` David Hildenbrand
2020-10-05 12:03   ` Michal Hocko
2020-09-22 14:37 ` [PATCH 2/9] mm, page_alloc: calculate pageset high and batch once per zone Vlastimil Babka
2020-10-05 12:52   ` Michal Hocko
2020-10-06 22:04     ` Vlastimil Babka
2020-09-22 14:37 ` [PATCH 3/9] mm, page_alloc: remove setup_pageset() Vlastimil Babka
2020-09-25 10:19   ` David Hildenbrand
2020-10-05 12:59   ` Michal Hocko
2020-10-06 22:11     ` Vlastimil Babka
2020-09-22 14:37 ` [PATCH 4/9] mm, page_alloc: simplify pageset_update() Vlastimil Babka
2020-09-25 10:23   ` David Hildenbrand
2020-10-05 13:20   ` Michal Hocko
2020-09-22 14:37 ` [PATCH 5/9] mm, page_alloc: make per_cpu_pageset accessible only after init Vlastimil Babka
2020-09-25 10:25   ` David Hildenbrand
2020-10-05 13:24   ` Michal Hocko
2020-10-06 22:28     ` Vlastimil Babka
2020-09-22 14:37 ` [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone Vlastimil Babka
2020-09-25 10:34   ` David Hildenbrand
2020-10-06 22:31     ` Vlastimil Babka
2020-10-05 13:28   ` Michal Hocko
2020-10-06 22:34     ` Vlastimil Babka
2020-09-22 14:37 ` [PATCH 7/9] mm, page_alloc: move draining pcplists to page isolation users Vlastimil Babka
2020-09-25 10:39   ` David Hildenbrand
2020-10-05 13:57   ` Michal Hocko
2020-09-22 14:37 ` [PATCH 8/9] mm, page_alloc: drain all pcplists during memory offline Vlastimil Babka
2020-09-25 10:46   ` David Hildenbrand
2020-10-05 14:03     ` Michal Hocko
2020-09-22 14:37 ` [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation Vlastimil Babka
2020-09-25 10:53   ` David Hildenbrand
2020-09-25 10:54     ` David Hildenbrand
2020-09-25 11:10       ` Vlastimil Babka
2020-10-01  8:47         ` David Hildenbrand
2020-10-05 14:05         ` Michal Hocko
2020-10-05 14:22           ` Vlastimil Babka
2020-10-05 16:56             ` Michal Hocko
2020-10-06  8:34   ` Michal Hocko
2020-10-06  8:40     ` David Hildenbrand
2020-10-06 10:05       ` Michal Hocko
2020-09-22 17:15 ` [PATCH 0/9] disable pcplists during memory offline David Hildenbrand
