linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/7] disable pcplists during memory offline
@ 2020-10-08 11:41 Vlastimil Babka
  2020-10-08 11:41 ` [PATCH v2 1/7] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 11:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

Changes since v1 [7]:

- add acks/reviews (thanks David and Michal)
- drop "mm, page_alloc: make per_cpu_pageset accessible only after init" as
  that's orthogonal and needs more consideration
- squash "mm, page_alloc: drain all pcplists during memory offline" into the
  last patch, and move new zone_pcp_* functions into mm/page_alloc. As such,
  the new 'force all cpus' param of __drain_all_pages() is never exported
  outside page_alloc.c so I didn't add a new wrapper function to hide the bool
- keep pcp_batch_high_lock a mutex as offline_pages is synchronized anyway,
  as suggested by Michal. Thus we don't need atomic variable and sync around
  it, and patch is much smaller. If alloc_contic_range() wants to use the new
  functionality and keep parallelism, we can add that on top.

As per the discussions [1] [2] this is an attempt to implement David's
suggestion that page isolation should disable pcplists to avoid races with page
freeing in progress. This is done without extra checks in fast paths, as
explained in Patch 9. The repeated draining done by [2] is then no longer
needed. Previous version (RFC) is at [3].

The RFC tried to hide pcplists disabling/enabling into page isolation, but it
wasn't completely possible, as memory offline does not unisolation. Michal
suggested an explicit API in [4] so that's the current implementation and it
seems indeed nicer.

Once we accept that page isolation users need to do explicit actions around it
depending on the needed guarantees, we can also IMHO accept that the current
pcplist draining can be also done by the callers, which is more effective.
After all, there are only two users of page isolation. So patch 6 does
effectively the same thing as Pavel proposed in [5], and patch 7 implement
stronger guarantees only for memory offline. If CMA decides to opt-in to the
stronger guarantee, it can be added later.

Patches 1-5 are preparatory cleanups for pcplist disabling.

Patchset was briefly tested in QEMU so that memory online/offline works, but
I haven't done a stress test that would prove the race fixed by [2] is
eliminated.

Note that patch 7 could be avoided if we instead adjusted page freeing in shown
in [6], but I believe the current implementation of disabling pcplists is not
too much complex, so I would prefer this instead of adding new checks and
longer irq-disabled section into page freeing hotpaths.

[1] https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com/
[2] https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatashin@soleen.com/
[3] https://lore.kernel.org/linux-mm/20200907163628.26495-1-vbabka@suse.cz/
[4] https://lore.kernel.org/linux-mm/20200909113647.GG7348@dhcp22.suse.cz/
[5] https://lore.kernel.org/linux-mm/20200904151448.100489-3-pasha.tatashin@soleen.com/
[6] https://lore.kernel.org/linux-mm/3d3b53db-aeaa-ff24-260b-36427fac9b1c@suse.cz/
[7] https://lore.kernel.org/linux-mm/20200922143712.12048-1-vbabka@suse.cz/

Vlastimil Babka (7):
  mm, page_alloc: clean up pageset high and batch update
  mm, page_alloc: calculate pageset high and batch once per zone
  mm, page_alloc: remove setup_pageset()
  mm, page_alloc: simplify pageset_update()
  mm, page_alloc: cache pageset high and batch in struct zone
  mm, page_alloc: move draining pcplists to page isolation users
  mm, page_alloc: disable pcplists during memory offline

 include/linux/mmzone.h |   6 ++
 mm/internal.h          |   2 +
 mm/memory_hotplug.c    |  27 +++---
 mm/page_alloc.c        | 198 ++++++++++++++++++++++++-----------------
 mm/page_isolation.c    |  10 +--
 5 files changed, 143 insertions(+), 100 deletions(-)

-- 
2.28.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 1/7] mm, page_alloc: clean up pageset high and batch update
  2020-10-08 11:41 [PATCH v2 0/7] disable pcplists during memory offline Vlastimil Babka
@ 2020-10-08 11:41 ` Vlastimil Babka
  2020-10-25 14:17   ` Mike Rapoport
  2020-10-08 11:41 ` [PATCH v2 2/7] mm, page_alloc: calculate pageset high and batch once per zone Vlastimil Babka
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 11:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka, Michal Hocko

The updates to pcplists' high and batch valued are handled by multiple
functions that make the calculations hard to follow. Consolidate everything
to pageset_set_high_and_batch() and remove pageset_set_batch() and
pageset_set_high() wrappers.

The only special case using one of the removed wrappers was:
build_all_zonelists_init()
  setup_pageset()
    pageset_set_batch()
which was hardcoding batch as 0, so we can just open-code a call to
pageset_update() with constant parameters instead.

No functional change.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 49 ++++++++++++++++++++-----------------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e74ca22baaa1..81f360c0d9fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5899,7 +5899,7 @@ static void build_zonelists(pg_data_t *pgdat)
  * not check if the processor is online before following the pageset pointer.
  * Other parts of the kernel may not check if the zone is available.
  */
-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
+static void setup_pageset(struct per_cpu_pageset *p);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
@@ -5967,7 +5967,7 @@ build_all_zonelists_init(void)
 	 * (a chicken-egg dilemma).
 	 */
 	for_each_possible_cpu(cpu)
-		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
+		setup_pageset(&per_cpu(boot_pageset, cpu));
 
 	mminit_verify_zonelist();
 	cpuset_init_current_mems_allowed();
@@ -6276,12 +6276,6 @@ static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
 	pcp->batch = batch;
 }
 
-/* a companion to pageset_set_high() */
-static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
-{
-	pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
-}
-
 static void pageset_init(struct per_cpu_pageset *p)
 {
 	struct per_cpu_pages *pcp;
@@ -6294,35 +6288,32 @@ static void pageset_init(struct per_cpu_pageset *p)
 		INIT_LIST_HEAD(&pcp->lists[migratetype]);
 }
 
-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+static void setup_pageset(struct per_cpu_pageset *p)
 {
 	pageset_init(p);
-	pageset_set_batch(p, batch);
+	pageset_update(&p->pcp, 0, 1);
 }
 
 /*
- * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
- * to the value high for the pageset p.
+ * Calculate and set new high and batch values for given per-cpu pageset of a
+ * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
  */
-static void pageset_set_high(struct per_cpu_pageset *p,
-				unsigned long high)
-{
-	unsigned long batch = max(1UL, high / 4);
-	if ((high / 4) > (PAGE_SHIFT * 8))
-		batch = PAGE_SHIFT * 8;
-
-	pageset_update(&p->pcp, high, batch);
-}
-
 static void pageset_set_high_and_batch(struct zone *zone,
-				       struct per_cpu_pageset *pcp)
+				       struct per_cpu_pageset *p)
 {
-	if (percpu_pagelist_fraction)
-		pageset_set_high(pcp,
-			(zone_managed_pages(zone) /
-				percpu_pagelist_fraction));
-	else
-		pageset_set_batch(pcp, zone_batchsize(zone));
+	unsigned long new_high, new_batch;
+
+	if (percpu_pagelist_fraction) {
+		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
+		new_batch = max(1UL, new_high / 4);
+		if ((new_high / 4) > (PAGE_SHIFT * 8))
+			new_batch = PAGE_SHIFT * 8;
+	} else {
+		new_batch = zone_batchsize(zone);
+		new_high = 6 * new_batch;
+		new_batch = max(1UL, 1 * new_batch);
+	}
+	pageset_update(&p->pcp, new_high, new_batch);
 }
 
 static void __meminit zone_pageset_init(struct zone *zone, int cpu)
-- 
2.28.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 2/7] mm, page_alloc: calculate pageset high and batch once per zone
  2020-10-08 11:41 [PATCH v2 0/7] disable pcplists during memory offline Vlastimil Babka
  2020-10-08 11:41 ` [PATCH v2 1/7] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
@ 2020-10-08 11:41 ` Vlastimil Babka
  2020-10-08 11:41 ` [PATCH v2 3/7] mm, page_alloc: remove setup_pageset() Vlastimil Babka
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 11:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka, Michal Hocko

We currently call pageset_set_high_and_batch() for each possible cpu, which
repeats the same calculations of high and batch values.

Instead call the function just once per zone, and make it apply the calculated
values to all per-cpu pagesets of the zone.

This also allows removing the zone_pageset_init() and __zone_pcp_update()
wrappers.

No functional change.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 42 ++++++++++++++++++------------------------
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81f360c0d9fc..463f40b12aca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6295,13 +6295,14 @@ static void setup_pageset(struct per_cpu_pageset *p)
 }
 
 /*
- * Calculate and set new high and batch values for given per-cpu pageset of a
+ * Calculate and set new high and batch values for all per-cpu pagesets of a
  * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
  */
-static void pageset_set_high_and_batch(struct zone *zone,
-				       struct per_cpu_pageset *p)
+static void zone_set_pageset_high_and_batch(struct zone *zone)
 {
 	unsigned long new_high, new_batch;
+	struct per_cpu_pageset *p;
+	int cpu;
 
 	if (percpu_pagelist_fraction) {
 		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
@@ -6313,23 +6314,25 @@ static void pageset_set_high_and_batch(struct zone *zone,
 		new_high = 6 * new_batch;
 		new_batch = max(1UL, 1 * new_batch);
 	}
-	pageset_update(&p->pcp, new_high, new_batch);
-}
-
-static void __meminit zone_pageset_init(struct zone *zone, int cpu)
-{
-	struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
 
-	pageset_init(pcp);
-	pageset_set_high_and_batch(zone, pcp);
+	for_each_possible_cpu(cpu) {
+		p = per_cpu_ptr(zone->pageset, cpu);
+		pageset_update(&p->pcp, new_high, new_batch);
+	}
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
 {
+	struct per_cpu_pageset *p;
 	int cpu;
+
 	zone->pageset = alloc_percpu(struct per_cpu_pageset);
-	for_each_possible_cpu(cpu)
-		zone_pageset_init(zone, cpu);
+	for_each_possible_cpu(cpu) {
+		p = per_cpu_ptr(zone->pageset, cpu);
+		pageset_init(p);
+	}
+
+	zone_set_pageset_high_and_batch(zone);
 }
 
 /*
@@ -8063,15 +8066,6 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
-static void __zone_pcp_update(struct zone *zone)
-{
-	unsigned int cpu;
-
-	for_each_possible_cpu(cpu)
-		pageset_set_high_and_batch(zone,
-				per_cpu_ptr(zone->pageset, cpu));
-}
-
 /*
  * percpu_pagelist_fraction - changes the pcp->high for each zone on each
  * cpu.  It is the fraction of total pages in each zone that a hot per cpu
@@ -8104,7 +8098,7 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 		goto out;
 
 	for_each_populated_zone(zone)
-		__zone_pcp_update(zone);
+		zone_set_pageset_high_and_batch(zone);
 out:
 	mutex_unlock(&pcp_batch_high_lock);
 	return ret;
@@ -8711,7 +8705,7 @@ EXPORT_SYMBOL(free_contig_range);
 void __meminit zone_pcp_update(struct zone *zone)
 {
 	mutex_lock(&pcp_batch_high_lock);
-	__zone_pcp_update(zone);
+	zone_set_pageset_high_and_batch(zone);
 	mutex_unlock(&pcp_batch_high_lock);
 }
 
-- 
2.28.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 3/7] mm, page_alloc: remove setup_pageset()
  2020-10-08 11:41 [PATCH v2 0/7] disable pcplists during memory offline Vlastimil Babka
  2020-10-08 11:41 ` [PATCH v2 1/7] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
  2020-10-08 11:41 ` [PATCH v2 2/7] mm, page_alloc: calculate pageset high and batch once per zone Vlastimil Babka
@ 2020-10-08 11:41 ` Vlastimil Babka
  2020-10-08 12:23   ` Michal Hocko
  2020-10-22 12:34   ` Oscar Salvador
  2020-10-08 11:41 ` [PATCH v2 4/7] mm, page_alloc: simplify pageset_update() Vlastimil Babka
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 11:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

We initialize boot-time pagesets with setup_pageset(), which sets high and
batch values that effectively disable pcplists.

We can remove this wrapper if we just set these values for all pagesets in
pageset_init(). Non-boot pagesets then subsequently update them to the proper
values.

No functional change.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 463f40b12aca..f827b42a2475 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5899,7 +5899,7 @@ static void build_zonelists(pg_data_t *pgdat)
  * not check if the processor is online before following the pageset pointer.
  * Other parts of the kernel may not check if the zone is available.
  */
-static void setup_pageset(struct per_cpu_pageset *p);
+static void pageset_init(struct per_cpu_pageset *p);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
@@ -5967,7 +5967,7 @@ build_all_zonelists_init(void)
 	 * (a chicken-egg dilemma).
 	 */
 	for_each_possible_cpu(cpu)
-		setup_pageset(&per_cpu(boot_pageset, cpu));
+		pageset_init(&per_cpu(boot_pageset, cpu));
 
 	mminit_verify_zonelist();
 	cpuset_init_current_mems_allowed();
@@ -6286,12 +6286,15 @@ static void pageset_init(struct per_cpu_pageset *p)
 	pcp = &p->pcp;
 	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
 		INIT_LIST_HEAD(&pcp->lists[migratetype]);
-}
 
-static void setup_pageset(struct per_cpu_pageset *p)
-{
-	pageset_init(p);
-	pageset_update(&p->pcp, 0, 1);
+	/*
+	 * Set batch and high values safe for a boot pageset. A true percpu
+	 * pageset's initialization will update them subsequently. Here we don't
+	 * need to be as careful as pageset_update() as nobody can access the
+	 * pageset yet.
+	 */
+	pcp->high = 0;
+	pcp->batch = 1;
 }
 
 /*
-- 
2.28.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 4/7] mm, page_alloc: simplify pageset_update()
  2020-10-08 11:41 [PATCH v2 0/7] disable pcplists during memory offline Vlastimil Babka
                   ` (2 preceding siblings ...)
  2020-10-08 11:41 ` [PATCH v2 3/7] mm, page_alloc: remove setup_pageset() Vlastimil Babka
@ 2020-10-08 11:41 ` Vlastimil Babka
  2020-10-22 12:39   ` Oscar Salvador
  2020-10-08 11:41 ` [PATCH v2 5/7] mm, page_alloc: cache pageset high and batch in struct zone Vlastimil Babka
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 11:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka, Michal Hocko

pageset_update() attempts to update pcplist's high and batch values in a way
that readers don't observe batch > high. It uses smp_wmb() to order the updates
in a way to achieve this. However, without proper pairing read barriers in
readers this guarantee doesn't hold, and there are no such barriers in
e.g. free_unref_page_commit().

Commit 88e8ac11d2ea ("mm, page_alloc: fix core hung in free_pcppages_bulk()")
already showed this is problematic, and solved this by ultimately only trusing
pcp->count of the current cpu with interrupts disabled.

The update dance with unpaired write barriers thus makes no sense. Replace
them with plain WRITE_ONCE to prevent store tearing, and document that the
values can change asynchronously and should not be trusted for correctness.

All current readers appear to be OK after 88e8ac11d2ea. Convert them to
READ_ONCE to prevent unnecessary read tearing, but mainly to alert anybody
making future changes to the code that special care is needed.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 40 ++++++++++++++++++----------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f827b42a2475..f33c36312eb5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1344,7 +1344,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
-	int prefetch_nr = 0;
+	int prefetch_nr = READ_ONCE(pcp->batch);
 	bool isolated_pageblocks;
 	struct page *page, *tmp;
 	LIST_HEAD(head);
@@ -1395,8 +1395,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			 * avoid excessive prefetching due to large count, only
 			 * prefetch buddy for the first pcp->batch nr of pages.
 			 */
-			if (prefetch_nr++ < pcp->batch)
+			if (prefetch_nr) {
 				prefetch_buddy(page);
+				prefetch_nr--;
+			}
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
@@ -3190,10 +3192,8 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list_add(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
-	if (pcp->count >= pcp->high) {
-		unsigned long batch = READ_ONCE(pcp->batch);
-		free_pcppages_bulk(zone, batch, pcp);
-	}
+	if (pcp->count >= READ_ONCE(pcp->high))
+		free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
 }
 
 /*
@@ -3378,7 +3378,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 	do {
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
-					pcp->batch, list,
+					READ_ONCE(pcp->batch), list,
 					migratetype, alloc_flags);
 			if (unlikely(list_empty(list)))
 				return NULL;
@@ -6250,13 +6250,16 @@ static int zone_batchsize(struct zone *zone)
 }
 
 /*
- * pcp->high and pcp->batch values are related and dependent on one another:
- * ->batch must never be higher then ->high.
- * The following function updates them in a safe manner without read side
- * locking.
+ * pcp->high and pcp->batch values are related and generally batch is lower
+ * than high. They are also related to pcp->count such that count is lower
+ * than high, and as soon as it reaches high, the pcplist is flushed.
  *
- * Any new users of pcp->batch and pcp->high should ensure they can cope with
- * those fields changing asynchronously (acording to the above rule).
+ * However, guaranteeing these relations at all times would require e.g. write
+ * barriers here but also careful usage of read barriers at the read side, and
+ * thus be prone to error and bad for performance. Thus the update only prevents
+ * store tearing. Any new users of pcp->batch and pcp->high should ensure they
+ * can cope with those fields changing asynchronously, and fully trust only the
+ * pcp->count field on the local CPU with interrupts disabled.
  *
  * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
  * outside of boot time (or some other assurance that no concurrent updaters
@@ -6265,15 +6268,8 @@ static int zone_batchsize(struct zone *zone)
 static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
 		unsigned long batch)
 {
-       /* start with a fail safe value for batch */
-	pcp->batch = 1;
-	smp_wmb();
-
-       /* Update high, then batch, in order */
-	pcp->high = high;
-	smp_wmb();
-
-	pcp->batch = batch;
+	WRITE_ONCE(pcp->batch, batch);
+	WRITE_ONCE(pcp->high, high);
 }
 
 static void pageset_init(struct per_cpu_pageset *p)
-- 
2.28.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 5/7] mm, page_alloc: cache pageset high and batch in struct zone
  2020-10-08 11:41 [PATCH v2 0/7] disable pcplists during memory offline Vlastimil Babka
                   ` (3 preceding siblings ...)
  2020-10-08 11:41 ` [PATCH v2 4/7] mm, page_alloc: simplify pageset_update() Vlastimil Babka
@ 2020-10-08 11:41 ` Vlastimil Babka
  2020-10-08 12:31   ` Michal Hocko
  2020-10-08 11:42 ` [PATCH v2 6/7] mm, page_alloc: move draining pcplists to page isolation users Vlastimil Babka
  2020-10-08 11:42 ` [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline Vlastimil Babka
  6 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 11:41 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka

All per-cpu pagesets for a zone use the same high and batch values, that are
duplicated there just for performance (locality) reasons. This patch adds the
same variables also to struct zone as a shared copy.

This will be useful later for making possible to disable pcplists temporarily
by setting high value to 0, while remembering the values for restoring them
later. But we can also immediately benefit from not updating pagesets of all
possible cpus in case the newly recalculated values (after sysctl change or
memory online/offline) are actually unchanged from the previous ones.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  6 ++++++
 mm/page_alloc.c        | 17 +++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..c63863794afc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -470,6 +470,12 @@ struct zone {
 #endif
 	struct pglist_data	*zone_pgdat;
 	struct per_cpu_pageset __percpu *pageset;
+	/*
+	 * the high and batch values are copied to individual pagesets for
+	 * faster access
+	 */
+	int pageset_high;
+	int pageset_batch;
 
 #ifndef CONFIG_SPARSEMEM
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f33c36312eb5..5b98dd5ab006 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5900,6 +5900,9 @@ static void build_zonelists(pg_data_t *pgdat)
  * Other parts of the kernel may not check if the zone is available.
  */
 static void pageset_init(struct per_cpu_pageset *p);
+/* These effectively disable the pcplists in the boot pageset completely */
+#define BOOT_PAGESET_HIGH	0
+#define BOOT_PAGESET_BATCH	1
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
@@ -6289,8 +6292,8 @@ static void pageset_init(struct per_cpu_pageset *p)
 	 * need to be as careful as pageset_update() as nobody can access the
 	 * pageset yet.
 	 */
-	pcp->high = 0;
-	pcp->batch = 1;
+	pcp->high = BOOT_PAGESET_HIGH;
+	pcp->batch = BOOT_PAGESET_BATCH;
 }
 
 /*
@@ -6314,6 +6317,14 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
 		new_batch = max(1UL, 1 * new_batch);
 	}
 
+	if (zone->pageset_high != new_high ||
+	    zone->pageset_batch != new_batch) {
+		zone->pageset_high = new_high;
+		zone->pageset_batch = new_batch;
+	} else {
+		return;
+	}
+
 	for_each_possible_cpu(cpu) {
 		p = per_cpu_ptr(zone->pageset, cpu);
 		pageset_update(&p->pcp, new_high, new_batch);
@@ -6374,6 +6385,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
 	 * offset of a (static) per cpu variable into the per cpu area.
 	 */
 	zone->pageset = &boot_pageset;
+	zone->pageset_high = BOOT_PAGESET_HIGH;
+	zone->pageset_batch = BOOT_PAGESET_BATCH;
 
 	if (populated_zone(zone))
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
-- 
2.28.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 6/7] mm, page_alloc: move draining pcplists to page isolation users
  2020-10-08 11:41 [PATCH v2 0/7] disable pcplists during memory offline Vlastimil Babka
                   ` (4 preceding siblings ...)
  2020-10-08 11:41 ` [PATCH v2 5/7] mm, page_alloc: cache pageset high and batch in struct zone Vlastimil Babka
@ 2020-10-08 11:42 ` Vlastimil Babka
  2020-10-22 12:44   ` Oscar Salvador
  2020-10-08 11:42 ` [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline Vlastimil Babka
  6 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 11:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka, Michal Hocko

Currently, pcplists are drained during set_migratetype_isolate() which means
once per pageblock processed start_isolate_page_range(). This is somewhat
wasteful. Moreover, the callers might need different guarantees, and the
draining is currently prone to races and does not guarantee that no page
from isolated pageblock will end up on the pcplist after the drain.

Better guarantees are added by later patches and require explicit actions
by page isolation users that need them. Thus it makes sense to move the
current imperfect draining to the callers also as a preparation step.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 11 ++++++-----
 mm/page_alloc.c     |  2 ++
 mm/page_isolation.c | 10 +++++-----
 3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b44d4c7ba73b..2e6ad899c55e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1519,6 +1519,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		goto failed_removal;
 	}
 
+	drain_all_pages(zone);
+
 	arg.start_pfn = start_pfn;
 	arg.nr_pages = nr_pages;
 	node_states_check_changes_offline(nr_pages, zone, &arg);
@@ -1569,11 +1571,10 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		}
 
 		/*
-		 * per-cpu pages are drained in start_isolate_page_range, but if
-		 * there are still pages that are not free, make sure that we
-		 * drain again, because when we isolated range we might
-		 * have raced with another thread that was adding pages to pcp
-		 * list.
+		 * per-cpu pages are drained after start_isolate_page_range, but
+		 * if there are still pages that are not free, make sure that we
+		 * drain again, because when we isolated range we might have
+		 * raced with another thread that was adding pages to pcp list.
 		 *
 		 * Forward progress should be still guaranteed because
 		 * pages on the pcp list can only belong to MOVABLE_ZONE
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b98dd5ab006..1f7108fe9a0b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8509,6 +8509,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	if (ret)
 		return ret;
 
+	drain_all_pages(cc.zone);
+
 	/*
 	 * In case of -EBUSY, we'd like to know which page causes problem.
 	 * So, just fall through. test_pages_isolated() has a tracepoint
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index abbf42214485..feab446d1982 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -49,7 +49,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 
 		__mod_zone_freepage_state(zone, -nr_pages, mt);
 		spin_unlock_irqrestore(&zone->lock, flags);
-		drain_all_pages(zone);
 		return 0;
 	}
 
@@ -172,11 +171,12 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  *
  * Please note that there is no strong synchronization with the page allocator
  * either. Pages might be freed while their page blocks are marked ISOLATED.
- * In some cases pages might still end up on pcp lists and that would allow
+ * A call to drain_all_pages() after isolation can flush most of them. However
+ * in some cases pages might still end up on pcp lists and that would allow
  * for their allocation even when they are in fact isolated already. Depending
- * on how strong of a guarantee the caller needs drain_all_pages might be needed
- * (e.g. __offline_pages will need to call it after check for isolated range for
- * a next retry).
+ * on how strong of a guarantee the caller needs, further drain_all_pages()
+ * might be needed (e.g. __offline_pages will need to call it after check for
+ * isolated range for a next retry).
  *
  * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
-- 
2.28.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline
  2020-10-08 11:41 [PATCH v2 0/7] disable pcplists during memory offline Vlastimil Babka
                   ` (5 preceding siblings ...)
  2020-10-08 11:42 ` [PATCH v2 6/7] mm, page_alloc: move draining pcplists to page isolation users Vlastimil Babka
@ 2020-10-08 11:42 ` Vlastimil Babka
  2020-10-08 12:45   ` Michal Hocko
  2020-10-22 12:52   ` Oscar Salvador
  6 siblings, 2 replies; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 11:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Michal Hocko, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, Vlastimil Babka, Michal Hocko

Memory offline relies on page isolation can race with process freeing pages to
pcplists in a way that a page from isolated pageblock can end up on pcplist.
This can be worked around by repeated draining of pcplists, as done by commit
968318261221 ("mm/memory_hotplug: drain per-cpu pages again during memory
offline").

David and Michal would prefer that this race was closed in a way that callers
of page isolation who need stronger guarantees don't need to repeatedly drain.
David suggested disabling pcplists usage completely during page isolation,
instead of repeatedly draining them.

To achieve this without adding special cases in alloc/free fastpath, we can use
the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
will be immediately flushed.

The race can thus be closed by setting pcp->high to 0 and draining pcplists
once, before calling start_isolate_page_range(). The draining will serialize
after processes that already disabled interrupts and read the old value of
pcp->high in free_unref_page_commit(), and processes that have not yet disabled
interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
pcplists. This guarantees no stray pages on pcplists in zones where isolation
happens.

This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
page isolation users can call before start_isolate_page_range() and after
unisolating (or offlining) the isolated pages.

Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
not empty. The check can however race with a free to pcplist that has not yet
increased the pcp->count from 0 to 1. Thus make the drain optionally skip the
racy check and drain on all cpus, and use this option in zone_pcp_disable().

As we have to avoid external updates to high and batch while pcplists are
disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
zone_pcp_enable(). This also synchronizes multiple users of
zone_pcp_disable()/enable().

Currently the only user of this functionality is offline_pages().

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/internal.h       |  2 ++
 mm/memory_hotplug.c | 28 ++++++++----------
 mm/page_alloc.c     | 69 +++++++++++++++++++++++++++++++++++----------
 mm/page_isolation.c |  6 ++--
 4 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c43ccdddb0f6..2966496680bc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -201,6 +201,8 @@ extern int user_min_free_kbytes;
 
 extern void zone_pcp_update(struct zone *zone);
 extern void zone_pcp_reset(struct zone *zone);
+extern void zone_pcp_disable(struct zone *zone);
+extern void zone_pcp_enable(struct zone *zone);
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2e6ad899c55e..4382b585c76c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1510,17 +1510,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	}
 	node = zone_to_nid(zone);
 
+	/*
+	 * Disable pcplists so that page isolation cannot race with freeing
+	 * in a way that pages from isolated pageblock are left on pcplists.
+	 */
+	zone_pcp_disable(zone);
+
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
 				       MIGRATE_MOVABLE,
 				       MEMORY_OFFLINE | REPORT_FAILURE);
 	if (ret) {
 		reason = "failure to isolate range";
-		goto failed_removal;
+		goto failed_removal_pcplists_disabled;
 	}
 
-	drain_all_pages(zone);
-
 	arg.start_pfn = start_pfn;
 	arg.nr_pages = nr_pages;
 	node_states_check_changes_offline(nr_pages, zone, &arg);
@@ -1570,20 +1574,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 			goto failed_removal_isolated;
 		}
 
-		/*
-		 * per-cpu pages are drained after start_isolate_page_range, but
-		 * if there are still pages that are not free, make sure that we
-		 * drain again, because when we isolated range we might have
-		 * raced with another thread that was adding pages to pcp list.
-		 *
-		 * Forward progress should be still guaranteed because
-		 * pages on the pcp list can only belong to MOVABLE_ZONE
-		 * because has_unmovable_pages explicitly checks for
-		 * PageBuddy on freed pages on other zones.
-		 */
 		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
-		if (ret)
-			drain_all_pages(zone);
+
 	} while (ret);
 
 	/* Mark all sections offline and remove free pages from the buddy. */
@@ -1599,6 +1591,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
 	spin_unlock_irqrestore(&zone->lock, flags);
 
+	zone_pcp_enable(zone);
+
 	/* removal success */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
 	zone->present_pages -= nr_pages;
@@ -1631,6 +1625,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 failed_removal_isolated:
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
+failed_removal_pcplists_disabled:
+	zone_pcp_enable(zone);
 failed_removal:
 	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
 		 (unsigned long long) start_pfn << PAGE_SHIFT,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f7108fe9a0b..366c516c9062 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3018,14 +3018,7 @@ static void drain_local_pages_wq(struct work_struct *work)
 	preempt_enable();
 }
 
-/*
- * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
- *
- * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
- */
-void drain_all_pages(struct zone *zone)
+void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 {
 	int cpu;
 
@@ -3064,7 +3057,13 @@ void drain_all_pages(struct zone *zone)
 		struct zone *z;
 		bool has_pcps = false;
 
-		if (zone) {
+		if (force_all_cpus) {
+			/*
+			 * The pcp.count check is racy, some callers need a
+			 * guarantee that no cpu is missed.
+			 */
+			has_pcps = true;
+		} else if (zone) {
 			pcp = per_cpu_ptr(zone->pageset, cpu);
 			if (pcp->pcp.count)
 				has_pcps = true;
@@ -3097,6 +3096,18 @@ void drain_all_pages(struct zone *zone)
 	mutex_unlock(&pcpu_drain_mutex);
 }
 
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
+ *
+ * When zone parameter is non-NULL, spill just the single zone's pages.
+ *
+ * Note that this can be extremely slow as the draining happens in a workqueue.
+ */
+void drain_all_pages(struct zone *zone)
+{
+	__drain_all_pages(zone, false);
+}
+
 #ifdef CONFIG_HIBERNATION
 
 /*
@@ -6296,6 +6307,18 @@ static void pageset_init(struct per_cpu_pageset *p)
 	pcp->batch = BOOT_PAGESET_BATCH;
 }
 
+void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
+		unsigned long batch)
+{
+	struct per_cpu_pageset *p;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		p = per_cpu_ptr(zone->pageset, cpu);
+		pageset_update(&p->pcp, high, batch);
+	}
+}
+
 /*
  * Calculate and set new high and batch values for all per-cpu pagesets of a
  * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
@@ -6303,8 +6326,6 @@ static void pageset_init(struct per_cpu_pageset *p)
 static void zone_set_pageset_high_and_batch(struct zone *zone)
 {
 	unsigned long new_high, new_batch;
-	struct per_cpu_pageset *p;
-	int cpu;
 
 	if (percpu_pagelist_fraction) {
 		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
@@ -6325,10 +6346,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
 		return;
 	}
 
-	for_each_possible_cpu(cpu) {
-		p = per_cpu_ptr(zone->pageset, cpu);
-		pageset_update(&p->pcp, new_high, new_batch);
-	}
+	__zone_set_pageset_high_and_batch(zone, new_high, new_batch);
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
@@ -8723,6 +8741,27 @@ void __meminit zone_pcp_update(struct zone *zone)
 	mutex_unlock(&pcp_batch_high_lock);
 }
 
+/*
+ * Effectively disable pcplists for the zone by setting the high limit to 0
+ * and draining all cpus. A concurrent page freeing on another CPU that's about
+ * to put the page on pcplist will either finish before the drain and the page
+ * will be drained, or observe the new high limit and skip the pcplist.
+ *
+ * Must be paired with a call to zone_pcp_enable().
+ */
+void zone_pcp_disable(struct zone *zone)
+{
+	mutex_lock(&pcp_batch_high_lock);
+	__zone_set_pageset_high_and_batch(zone, 0, 1);
+	__drain_all_pages(zone, true);
+}
+
+void zone_pcp_enable(struct zone *zone)
+{
+	__zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+	mutex_unlock(&pcp_batch_high_lock);
+}
+
 void zone_pcp_reset(struct zone *zone)
 {
 	unsigned long flags;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index feab446d1982..a254e1f370a3 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -174,9 +174,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * A call to drain_all_pages() after isolation can flush most of them. However
  * in some cases pages might still end up on pcp lists and that would allow
  * for their allocation even when they are in fact isolated already. Depending
- * on how strong of a guarantee the caller needs, further drain_all_pages()
- * might be needed (e.g. __offline_pages will need to call it after check for
- * isolated range for a next retry).
+ * on how strong of a guarantee the caller needs, zone_pcp_disable/enable()
+ * might be used to flush and disable pcplist before isolation and enable after
+ * unisolation.
  *
  * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
-- 
2.28.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/7] mm, page_alloc: remove setup_pageset()
  2020-10-08 11:41 ` [PATCH v2 3/7] mm, page_alloc: remove setup_pageset() Vlastimil Babka
@ 2020-10-08 12:23   ` Michal Hocko
  2020-10-08 12:56     ` Vlastimil Babka
  2020-10-22 12:34   ` Oscar Salvador
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2020-10-08 12:23 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Thu 08-10-20 13:41:57, Vlastimil Babka wrote:
> We initialize boot-time pagesets with setup_pageset(), which sets high and
> batch values that effectively disable pcplists.
> 
> We can remove this wrapper if we just set these values for all pagesets in
> pageset_init(). Non-boot pagesets then subsequently update them to the proper
> values.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: David Hildenbrand <david@redhat.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Btw. where do we initialize pcp->count? I thought that pcp allocator
zeroes out the allocated memory but alloc_percpu is GFP_KERNEL like.

> ---
>  mm/page_alloc.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 463f40b12aca..f827b42a2475 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5899,7 +5899,7 @@ static void build_zonelists(pg_data_t *pgdat)
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static void setup_pageset(struct per_cpu_pageset *p);
> +static void pageset_init(struct per_cpu_pageset *p);
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -5967,7 +5967,7 @@ build_all_zonelists_init(void)
>  	 * (a chicken-egg dilemma).
>  	 */
>  	for_each_possible_cpu(cpu)
> -		setup_pageset(&per_cpu(boot_pageset, cpu));
> +		pageset_init(&per_cpu(boot_pageset, cpu));
>  
>  	mminit_verify_zonelist();
>  	cpuset_init_current_mems_allowed();
> @@ -6286,12 +6286,15 @@ static void pageset_init(struct per_cpu_pageset *p)
>  	pcp = &p->pcp;
>  	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
>  		INIT_LIST_HEAD(&pcp->lists[migratetype]);
> -}
>  
> -static void setup_pageset(struct per_cpu_pageset *p)
> -{
> -	pageset_init(p);
> -	pageset_update(&p->pcp, 0, 1);
> +	/*
> +	 * Set batch and high values safe for a boot pageset. A true percpu
> +	 * pageset's initialization will update them subsequently. Here we don't
> +	 * need to be as careful as pageset_update() as nobody can access the
> +	 * pageset yet.
> +	 */
> +	pcp->high = 0;
> +	pcp->batch = 1;
>  }
>  
>  /*
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 5/7] mm, page_alloc: cache pageset high and batch in struct zone
  2020-10-08 11:41 ` [PATCH v2 5/7] mm, page_alloc: cache pageset high and batch in struct zone Vlastimil Babka
@ 2020-10-08 12:31   ` Michal Hocko
  2020-10-08 17:55     ` Vlastimil Babka
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2020-10-08 12:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Thu 08-10-20 13:41:59, Vlastimil Babka wrote:
> All per-cpu pagesets for a zone use the same high and batch values, that are
> duplicated there just for performance (locality) reasons. This patch adds the
> same variables also to struct zone as a shared copy.
> 
> This will be useful later for making possible to disable pcplists temporarily
> by setting high value to 0, while remembering the values for restoring them
> later. But we can also immediately benefit from not updating pagesets of all
> possible cpus in case the newly recalculated values (after sysctl change or
> memory online/offline) are actually unchanged from the previous ones.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

I would consider the check flipped with early return more pleasing to my
eyes but nothing to lose sleep over.

> ---
>  include/linux/mmzone.h |  6 ++++++
>  mm/page_alloc.c        | 17 +++++++++++++++--
>  2 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fb3bf696c05e..c63863794afc 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -470,6 +470,12 @@ struct zone {
>  #endif
>  	struct pglist_data	*zone_pgdat;
>  	struct per_cpu_pageset __percpu *pageset;
> +	/*
> +	 * the high and batch values are copied to individual pagesets for
> +	 * faster access
> +	 */
> +	int pageset_high;
> +	int pageset_batch;
>  
>  #ifndef CONFIG_SPARSEMEM
>  	/*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f33c36312eb5..5b98dd5ab006 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5900,6 +5900,9 @@ static void build_zonelists(pg_data_t *pgdat)
>   * Other parts of the kernel may not check if the zone is available.
>   */
>  static void pageset_init(struct per_cpu_pageset *p);
> +/* These effectively disable the pcplists in the boot pageset completely */
> +#define BOOT_PAGESET_HIGH	0
> +#define BOOT_PAGESET_BATCH	1
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -6289,8 +6292,8 @@ static void pageset_init(struct per_cpu_pageset *p)
>  	 * need to be as careful as pageset_update() as nobody can access the
>  	 * pageset yet.
>  	 */
> -	pcp->high = 0;
> -	pcp->batch = 1;
> +	pcp->high = BOOT_PAGESET_HIGH;
> +	pcp->batch = BOOT_PAGESET_BATCH;
>  }
>  
>  /*
> @@ -6314,6 +6317,14 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>  		new_batch = max(1UL, 1 * new_batch);
>  	}
>  
> +	if (zone->pageset_high != new_high ||
> +	    zone->pageset_batch != new_batch) {
> +		zone->pageset_high = new_high;
> +		zone->pageset_batch = new_batch;
> +	} else {
> +		return;
> +	}
> +
>  	for_each_possible_cpu(cpu) {
>  		p = per_cpu_ptr(zone->pageset, cpu);
>  		pageset_update(&p->pcp, new_high, new_batch);
> @@ -6374,6 +6385,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
>  	 * offset of a (static) per cpu variable into the per cpu area.
>  	 */
>  	zone->pageset = &boot_pageset;
> +	zone->pageset_high = BOOT_PAGESET_HIGH;
> +	zone->pageset_batch = BOOT_PAGESET_BATCH;
>  
>  	if (populated_zone(zone))
>  		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline
  2020-10-08 11:42 ` [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline Vlastimil Babka
@ 2020-10-08 12:45   ` Michal Hocko
  2020-10-08 17:46     ` Vlastimil Babka
  2020-10-22 12:52   ` Oscar Salvador
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2020-10-08 12:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Thu 08-10-20 13:42:01, Vlastimil Babka wrote:
> Memory offline relies on page isolation can race with process freeing pages to
> pcplists in a way that a page from isolated pageblock can end up on pcplist.

"Memory offlining relies on page isolation to guarantee a forward
progress because pages cannot be reused while they are isolated. But the
page isolation itself doesn't prevent from races while freed pages are
stored on pcp lists and thus can be reused.
"

> This can be worked around by repeated draining of pcplists, as done by commit
> 968318261221 ("mm/memory_hotplug: drain per-cpu pages again during memory
> offline").
> 
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation who need stronger guarantees don't need to repeatedly drain.
> David suggested disabling pcplists usage completely during page isolation,
> instead of repeatedly draining them.
> 
> To achieve this without adding special cases in alloc/free fastpath, we can use
> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
> 
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once, before calling start_isolate_page_range(). The draining will serialize
> after processes that already disabled interrupts and read the old value of
> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> happens.
> 
> This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
> page isolation users can call before start_isolate_page_range() and after
> unisolating (or offlining) the isolated pages.
> 
> Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
> not empty. The check can however race with a free to pcplist that has not yet
> increased the pcp->count from 0 to 1. Thus make the drain optionally skip the
> racy check and drain on all cpus, and use this option in zone_pcp_disable().
> 
> As we have to avoid external updates to high and batch while pcplists are
> disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
> zone_pcp_enable(). This also synchronizes multiple users of
> zone_pcp_disable()/enable().
> 
> Currently the only user of this functionality is offline_pages().

Thanks for simplifying the implementation!

> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

Btw. I suspect this functionality might become handy for hwpoisoning as
well. I didn't get around to check the current state of the
implementation but I believe they would appreciate a guanratee to not
free into pcp lists as well. Oscar will surely know better though.

Thanks!
> ---
>  mm/internal.h       |  2 ++
>  mm/memory_hotplug.c | 28 ++++++++----------
>  mm/page_alloc.c     | 69 +++++++++++++++++++++++++++++++++++----------
>  mm/page_isolation.c |  6 ++--
>  4 files changed, 71 insertions(+), 34 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index c43ccdddb0f6..2966496680bc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -201,6 +201,8 @@ extern int user_min_free_kbytes;
>  
>  extern void zone_pcp_update(struct zone *zone);
>  extern void zone_pcp_reset(struct zone *zone);
> +extern void zone_pcp_disable(struct zone *zone);
> +extern void zone_pcp_enable(struct zone *zone);
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2e6ad899c55e..4382b585c76c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1510,17 +1510,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	}
>  	node = zone_to_nid(zone);
>  
> +	/*
> +	 * Disable pcplists so that page isolation cannot race with freeing
> +	 * in a way that pages from isolated pageblock are left on pcplists.
> +	 */
> +	zone_pcp_disable(zone);
> +
>  	/* set above range as isolated */
>  	ret = start_isolate_page_range(start_pfn, end_pfn,
>  				       MIGRATE_MOVABLE,
>  				       MEMORY_OFFLINE | REPORT_FAILURE);
>  	if (ret) {
>  		reason = "failure to isolate range";
> -		goto failed_removal;
> +		goto failed_removal_pcplists_disabled;
>  	}
>  
> -	drain_all_pages(zone);
> -
>  	arg.start_pfn = start_pfn;
>  	arg.nr_pages = nr_pages;
>  	node_states_check_changes_offline(nr_pages, zone, &arg);
> @@ -1570,20 +1574,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  			goto failed_removal_isolated;
>  		}
>  
> -		/*
> -		 * per-cpu pages are drained after start_isolate_page_range, but
> -		 * if there are still pages that are not free, make sure that we
> -		 * drain again, because when we isolated range we might have
> -		 * raced with another thread that was adding pages to pcp list.
> -		 *
> -		 * Forward progress should be still guaranteed because
> -		 * pages on the pcp list can only belong to MOVABLE_ZONE
> -		 * because has_unmovable_pages explicitly checks for
> -		 * PageBuddy on freed pages on other zones.
> -		 */
>  		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> -		if (ret)
> -			drain_all_pages(zone);
> +
>  	} while (ret);
>  
>  	/* Mark all sections offline and remove free pages from the buddy. */
> @@ -1599,6 +1591,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
> +	zone_pcp_enable(zone);
> +
>  	/* removal success */
>  	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
>  	zone->present_pages -= nr_pages;
> @@ -1631,6 +1625,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  failed_removal_isolated:
>  	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>  	memory_notify(MEM_CANCEL_OFFLINE, &arg);
> +failed_removal_pcplists_disabled:
> +	zone_pcp_enable(zone);
>  failed_removal:
>  	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
>  		 (unsigned long long) start_pfn << PAGE_SHIFT,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1f7108fe9a0b..366c516c9062 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3018,14 +3018,7 @@ static void drain_local_pages_wq(struct work_struct *work)
>  	preempt_enable();
>  }
>  
> -/*
> - * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> - *
> - * When zone parameter is non-NULL, spill just the single zone's pages.
> - *
> - * Note that this can be extremely slow as the draining happens in a workqueue.
> - */
> -void drain_all_pages(struct zone *zone)
> +void __drain_all_pages(struct zone *zone, bool force_all_cpus)
>  {
>  	int cpu;
>  
> @@ -3064,7 +3057,13 @@ void drain_all_pages(struct zone *zone)
>  		struct zone *z;
>  		bool has_pcps = false;
>  
> -		if (zone) {
> +		if (force_all_cpus) {
> +			/*
> +			 * The pcp.count check is racy, some callers need a
> +			 * guarantee that no cpu is missed.
> +			 */
> +			has_pcps = true;
> +		} else if (zone) {
>  			pcp = per_cpu_ptr(zone->pageset, cpu);
>  			if (pcp->pcp.count)
>  				has_pcps = true;
> @@ -3097,6 +3096,18 @@ void drain_all_pages(struct zone *zone)
>  	mutex_unlock(&pcpu_drain_mutex);
>  }
>  
> +/*
> + * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> + *
> + * When zone parameter is non-NULL, spill just the single zone's pages.
> + *
> + * Note that this can be extremely slow as the draining happens in a workqueue.
> + */
> +void drain_all_pages(struct zone *zone)
> +{
> +	__drain_all_pages(zone, false);
> +}
> +
>  #ifdef CONFIG_HIBERNATION
>  
>  /*
> @@ -6296,6 +6307,18 @@ static void pageset_init(struct per_cpu_pageset *p)
>  	pcp->batch = BOOT_PAGESET_BATCH;
>  }
>  
> +void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
> +		unsigned long batch)
> +{
> +	struct per_cpu_pageset *p;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		p = per_cpu_ptr(zone->pageset, cpu);
> +		pageset_update(&p->pcp, high, batch);
> +	}
> +}
> +
>  /*
>   * Calculate and set new high and batch values for all per-cpu pagesets of a
>   * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
> @@ -6303,8 +6326,6 @@ static void pageset_init(struct per_cpu_pageset *p)
>  static void zone_set_pageset_high_and_batch(struct zone *zone)
>  {
>  	unsigned long new_high, new_batch;
> -	struct per_cpu_pageset *p;
> -	int cpu;
>  
>  	if (percpu_pagelist_fraction) {
>  		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> @@ -6325,10 +6346,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>  		return;
>  	}
>  
> -	for_each_possible_cpu(cpu) {
> -		p = per_cpu_ptr(zone->pageset, cpu);
> -		pageset_update(&p->pcp, new_high, new_batch);
> -	}
> +	__zone_set_pageset_high_and_batch(zone, new_high, new_batch);
>  }
>  
>  void __meminit setup_zone_pageset(struct zone *zone)
> @@ -8723,6 +8741,27 @@ void __meminit zone_pcp_update(struct zone *zone)
>  	mutex_unlock(&pcp_batch_high_lock);
>  }
>  
> +/*
> + * Effectively disable pcplists for the zone by setting the high limit to 0
> + * and draining all cpus. A concurrent page freeing on another CPU that's about
> + * to put the page on pcplist will either finish before the drain and the page
> + * will be drained, or observe the new high limit and skip the pcplist.
> + *
> + * Must be paired with a call to zone_pcp_enable().
> + */
> +void zone_pcp_disable(struct zone *zone)
> +{
> +	mutex_lock(&pcp_batch_high_lock);
> +	__zone_set_pageset_high_and_batch(zone, 0, 1);
> +	__drain_all_pages(zone, true);
> +}
> +
> +void zone_pcp_enable(struct zone *zone)
> +{
> +	__zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
> +	mutex_unlock(&pcp_batch_high_lock);
> +}
> +
>  void zone_pcp_reset(struct zone *zone)
>  {
>  	unsigned long flags;
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index feab446d1982..a254e1f370a3 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -174,9 +174,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   * A call to drain_all_pages() after isolation can flush most of them. However
>   * in some cases pages might still end up on pcp lists and that would allow
>   * for their allocation even when they are in fact isolated already. Depending
> - * on how strong of a guarantee the caller needs, further drain_all_pages()
> - * might be needed (e.g. __offline_pages will need to call it after check for
> - * isolated range for a next retry).
> + * on how strong of a guarantee the caller needs, zone_pcp_disable/enable()
> + * might be used to flush and disable pcplist before isolation and enable after
> + * unisolation.
>   *
>   * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
>   */
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/7] mm, page_alloc: remove setup_pageset()
  2020-10-08 12:23   ` Michal Hocko
@ 2020-10-08 12:56     ` Vlastimil Babka
  2020-10-08 13:03       ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 12:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On 10/8/20 2:23 PM, Michal Hocko wrote:
> On Thu 08-10-20 13:41:57, Vlastimil Babka wrote:
>> We initialize boot-time pagesets with setup_pageset(), which sets high and
>> batch values that effectively disable pcplists.
>> 
>> We can remove this wrapper if we just set these values for all pagesets in
>> pageset_init(). Non-boot pagesets then subsequently update them to the proper
>> values.
>> 
>> No functional change.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> Btw. where do we initialize pcp->count? I thought that pcp allocator
> zeroes out the allocated memory but alloc_percpu is GFP_KERNEL like.

pageset_init() does:
memset(p, 0, sizeof(*p))



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/7] mm, page_alloc: remove setup_pageset()
  2020-10-08 12:56     ` Vlastimil Babka
@ 2020-10-08 13:03       ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2020-10-08 13:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On Thu 08-10-20 14:56:13, Vlastimil Babka wrote:
> On 10/8/20 2:23 PM, Michal Hocko wrote:
> > On Thu 08-10-20 13:41:57, Vlastimil Babka wrote:
> > > We initialize boot-time pagesets with setup_pageset(), which sets high and
> > > batch values that effectively disable pcplists.
> > > 
> > > We can remove this wrapper if we just set these values for all pagesets in
> > > pageset_init(). Non-boot pagesets then subsequently update them to the proper
> > > values.
> > > 
> > > No functional change.
> > > 
> > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > > Reviewed-by: David Hildenbrand <david@redhat.com>
> > 
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Thanks!
> 
> > Btw. where do we initialize pcp->count? I thought that pcp allocator
> > zeroes out the allocated memory but alloc_percpu is GFP_KERNEL like.
> 
> pageset_init() does:
> memset(p, 0, sizeof(*p))

Ohh, I have missed pcp is embeded into per_cpu_pageset. Using it as s
pointer here confused me. Sorry about the noise.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline
  2020-10-08 12:45   ` Michal Hocko
@ 2020-10-08 17:46     ` Vlastimil Babka
  0 siblings, 0 replies; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 17:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On 10/8/20 2:45 PM, Michal Hocko wrote:
> On Thu 08-10-20 13:42:01, Vlastimil Babka wrote:
>> Memory offline relies on page isolation can race with process freeing pages to
>> pcplists in a way that a page from isolated pageblock can end up on pcplist.
> 
> "Memory offlining relies on page isolation to guarantee a forward
> progress because pages cannot be reused while they are isolated. But the
> page isolation itself doesn't prevent from races while freed pages are
> stored on pcp lists and thus can be reused.
> "

Much better, thanks.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 5/7] mm, page_alloc: cache pageset high and batch in struct zone
  2020-10-08 12:31   ` Michal Hocko
@ 2020-10-08 17:55     ` Vlastimil Babka
  2020-10-22 12:42       ` Oscar Salvador
  0 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2020-10-08 17:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim

On 10/8/20 2:31 PM, Michal Hocko wrote:
> On Thu 08-10-20 13:41:59, Vlastimil Babka wrote:
>> All per-cpu pagesets for a zone use the same high and batch values, that are
>> duplicated there just for performance (locality) reasons. This patch adds the
>> same variables also to struct zone as a shared copy.
>> 
>> This will be useful later for making possible to disable pcplists temporarily
>> by setting high value to 0, while remembering the values for restoring them
>> later. But we can also immediately benefit from not updating pagesets of all
>> possible cpus in case the newly recalculated values (after sysctl change or
>> memory online/offline) are actually unchanged from the previous ones.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> I would consider the check flipped with early return more pleasing to my
> eyes but nothing to lose sleep over.

Right, here's updated patch:

----8<----
 From 6ab0f03762d122a896349d5e568f75c20875eb42 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Mon, 7 Sep 2020 14:20:08 +0200
Subject: [PATCH v2 5/7] mm, page_alloc: cache pageset high and batch in struct
  zone

All per-cpu pagesets for a zone use the same high and batch values, that are
duplicated there just for performance (locality) reasons. This patch adds the
same variables also to struct zone as a shared copy.

This will be useful later for making possible to disable pcplists temporarily
by setting high value to 0, while remembering the values for restoring them
later. But we can also immediately benefit from not updating pagesets of all
possible cpus in case the newly recalculated values (after sysctl change or
memory online/offline) are actually unchanged from the previous ones.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
  include/linux/mmzone.h |  6 ++++++
  mm/page_alloc.c        | 16 ++++++++++++++--
  2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..c63863794afc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -470,6 +470,12 @@ struct zone {
  #endif
  	struct pglist_data	*zone_pgdat;
  	struct per_cpu_pageset __percpu *pageset;
+	/*
+	 * the high and batch values are copied to individual pagesets for
+	 * faster access
+	 */
+	int pageset_high;
+	int pageset_batch;

  #ifndef CONFIG_SPARSEMEM
  	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f33c36312eb5..057baefba8f3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5900,6 +5900,9 @@ static void build_zonelists(pg_data_t *pgdat)
   * Other parts of the kernel may not check if the zone is available.
   */
  static void pageset_init(struct per_cpu_pageset *p);
+/* These effectively disable the pcplists in the boot pageset completely */
+#define BOOT_PAGESET_HIGH	0
+#define BOOT_PAGESET_BATCH	1
  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);

@@ -6289,8 +6292,8 @@ static void pageset_init(struct per_cpu_pageset *p)
  	 * need to be as careful as pageset_update() as nobody can access the
  	 * pageset yet.
  	 */
-	pcp->high = 0;
-	pcp->batch = 1;
+	pcp->high = BOOT_PAGESET_HIGH;
+	pcp->batch = BOOT_PAGESET_BATCH;
  }

  /*
@@ -6314,6 +6317,13 @@ static void zone_set_pageset_high_and_batch(struct zone 
*zone)
  		new_batch = max(1UL, 1 * new_batch);
  	}

+	if (zone->pageset_high == new_high &&
+	    zone->pageset_batch == new_batch)
+		return;
+
+	zone->pageset_high = new_high;
+	zone->pageset_batch = new_batch;
+
  	for_each_possible_cpu(cpu) {
  		p = per_cpu_ptr(zone->pageset, cpu);
  		pageset_update(&p->pcp, new_high, new_batch);
@@ -6374,6 +6384,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
  	 * offset of a (static) per cpu variable into the per cpu area.
  	 */
  	zone->pageset = &boot_pageset;
+	zone->pageset_high = BOOT_PAGESET_HIGH;
+	zone->pageset_batch = BOOT_PAGESET_BATCH;

  	if (populated_zone(zone))
  		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
-- 
2.28.0





^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/7] mm, page_alloc: remove setup_pageset()
  2020-10-08 11:41 ` [PATCH v2 3/7] mm, page_alloc: remove setup_pageset() Vlastimil Babka
  2020-10-08 12:23   ` Michal Hocko
@ 2020-10-22 12:34   ` Oscar Salvador
  1 sibling, 0 replies; 21+ messages in thread
From: Oscar Salvador @ 2020-10-22 12:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Michal Hocko, Pavel Tatashin,
	David Hildenbrand, Joonsoo Kim

On Thu, Oct 08, 2020 at 01:41:57PM +0200, Vlastimil Babka wrote:
> We initialize boot-time pagesets with setup_pageset(), which sets high and
> batch values that effectively disable pcplists.
> 
> We can remove this wrapper if we just set these values for all pagesets in
> pageset_init(). Non-boot pagesets then subsequently update them to the proper
> values.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: David Hildenbrand <david@redhat.com>

I think I already reviewed this one, but:

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/7] mm, page_alloc: simplify pageset_update()
  2020-10-08 11:41 ` [PATCH v2 4/7] mm, page_alloc: simplify pageset_update() Vlastimil Babka
@ 2020-10-22 12:39   ` Oscar Salvador
  0 siblings, 0 replies; 21+ messages in thread
From: Oscar Salvador @ 2020-10-22 12:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Michal Hocko, Pavel Tatashin,
	David Hildenbrand, Joonsoo Kim, Michal Hocko

On Thu, Oct 08, 2020 at 01:41:58PM +0200, Vlastimil Babka wrote:
> pageset_update() attempts to update pcplist's high and batch values in a way
> that readers don't observe batch > high. It uses smp_wmb() to order the updates
> in a way to achieve this. However, without proper pairing read barriers in
> readers this guarantee doesn't hold, and there are no such barriers in
> e.g. free_unref_page_commit().
> 
> Commit 88e8ac11d2ea ("mm, page_alloc: fix core hung in free_pcppages_bulk()")
> already showed this is problematic, and solved this by ultimately only trusing
> pcp->count of the current cpu with interrupts disabled.
> 
> The update dance with unpaired write barriers thus makes no sense. Replace
> them with plain WRITE_ONCE to prevent store tearing, and document that the
> values can change asynchronously and should not be trusted for correctness.
> 
> All current readers appear to be OK after 88e8ac11d2ea. Convert them to
> READ_ONCE to prevent unnecessary read tearing, but mainly to alert anybody
> making future changes to the code that special care is needed.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Yeah, I never got my head around those smp_wmb()

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 5/7] mm, page_alloc: cache pageset high and batch in struct zone
  2020-10-08 17:55     ` Vlastimil Babka
@ 2020-10-22 12:42       ` Oscar Salvador
  0 siblings, 0 replies; 21+ messages in thread
From: Oscar Salvador @ 2020-10-22 12:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, linux-mm, linux-kernel, Pavel Tatashin,
	David Hildenbrand, Joonsoo Kim

On Thu, Oct 08, 2020 at 07:55:02PM +0200, Vlastimil Babka wrote:
> Right, here's updated patch:
> 
> ----8<----
> From 6ab0f03762d122a896349d5e568f75c20875eb42 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Mon, 7 Sep 2020 14:20:08 +0200
> Subject: [PATCH v2 5/7] mm, page_alloc: cache pageset high and batch in struct
>  zone
> 
> All per-cpu pagesets for a zone use the same high and batch values, that are
> duplicated there just for performance (locality) reasons. This patch adds the
> same variables also to struct zone as a shared copy.
> 
> This will be useful later for making possible to disable pcplists temporarily
> by setting high value to 0, while remembering the values for restoring them
> later. But we can also immediately benefit from not updating pagesets of all
> possible cpus in case the newly recalculated values (after sysctl change or
> memory online/offline) are actually unchanged from the previous ones.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>

LGTM,

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 6/7] mm, page_alloc: move draining pcplists to page isolation users
  2020-10-08 11:42 ` [PATCH v2 6/7] mm, page_alloc: move draining pcplists to page isolation users Vlastimil Babka
@ 2020-10-22 12:44   ` Oscar Salvador
  0 siblings, 0 replies; 21+ messages in thread
From: Oscar Salvador @ 2020-10-22 12:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Michal Hocko, Pavel Tatashin,
	David Hildenbrand, Joonsoo Kim, Michal Hocko

On Thu, Oct 08, 2020 at 01:42:00PM +0200, Vlastimil Babka wrote:
> Currently, pcplists are drained during set_migratetype_isolate() which means
> once per pageblock processed start_isolate_page_range(). This is somewhat
> wasteful. Moreover, the callers might need different guarantees, and the
> draining is currently prone to races and does not guarantee that no page
> from isolated pageblock will end up on the pcplist after the drain.
> 
> Better guarantees are added by later patches and require explicit actions
> by page isolation users that need them. Thus it makes sense to move the
> current imperfect draining to the callers also as a preparation step.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline
  2020-10-08 11:42 ` [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline Vlastimil Babka
  2020-10-08 12:45   ` Michal Hocko
@ 2020-10-22 12:52   ` Oscar Salvador
  1 sibling, 0 replies; 21+ messages in thread
From: Oscar Salvador @ 2020-10-22 12:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Michal Hocko, Pavel Tatashin,
	David Hildenbrand, Joonsoo Kim, Michal Hocko

On Thu, Oct 08, 2020 at 01:42:01PM +0200, Vlastimil Babka wrote:
> Memory offline relies on page isolation can race with process freeing pages to
> pcplists in a way that a page from isolated pageblock can end up on pcplist.
> This can be worked around by repeated draining of pcplists, as done by commit
> 968318261221 ("mm/memory_hotplug: drain per-cpu pages again during memory
> offline").
> 
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation who need stronger guarantees don't need to repeatedly drain.
> David suggested disabling pcplists usage completely during page isolation,
> instead of repeatedly draining them.
> 
> To achieve this without adding special cases in alloc/free fastpath, we can use
> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
> 
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once, before calling start_isolate_page_range(). The draining will serialize
> after processes that already disabled interrupts and read the old value of
> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> happens.
> 
> This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
> page isolation users can call before start_isolate_page_range() and after
> unisolating (or offlining) the isolated pages.
> 
> Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
> not empty. The check can however race with a free to pcplist that has not yet
> increased the pcp->count from 0 to 1. Thus make the drain optionally skip the
> racy check and drain on all cpus, and use this option in zone_pcp_disable().
> 
> As we have to avoid external updates to high and batch while pcplists are
> disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
> zone_pcp_enable(). This also synchronizes multiple users of
> zone_pcp_disable()/enable().
> 
> Currently the only user of this functionality is offline_pages().
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I definitely like this one, and the implemantion looks smoth.
As Michal said in another thread, Hwposion code will also benefit from this,
since now we have a drain_all_pages dance that might be suboptimal and not
accurate.
I will get back to that once this gets merged.

Reviewed-by: Oscar Salvador <osalvador@suse.de>

Thanks!

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/7] mm, page_alloc: clean up pageset high and batch update
  2020-10-08 11:41 ` [PATCH v2 1/7] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
@ 2020-10-25 14:17   ` Mike Rapoport
  0 siblings, 0 replies; 21+ messages in thread
From: Mike Rapoport @ 2020-10-25 14:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Michal Hocko, Pavel Tatashin,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, Michal Hocko

On Thu, Oct 08, 2020 at 01:41:55PM +0200, Vlastimil Babka wrote:
> The updates to pcplists' high and batch valued are handled by multiple

Nit:                                     ^ values

> functions that make the calculations hard to follow. Consolidate everything
> to pageset_set_high_and_batch() and remove pageset_set_batch() and
> pageset_set_high() wrappers.
> 
> The only special case using one of the removed wrappers was:
> build_all_zonelists_init()
>   setup_pageset()
>     pageset_set_batch()
> which was hardcoding batch as 0, so we can just open-code a call to
> pageset_update() with constant parameters instead.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 49 ++++++++++++++++++++-----------------------------
>  1 file changed, 20 insertions(+), 29 deletions(-)
 
--
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-10-25 14:17 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-08 11:41 [PATCH v2 0/7] disable pcplists during memory offline Vlastimil Babka
2020-10-08 11:41 ` [PATCH v2 1/7] mm, page_alloc: clean up pageset high and batch update Vlastimil Babka
2020-10-25 14:17   ` Mike Rapoport
2020-10-08 11:41 ` [PATCH v2 2/7] mm, page_alloc: calculate pageset high and batch once per zone Vlastimil Babka
2020-10-08 11:41 ` [PATCH v2 3/7] mm, page_alloc: remove setup_pageset() Vlastimil Babka
2020-10-08 12:23   ` Michal Hocko
2020-10-08 12:56     ` Vlastimil Babka
2020-10-08 13:03       ` Michal Hocko
2020-10-22 12:34   ` Oscar Salvador
2020-10-08 11:41 ` [PATCH v2 4/7] mm, page_alloc: simplify pageset_update() Vlastimil Babka
2020-10-22 12:39   ` Oscar Salvador
2020-10-08 11:41 ` [PATCH v2 5/7] mm, page_alloc: cache pageset high and batch in struct zone Vlastimil Babka
2020-10-08 12:31   ` Michal Hocko
2020-10-08 17:55     ` Vlastimil Babka
2020-10-22 12:42       ` Oscar Salvador
2020-10-08 11:42 ` [PATCH v2 6/7] mm, page_alloc: move draining pcplists to page isolation users Vlastimil Babka
2020-10-22 12:44   ` Oscar Salvador
2020-10-08 11:42 ` [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline Vlastimil Babka
2020-10-08 12:45   ` Michal Hocko
2020-10-08 17:46     ` Vlastimil Babka
2020-10-22 12:52   ` Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).