linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
@ 2021-05-25  8:01 Mel Gorman
  2021-05-25  8:01 ` [PATCH 1/6] mm/page_alloc: Delete vm.percpu_pagelist_fraction Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Mel Gorman @ 2021-05-25  8:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Vlastimil Babka, Michal Hocko, LKML,
	Linux-MM, Mel Gorman

Changelog since v1
o Clarification comments
o Sanity check pcp->high during reclaim				(dhansen)
o Handle vm.percpu_pagelist_high_fraction in zone_highsize	(hdanton)
o Sanity check pcp->batch versus pcp->high

This series has pre-requisites in mmotm so for convenience it is also
available at

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-pcpburst-v2r3

The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock but the sizing of batch and high is archaic and neither takes the zone
size into account or the number of CPUs local to a zone. With larger zones
and more CPUs per node, the contention is getting worse. Furthermore,
the fact that vm.percpu_pagelist_fraction adjusts both batch and high
values means that the sysctl can reduce zone lock contention but also
increase allocation latencies.

This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone with limited impact to
reclaim and accounting for active CPUs but leaves pcp->batch static.
It also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.

The motivation is partially to adjust to larger memory sizes but
is also driven by the fact that large batches of page freeing via
release_pages() often shows zone contention as a major part of the
problem. Another is a bug report based on an older kernel where a
multi-terabyte process can takes several minutes to exit. A workaround
was to use vm.percpu_pagelist_fraction to increase the pcp->high value
but testing indicated that a production workload could not use the same
values because of an increase in allocation latencies. Unfortunately,
I cannot reproduce this test case myself as the multi-terabyte machines
are in active use but it should alleviate the problem.

The series aims to address both and partially acts as a pre-requisite. pcp
only works with order-0 which is useless for SLUB (when using high orders)
and THP (unconditionally). To store high-order pages on PCP, the pcp->high
values need to be increased first.

 Documentation/admin-guide/sysctl/vm.rst |  29 ++--
 include/linux/cpuhotplug.h              |   2 +-
 include/linux/mmzone.h                  |   8 +-
 kernel/sysctl.c                         |   8 +-
 mm/internal.h                           |   2 +-
 mm/memory_hotplug.c                     |   4 +-
 mm/page_alloc.c                         | 196 ++++++++++++++++++------
 mm/vmscan.c                             |  35 +++++
 8 files changed, 212 insertions(+), 72 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/6] mm/page_alloc: Delete vm.percpu_pagelist_fraction
  2021-05-25  8:01 [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Mel Gorman
@ 2021-05-25  8:01 ` Mel Gorman
  2021-05-26 17:41   ` Vlastimil Babka
  2021-05-25  8:01 ` [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch Mel Gorman
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-25  8:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Vlastimil Babka, Michal Hocko, LKML,
	Linux-MM, Mel Gorman

The vm.percpu_pagelist_fraction is used to increase the batch and high
limits for the per-cpu page allocator (PCP). The intent behind the sysctl
is to reduce zone lock acquisition when allocating/freeing pages but it has
a problem. While it can decrease contention, it can also increase latency
on the allocation side due to unreasonably large batch sizes. This leads
to games where an administrator adjusts percpu_pagelist_fraction on the
fly to work around contention and allocation latency problems.

This series aims to alleviate the problems with zone lock contention while
avoiding the allocation-side latency problems. For the purposes of review,
it's easier to remove this sysctl now and reintroduce a similar sysctl
later in the series that deals only with pcp->high.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 19 ---------
 include/linux/mmzone.h                  |  3 --
 kernel/sysctl.c                         |  8 ----
 mm/page_alloc.c                         | 55 ++-----------------------
 4 files changed, 4 insertions(+), 81 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 586cd4b86428..2fcafccb53a8 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -64,7 +64,6 @@ files can be found in mm/swap.c.
 - overcommit_ratio
 - page-cluster
 - panic_on_oom
-- percpu_pagelist_fraction
 - stat_interval
 - stat_refresh
 - numa_stat
@@ -790,24 +789,6 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
-percpu_pagelist_fraction
-========================
-
-This is the fraction of pages at most (high mark pcp->high) in each zone that
-are allocated for each per cpu page list.  The min value for this is 8.  It
-means that we don't allow more than 1/8th of pages in each zone to be
-allocated in any single per_cpu_pagelist.  This entry only changes the value
-of hot per cpu pagelists.  User can specify a number like 100 to allocate
-1/100th of each zone to each per cpu page list.
-
-The batch value of each per cpu pagelist is also updated as a result.  It is
-set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)
-
-The initial value is zero.  Kernel does not use this value at boot time to set
-the high water marks for each per cpu page list.  If the user writes '0' to this
-sysctl, it will revert to this default behavior.
-
-
 stat_interval
 =============
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d7740c97b87e..b449151745d7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1038,15 +1038,12 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void *,
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void *,
 		size_t *, loff_t *);
-int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
-		void *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 		void *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 		void *, size_t *, loff_t *);
 int numa_zonelist_order_handler(struct ctl_table *, int,
 		void *, size_t *, loff_t *);
-extern int percpu_pagelist_fraction;
 extern char numa_zonelist_order[];
 #define NUMA_ZONELIST_ORDER_LEN	16
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 14edf84cc571..4e5ac50a1af0 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2889,14 +2889,6 @@ static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ONE,
 		.extra2		= &one_thousand,
 	},
-	{
-		.procname	= "percpu_pagelist_fraction",
-		.data		= &percpu_pagelist_fraction,
-		.maxlen		= sizeof(percpu_pagelist_fraction),
-		.mode		= 0644,
-		.proc_handler	= percpu_pagelist_fraction_sysctl_handler,
-		.extra1		= SYSCTL_ZERO,
-	},
 	{
 		.procname	= "page_lock_unfairness",
 		.data		= &sysctl_page_lock_unfairness,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff8f706839ea..a48f305f0381 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -120,7 +120,6 @@ typedef int __bitwise fpi_t;
 
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
-#define MIN_PERCPU_PAGELIST_FRACTION	(8)
 
 struct pagesets {
 	local_lock_t lock;
@@ -182,7 +181,6 @@ EXPORT_SYMBOL(_totalram_pages);
 unsigned long totalreserve_pages __read_mostly;
 unsigned long totalcma_pages __read_mostly;
 
-int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
 EXPORT_SYMBOL(init_on_alloc);
@@ -6696,22 +6694,15 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
 
 /*
  * Calculate and set new high and batch values for all per-cpu pagesets of a
- * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
+ * zone based on the zone's size.
  */
 static void zone_set_pageset_high_and_batch(struct zone *zone)
 {
 	unsigned long new_high, new_batch;
 
-	if (percpu_pagelist_fraction) {
-		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
-		new_batch = max(1UL, new_high / 4);
-		if ((new_high / 4) > (PAGE_SHIFT * 8))
-			new_batch = PAGE_SHIFT * 8;
-	} else {
-		new_batch = zone_batchsize(zone);
-		new_high = 6 * new_batch;
-		new_batch = max(1UL, 1 * new_batch);
-	}
+	new_batch = zone_batchsize(zone);
+	new_high = 6 * new_batch;
+	new_batch = max(1UL, 1 * new_batch);
 
 	if (zone->pageset_high == new_high &&
 	    zone->pageset_batch == new_batch)
@@ -8377,44 +8368,6 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
-/*
- * percpu_pagelist_fraction - changes the pcp->high for each zone on each
- * cpu.  It is the fraction of total pages in each zone that a hot per cpu
- * pagelist can have before it gets flushed back to buddy allocator.
- */
-int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
-		void *buffer, size_t *length, loff_t *ppos)
-{
-	struct zone *zone;
-	int old_percpu_pagelist_fraction;
-	int ret;
-
-	mutex_lock(&pcp_batch_high_lock);
-	old_percpu_pagelist_fraction = percpu_pagelist_fraction;
-
-	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
-	if (!write || ret < 0)
-		goto out;
-
-	/* Sanity checking to avoid pcp imbalance */
-	if (percpu_pagelist_fraction &&
-	    percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION) {
-		percpu_pagelist_fraction = old_percpu_pagelist_fraction;
-		ret = -EINVAL;
-		goto out;
-	}
-
-	/* No change? */
-	if (percpu_pagelist_fraction == old_percpu_pagelist_fraction)
-		goto out;
-
-	for_each_populated_zone(zone)
-		zone_set_pageset_high_and_batch(zone);
-out:
-	mutex_unlock(&pcp_batch_high_lock);
-	return ret;
-}
-
 #ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES
 /*
  * Returns the number of pages that arch has reserved but
-- 
2.26.2


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch
  2021-05-25  8:01 [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Mel Gorman
  2021-05-25  8:01 ` [PATCH 1/6] mm/page_alloc: Delete vm.percpu_pagelist_fraction Mel Gorman
@ 2021-05-25  8:01 ` Mel Gorman
  2021-05-26 18:14   ` Vlastimil Babka
  2021-05-25  8:01 ` [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events Mel Gorman
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-25  8:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Vlastimil Babka, Michal Hocko, LKML,
	Linux-MM, Mel Gorman

The pcp high watermark is based on the batch size but there is no
relationship between them other than it is convenient to use early in
boot.

This patch takes the first step and bases pcp->high on the zone low
watermark split across the number of CPUs local to a zone while the batch
size remains the same to avoid increasing allocation latencies. The intent
behind the default pcp->high is "set the number of PCP pages such that
if they are all full that background reclaim is not started prematurely".

Note that in this patch the pcp->high values are adjusted after memory
hotplug events, min_free_kbytes adjustments and watermark scale factor
adjustments but not CPU hotplug events which is handled later in the
series.

On a test KVM instance;

Before grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  378
              batch: 63

After grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 60 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 18 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a48f305f0381..c0536e5d088a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void)
 	/* Block until all are initialised */
 	wait_for_completion(&pgdat_init_all_done_comp);
 
-	/*
-	 * The number of managed pages has changed due to the initialisation
-	 * so the pcpu batch and high limits needs to be updated or the limits
-	 * will be artificially small.
-	 */
-	for_each_populated_zone(zone)
-		zone_pcp_update(zone);
-
 	/*
 	 * We initialized the rest of the deferred pages.  Permanently disable
 	 * on-demand struct page initialization.
@@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone)
 	int batch;
 
 	/*
-	 * The per-cpu-pages pools are set to around 1000th of the
-	 * size of the zone.
+	 * The number of pages to batch allocate is either ~0.1%
+	 * of the zone or 1MB, whichever is smaller. The batch
+	 * size is striking a balance between allocation latency
+	 * and zone lock contention.
 	 */
-	batch = zone_managed_pages(zone) / 1024;
-	/* But no more than a meg. */
-	if (batch * PAGE_SIZE > 1024 * 1024)
-		batch = (1024 * 1024) / PAGE_SIZE;
+	batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
 	batch /= 4;		/* We effectively *= 4 below */
 	if (batch < 1)
 		batch = 1;
@@ -6637,6 +6628,34 @@ static int zone_batchsize(struct zone *zone)
 #endif
 }
 
+static int zone_highsize(struct zone *zone, int batch)
+{
+#ifdef CONFIG_MMU
+	int high;
+	int nr_local_cpus;
+
+	/*
+	 * The high value of the pcp is based on the zone low watermark
+	 * so that if they are full then background reclaim will not be
+	 * started prematurely. The value is split across all online CPUs
+	 * local to the zone. Note that early in boot that CPUs may not be
+	 * online yet.
+	 */
+	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
+	high = low_wmark_pages(zone) / nr_local_cpus;
+
+	/*
+	 * Ensure high is at least batch*4. The multiple is based on the
+	 * historical relationship between high and batch.
+	 */
+	high = max(high, batch << 2);
+
+	return high;
+#else
+	return 0;
+#endif
+}
+
 /*
  * pcp->high and pcp->batch values are related and generally batch is lower
  * than high. They are also related to pcp->count such that count is lower
@@ -6698,11 +6717,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
  */
 static void zone_set_pageset_high_and_batch(struct zone *zone)
 {
-	unsigned long new_high, new_batch;
+	int new_high, new_batch;
 
-	new_batch = zone_batchsize(zone);
-	new_high = 6 * new_batch;
-	new_batch = max(1UL, 1 * new_batch);
+	new_batch = max(1, zone_batchsize(zone));
+	new_high = zone_highsize(zone, new_batch);
 
 	if (zone->pageset_high == new_high &&
 	    zone->pageset_batch == new_batch)
@@ -8170,6 +8188,12 @@ static void __setup_per_zone_wmarks(void)
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
+		/*
+		 * The watermark size have changed so update the pcpu batch
+		 * and high limits or the limits may be inappropriate.
+		 */
+		zone_set_pageset_high_and_batch(zone);
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events
  2021-05-25  8:01 [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Mel Gorman
  2021-05-25  8:01 ` [PATCH 1/6] mm/page_alloc: Delete vm.percpu_pagelist_fraction Mel Gorman
  2021-05-25  8:01 ` [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch Mel Gorman
@ 2021-05-25  8:01 ` Mel Gorman
  2021-05-28 11:08   ` Vlastimil Babka
  2021-05-25  8:01 ` [PATCH 4/6] mm/page_alloc: Scale the number of pages that are batch freed Mel Gorman
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-25  8:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Vlastimil Babka, Michal Hocko, LKML,
	Linux-MM, Mel Gorman

The PCP high watermark is based on the number of online CPUs so the
watermarks must be adjusted during CPU hotplug. At the time of
hot-remove, the number of online CPUs is already adjusted but during
hot-add, a delta needs to be applied to update PCP to the correct
value. After this patch is applied, the high watermarks are adjusted
correctly.

  # grep high: /proc/zoneinfo  | tail -1
              high:  649
  # echo 0 > /sys/devices/system/cpu/cpu4/online
  # grep high: /proc/zoneinfo  | tail -1
              high:  664
  # echo 1 > /sys/devices/system/cpu/cpu4/online
  # grep high: /proc/zoneinfo  | tail -1
              high:  649

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/cpuhotplug.h |  2 +-
 mm/internal.h              |  2 +-
 mm/memory_hotplug.c        |  4 ++--
 mm/page_alloc.c            | 38 +++++++++++++++++++++++++++-----------
 4 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 4a62b3980642..47e13582d9fc 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -54,7 +54,7 @@ enum cpuhp_state {
 	CPUHP_MM_MEMCQ_DEAD,
 	CPUHP_PERCPU_CNT_DEAD,
 	CPUHP_RADIX_DEAD,
-	CPUHP_PAGE_ALLOC_DEAD,
+	CPUHP_PAGE_ALLOC,
 	CPUHP_NET_DEV_DEAD,
 	CPUHP_PCI_XGENE_DEAD,
 	CPUHP_IOMMU_IOVA_DEAD,
diff --git a/mm/internal.h b/mm/internal.h
index 54bd0dc2c23c..651250e59ef5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -221,7 +221,7 @@ extern int user_min_free_kbytes;
 extern void free_unref_page(struct page *page);
 extern void free_unref_page_list(struct list_head *list);
 
-extern void zone_pcp_update(struct zone *zone);
+extern void zone_pcp_update(struct zone *zone, int cpu_online);
 extern void zone_pcp_reset(struct zone *zone);
 extern void zone_pcp_disable(struct zone *zone);
 extern void zone_pcp_enable(struct zone *zone);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70620d0dd923..bebb3cead810 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -961,7 +961,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *z
 	node_states_set_node(nid, &arg);
 	if (need_zonelists_rebuild)
 		build_all_zonelists(NULL);
-	zone_pcp_update(zone);
+	zone_pcp_update(zone, 0);
 
 	/* Basic onlining is complete, allow allocation of onlined pages. */
 	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
@@ -1835,7 +1835,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		zone_pcp_reset(zone);
 		build_all_zonelists(NULL);
 	} else
-		zone_pcp_update(zone);
+		zone_pcp_update(zone, 0);
 
 	node_states_clear_node(node, &arg);
 	if (arg.status_change_nid >= 0) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0536e5d088a..dc4ac309bc21 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6628,7 +6628,7 @@ static int zone_batchsize(struct zone *zone)
 #endif
 }
 
-static int zone_highsize(struct zone *zone, int batch)
+static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 {
 #ifdef CONFIG_MMU
 	int high;
@@ -6639,9 +6639,10 @@ static int zone_highsize(struct zone *zone, int batch)
 	 * so that if they are full then background reclaim will not be
 	 * started prematurely. The value is split across all online CPUs
 	 * local to the zone. Note that early in boot that CPUs may not be
-	 * online yet.
+	 * online yet and that during CPU hotplug that the cpumask is not
+	 * yet updated when a CPU is being onlined.
 	 */
-	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
+	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
 	high = low_wmark_pages(zone) / nr_local_cpus;
 
 	/*
@@ -6715,12 +6716,12 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
  * Calculate and set new high and batch values for all per-cpu pagesets of a
  * zone based on the zone's size.
  */
-static void zone_set_pageset_high_and_batch(struct zone *zone)
+static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
 {
 	int new_high, new_batch;
 
 	new_batch = max(1, zone_batchsize(zone));
-	new_high = zone_highsize(zone, new_batch);
+	new_high = zone_highsize(zone, new_batch, cpu_online);
 
 	if (zone->pageset_high == new_high &&
 	    zone->pageset_batch == new_batch)
@@ -6750,7 +6751,7 @@ void __meminit setup_zone_pageset(struct zone *zone)
 		per_cpu_pages_init(pcp, pzstats);
 	}
 
-	zone_set_pageset_high_and_batch(zone);
+	zone_set_pageset_high_and_batch(zone, 0);
 }
 
 /*
@@ -8008,6 +8009,7 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)
 
 static int page_alloc_cpu_dead(unsigned int cpu)
 {
+	struct zone *zone;
 
 	lru_add_drain_cpu(cpu);
 	drain_pages(cpu);
@@ -8028,6 +8030,19 @@ static int page_alloc_cpu_dead(unsigned int cpu)
 	 * race with what we are doing.
 	 */
 	cpu_vm_stats_fold(cpu);
+
+	for_each_populated_zone(zone)
+		zone_pcp_update(zone, 0);
+
+	return 0;
+}
+
+static int page_alloc_cpu_online(unsigned int cpu)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		zone_pcp_update(zone, 1);
 	return 0;
 }
 
@@ -8053,8 +8068,9 @@ void __init page_alloc_init(void)
 		hashdist = 0;
 #endif
 
-	ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC_DEAD,
-					"mm/page_alloc:dead", NULL,
+	ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC,
+					"mm/page_alloc:pcp",
+					page_alloc_cpu_online,
 					page_alloc_cpu_dead);
 	WARN_ON(ret < 0);
 }
@@ -8192,7 +8208,7 @@ static void __setup_per_zone_wmarks(void)
 		 * The watermark size have changed so update the pcpu batch
 		 * and high limits or the limits may be inappropriate.
 		 */
-		zone_set_pageset_high_and_batch(zone);
+		zone_set_pageset_high_and_batch(zone, 0);
 
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -9014,10 +9030,10 @@ EXPORT_SYMBOL(free_contig_range);
  * The zone indicated has a new number of managed_pages; batch sizes and percpu
  * page high values need to be recalculated.
  */
-void __meminit zone_pcp_update(struct zone *zone)
+void zone_pcp_update(struct zone *zone, int cpu_online)
 {
 	mutex_lock(&pcp_batch_high_lock);
-	zone_set_pageset_high_and_batch(zone);
+	zone_set_pageset_high_and_batch(zone, cpu_online);
 	mutex_unlock(&pcp_batch_high_lock);
 }
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 4/6] mm/page_alloc: Scale the number of pages that are batch freed
  2021-05-25  8:01 [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Mel Gorman
                   ` (2 preceding siblings ...)
  2021-05-25  8:01 ` [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events Mel Gorman
@ 2021-05-25  8:01 ` Mel Gorman
  2021-05-28 11:19   ` Vlastimil Babka
  2021-05-25  8:01 ` [PATCH 5/6] mm/page_alloc: Limit the number of pages on PCP lists when reclaim is active Mel Gorman
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-25  8:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Vlastimil Babka, Michal Hocko, LKML,
	Linux-MM, Mel Gorman

When a task is freeing a large number of order-0 pages, it may acquire
the zone->lock multiple times freeing pages in batches. This may
unnecessarily contend on the zone lock when freeing very large number
of pages. This patch adapts the size of the batch based on the recent
pattern to scale the batch size for subsequent frees.

As the machines I used were not large enough to test this are not large
enough to illustrate a problem, a debugging patch shows patterns like
the following (slightly editted for clarity)

Baseline vanilla kernel
  time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
  time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
  time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
  time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
  time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378

With patches
  time-unmap-7724    [...] free_pcppages_bulk: free  126 count  814 high  814
  time-unmap-7724    [...] free_pcppages_bulk: free  252 count  814 high  814
  time-unmap-7724    [...] free_pcppages_bulk: free  504 count  814 high  814
  time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
  time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/linux/mmzone.h |  3 ++-
 mm/page_alloc.c        | 41 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b449151745d7..92182e0299b2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -343,8 +343,9 @@ struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
+	short free_factor;	/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
-	int expire;		/* When 0, remote pagesets are drained */
+	short expire;		/* When 0, remote pagesets are drained */
 #endif
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dc4ac309bc21..89e60005dd27 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3267,18 +3267,47 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn)
 	return true;
 }
 
+static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
+{
+	int min_nr_free, max_nr_free;
+
+	/* Check for PCP disabled or boot pageset */
+	if (unlikely(high < batch))
+		return 1;
+
+	/* Leave at least pcp->batch pages on the list */
+	min_nr_free = batch;
+	max_nr_free = high - batch;
+
+	/*
+	 * Double the number of pages freed each time there is subsequent
+	 * freeing of pages without any allocation.
+	 */
+	batch <<= pcp->free_factor;
+	if (batch < max_nr_free)
+		pcp->free_factor++;
+	batch = clamp(batch, min_nr_free, max_nr_free);
+
+	return batch;
+}
+
 static void free_unref_page_commit(struct page *page, unsigned long pfn,
 				   int migratetype)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
+	int high;
 
 	__count_vm_event(PGFREE);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list_add(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
-	if (pcp->count >= READ_ONCE(pcp->high))
-		free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
+	high = READ_ONCE(pcp->high);
+	if (pcp->count >= high) {
+		int batch = READ_ONCE(pcp->batch);
+
+		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp);
+	}
 }
 
 /*
@@ -3530,7 +3559,14 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	unsigned long flags;
 
 	local_lock_irqsave(&pagesets.lock, flags);
+
+	/*
+	 * On allocation, reduce the number of pages that are batch freed.
+	 * See nr_pcp_free() where free_factor is increased for subsequent
+	 * frees.
+	 */
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
+	pcp->free_factor >>= 1;
 	list = &pcp->lists[migratetype];
 	page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
 	local_unlock_irqrestore(&pagesets.lock, flags);
@@ -6698,6 +6734,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	 */
 	pcp->high = BOOT_PAGESET_HIGH;
 	pcp->batch = BOOT_PAGESET_BATCH;
+	pcp->free_factor = 0;
 }
 
 static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
-- 
2.26.2


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 5/6] mm/page_alloc: Limit the number of pages on PCP lists when reclaim is active
  2021-05-25  8:01 [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Mel Gorman
                   ` (3 preceding siblings ...)
  2021-05-25  8:01 ` [PATCH 4/6] mm/page_alloc: Scale the number of pages that are batch freed Mel Gorman
@ 2021-05-25  8:01 ` Mel Gorman
  2021-05-28 11:43   ` Vlastimil Babka
  2021-05-25  8:01 ` [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction Mel Gorman
  2021-05-27 19:36 ` [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Dave Hansen
  6 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-25  8:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Vlastimil Babka, Michal Hocko, LKML,
	Linux-MM, Mel Gorman

When kswapd is active then direct reclaim is potentially active. In
either case, it is possible that a zone would be balanced if pages were
not trapped on PCP lists. Instead of draining remote pages, simply limit
the size of the PCP lists while kswapd is active.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 19 ++++++++++++++++++-
 mm/vmscan.c            | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 92182e0299b2..a0606239a167 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -647,6 +647,7 @@ enum zone_flags {
 	ZONE_BOOSTED_WATERMARK,		/* zone recently boosted watermarks.
 					 * Cleared when kswapd is woken.
 					 */
+	ZONE_RECLAIM_ACTIVE,		/* kswapd may be scanning the zone. */
 };
 
 static inline unsigned long zone_managed_pages(struct zone *zone)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89e60005dd27..9144b0c4b6c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3291,6 +3291,23 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
 	return batch;
 }
 
+static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
+{
+	int high = READ_ONCE(pcp->high);
+
+	if (unlikely(!high))
+		return 0;
+
+	if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+		return high;
+
+	/*
+	 * If reclaim is active, limit the number of pages that can be
+	 * stored on pcp lists
+	 */
+	return min(READ_ONCE(pcp->batch) << 2, high);
+}
+
 static void free_unref_page_commit(struct page *page, unsigned long pfn,
 				   int migratetype)
 {
@@ -3302,7 +3319,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn,
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	list_add(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
-	high = READ_ONCE(pcp->high);
+	high = nr_pcp_high(pcp, zone);
 	if (pcp->count >= high) {
 		int batch = READ_ONCE(pcp->batch);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5199b9696bab..c3c2100a80b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3722,6 +3722,38 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
 	return sc->nr_scanned >= sc->nr_to_reclaim;
 }
 
+/* Page allocator PCP high watermark is lowered if reclaim is active. */
+static inline void
+update_reclaim_active(pg_data_t *pgdat, int highest_zoneidx, bool active)
+{
+	int i;
+	struct zone *zone;
+
+	for (i = 0; i <= highest_zoneidx; i++) {
+		zone = pgdat->node_zones + i;
+
+		if (!managed_zone(zone))
+			continue;
+
+		if (active)
+			set_bit(ZONE_RECLAIM_ACTIVE, &zone->flags);
+		else
+			clear_bit(ZONE_RECLAIM_ACTIVE, &zone->flags);
+	}
+}
+
+static inline void
+set_reclaim_active(pg_data_t *pgdat, int highest_zoneidx)
+{
+	update_reclaim_active(pgdat, highest_zoneidx, true);
+}
+
+static inline void
+clear_reclaim_active(pg_data_t *pgdat, int highest_zoneidx)
+{
+	update_reclaim_active(pgdat, highest_zoneidx, false);
+}
+
 /*
  * For kswapd, balance_pgdat() will reclaim pages across a node from zones
  * that are eligible for use by the caller until at least one zone is
@@ -3774,6 +3806,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 	boosted = nr_boost_reclaim;
 
 restart:
+	set_reclaim_active(pgdat, highest_zoneidx);
 	sc.priority = DEF_PRIORITY;
 	do {
 		unsigned long nr_reclaimed = sc.nr_reclaimed;
@@ -3907,6 +3940,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		pgdat->kswapd_failures++;
 
 out:
+	clear_reclaim_active(pgdat, highest_zoneidx);
+
 	/* If reclaim was boosted, account for the reclaim done in this pass */
 	if (boosted) {
 		unsigned long flags;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction
  2021-05-25  8:01 [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Mel Gorman
                   ` (4 preceding siblings ...)
  2021-05-25  8:01 ` [PATCH 5/6] mm/page_alloc: Limit the number of pages on PCP lists when reclaim is active Mel Gorman
@ 2021-05-25  8:01 ` Mel Gorman
  2021-05-28 11:59   ` Vlastimil Babka
  2021-05-27 19:36 ` [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Dave Hansen
  6 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-25  8:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Vlastimil Babka, Michal Hocko, LKML,
	Linux-MM, Mel Gorman

This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention. However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled.
This sysctl only adjusts pcp->high so that zone->lock contention is
potentially reduced but allocation latency during a PCP refill remains
the same.

  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=8
  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  35071
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=64
              high:  4383
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=0
              high:  649
              batch: 63

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 20 +++++++
 include/linux/mmzone.h                  |  3 ++
 kernel/sysctl.c                         |  8 +++
 mm/page_alloc.c                         | 69 ++++++++++++++++++++++---
 4 files changed, 93 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 2fcafccb53a8..e85c2f21d209 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -64,6 +64,7 @@ files can be found in mm/swap.c.
 - overcommit_ratio
 - page-cluster
 - panic_on_oom
+- percpu_pagelist_high_fraction
 - stat_interval
 - stat_refresh
 - numa_stat
@@ -789,6 +790,25 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
+percpu_pagelist_high_fraction
+=============================
+
+This is the fraction of pages in each zone that are allocated for each
+per cpu page list.  The min value for this is 8.  It means that we do
+not allow more than 1/8th of pages in each zone to be allocated in any
+single per_cpu_pagelist.  This entry only changes the value of hot per
+cpu pagelists. User can specify a number like 100 to allocate 1/100th
+of each zone to each per cpu page list.
+
+The batch value of each per cpu pagelist remains the same regardless of the
+value of the high fraction so allocation latencies are unaffected.
+
+The initial value is zero. Kernel uses this value to set the high pcp->high
+mark based on the low watermark for the zone and the number of local
+online CPUs.  If the user writes '0' to this sysctl, it will revert to
+this default behavior.
+
+
 stat_interval
 =============
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a0606239a167..e20d98c62beb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1040,12 +1040,15 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void *,
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void *,
 		size_t *, loff_t *);
+int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *, int,
+		void *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 		void *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 		void *, size_t *, loff_t *);
 int numa_zonelist_order_handler(struct ctl_table *, int,
 		void *, size_t *, loff_t *);
+extern int percpu_pagelist_high_fraction;
 extern char numa_zonelist_order[];
 #define NUMA_ZONELIST_ORDER_LEN	16
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4e5ac50a1af0..9eb9d1f987d9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2889,6 +2889,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ONE,
 		.extra2		= &one_thousand,
 	},
+	{
+		.procname	= "percpu_pagelist_high_fraction",
+		.data		= &percpu_pagelist_high_fraction,
+		.maxlen		= sizeof(percpu_pagelist_high_fraction),
+		.mode		= 0644,
+		.proc_handler	= percpu_pagelist_high_fraction_sysctl_handler,
+		.extra1		= SYSCTL_ZERO,
+	},
 	{
 		.procname	= "page_lock_unfairness",
 		.data		= &sysctl_page_lock_unfairness,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9144b0c4b6c9..07e09b3c2bcf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -120,6 +120,7 @@ typedef int __bitwise fpi_t;
 
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
+#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
 
 struct pagesets {
 	local_lock_t lock;
@@ -181,6 +182,7 @@ EXPORT_SYMBOL(_totalram_pages);
 unsigned long totalreserve_pages __read_mostly;
 unsigned long totalcma_pages __read_mostly;
 
+int percpu_pagelist_high_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
 EXPORT_SYMBOL(init_on_alloc);
@@ -6686,17 +6688,32 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 #ifdef CONFIG_MMU
 	int high;
 	int nr_local_cpus;
+	unsigned long total_pages;
+
+	if (!percpu_pagelist_high_fraction) {
+		/*
+		 * By default, the high value of the pcp is based on the zone
+		 * low watermark so that if they are full then background
+		 * reclaim will not be started prematurely.
+		 */
+		total_pages = low_wmark_pages(zone);
+	} else {
+		/*
+		 * If percpu_pagelist_high_fraction is configured, the high
+		 * value is based on a fraction of the managed pages in the
+		 * zone.
+		 */
+		total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction;
+	}
 
 	/*
-	 * The high value of the pcp is based on the zone low watermark
-	 * so that if they are full then background reclaim will not be
-	 * started prematurely. The value is split across all online CPUs
-	 * local to the zone. Note that early in boot that CPUs may not be
-	 * online yet and that during CPU hotplug that the cpumask is not
-	 * yet updated when a CPU is being onlined.
+	 * Split the high value across all online CPUs local to the zone. Note
+	 * that early in boot that CPUs may not be online yet and that during
+	 * CPU hotplug that the cpumask is not yet updated when a CPU is being
+	 * onlined.
 	 */
 	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
-	high = low_wmark_pages(zone) / nr_local_cpus;
+	high = total_pages / nr_local_cpus;
 
 	/*
 	 * Ensure high is at least batch*4. The multiple is based on the
@@ -8462,6 +8479,44 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+/*
+ * percpu_pagelist_high_fraction - changes the pcp->high for each zone on each
+ * cpu. It is the fraction of total pages in each zone that a hot per cpu
+ * pagelist can have before it gets flushed back to buddy allocator.
+ */
+int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table,
+		int write, void *buffer, size_t *length, loff_t *ppos)
+{
+	struct zone *zone;
+	int old_percpu_pagelist_high_fraction;
+	int ret;
+
+	mutex_lock(&pcp_batch_high_lock);
+	old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
+
+	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (!write || ret < 0)
+		goto out;
+
+	/* Sanity checking to avoid pcp imbalance */
+	if (percpu_pagelist_high_fraction &&
+	    percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION) {
+		percpu_pagelist_high_fraction = old_percpu_pagelist_high_fraction;
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* No change? */
+	if (percpu_pagelist_high_fraction == old_percpu_pagelist_high_fraction)
+		goto out;
+
+	for_each_populated_zone(zone)
+		zone_set_pageset_high_and_batch(zone, 0);
+out:
+	mutex_unlock(&pcp_batch_high_lock);
+	return ret;
+}
+
 #ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES
 /*
  * Returns the number of pages that arch has reserved but
-- 
2.26.2


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/6] mm/page_alloc: Delete vm.percpu_pagelist_fraction
  2021-05-25  8:01 ` [PATCH 1/6] mm/page_alloc: Delete vm.percpu_pagelist_fraction Mel Gorman
@ 2021-05-26 17:41   ` Vlastimil Babka
  0 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2021-05-26 17:41 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Michal Hocko, LKML, Linux-MM

On 5/25/21 10:01 AM, Mel Gorman wrote:
> The vm.percpu_pagelist_fraction is used to increase the batch and high
> limits for the per-cpu page allocator (PCP). The intent behind the sysctl
> is to reduce zone lock acquisition when allocating/freeing pages but it has
> a problem. While it can decrease contention, it can also increase latency
> on the allocation side due to unreasonably large batch sizes. This leads
> to games where an administrator adjusts percpu_pagelist_fraction on the
> fly to work around contention and allocation latency problems.
> 
> This series aims to alleviate the problems with zone lock contention while
> avoiding the allocation-side latency problems. For the purposes of review,
> it's easier to remove this sysctl now and reintroduce a similar sysctl
> later in the series that deals only with pcp->high.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch
  2021-05-25  8:01 ` [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch Mel Gorman
@ 2021-05-26 18:14   ` Vlastimil Babka
  2021-05-27 10:52     ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Vlastimil Babka @ 2021-05-26 18:14 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Michal Hocko, LKML, Linux-MM

On 5/25/21 10:01 AM, Mel Gorman wrote:
> The pcp high watermark is based on the batch size but there is no
> relationship between them other than it is convenient to use early in
> boot.
> 
> This patch takes the first step and bases pcp->high on the zone low
> watermark split across the number of CPUs local to a zone while the batch
> size remains the same to avoid increasing allocation latencies. The intent
> behind the default pcp->high is "set the number of PCP pages such that
> if they are all full that background reclaim is not started prematurely".
> 
> Note that in this patch the pcp->high values are adjusted after memory
> hotplug events, min_free_kbytes adjustments and watermark scale factor
> adjustments but not CPU hotplug events which is handled later in the
> series.
> 
> On a test KVM instance;
> 
> Before grep -E "high:|batch" /proc/zoneinfo | tail -2
>               high:  378
>               batch: 63
> 
> After grep -E "high:|batch" /proc/zoneinfo | tail -2
>               high:  649
>               batch: 63
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

...

> @@ -6637,6 +6628,34 @@ static int zone_batchsize(struct zone *zone)
>  #endif
>  }
>  
> +static int zone_highsize(struct zone *zone, int batch)
> +{
> +#ifdef CONFIG_MMU
> +	int high;
> +	int nr_local_cpus;
> +
> +	/*
> +	 * The high value of the pcp is based on the zone low watermark
> +	 * so that if they are full then background reclaim will not be
> +	 * started prematurely. The value is split across all online CPUs
> +	 * local to the zone. Note that early in boot that CPUs may not be
> +	 * online yet.
> +	 */
> +	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> +	high = low_wmark_pages(zone) / nr_local_cpus;
> +
> +	/*
> +	 * Ensure high is at least batch*4. The multiple is based on the
> +	 * historical relationship between high and batch.
> +	 */
> +	high = max(high, batch << 2);
> +
> +	return high;
> +#else
> +	return 0;
> +#endif
> +}
> +
>  /*
>   * pcp->high and pcp->batch values are related and generally batch is lower
>   * than high. They are also related to pcp->count such that count is lower
> @@ -6698,11 +6717,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
>   */
>  static void zone_set_pageset_high_and_batch(struct zone *zone)
>  {
> -	unsigned long new_high, new_batch;
> +	int new_high, new_batch;
>  
> -	new_batch = zone_batchsize(zone);
> -	new_high = 6 * new_batch;
> -	new_batch = max(1UL, 1 * new_batch);
> +	new_batch = max(1, zone_batchsize(zone));
> +	new_high = zone_highsize(zone, new_batch);
>  
>  	if (zone->pageset_high == new_high &&
>  	    zone->pageset_batch == new_batch)
> @@ -8170,6 +8188,12 @@ static void __setup_per_zone_wmarks(void)
>  		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
>  		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
>  
> +		/*
> +		 * The watermark size have changed so update the pcpu batch
> +		 * and high limits or the limits may be inappropriate.
> +		 */
> +		zone_set_pageset_high_and_batch(zone);

Hm so this puts the call in the path of various watermark related sysctl
handlers, but it's not protected by pcp_batch_high_lock. The zone lock won't
help against zone_pcp_update() from a hotplug handler. On the other hand, since
hotplug handlers also call __setup_per_zone_wmarks(), the zone_pcp_update()
calls there are now redundant and could be removed, no?
But later there will be a new sysctl in patch 6/6 using pcp_batch_high_lock,
thus that one will not be protected against the watermark related sysctl
handlers that reach here.

To solve all this, seems like the static lock in setup_per_zone_wmarks() could
become a top-level visible lock and pcp high/batch updates could switch to that
one instead of own pcp_batch_high_lock. And zone_pcp_update() calls from hotplug
handlers could be removed.

> +
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  	}
>  
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch
  2021-05-26 18:14   ` Vlastimil Babka
@ 2021-05-27 10:52     ` Mel Gorman
  2021-05-28 10:27       ` Vlastimil Babka
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-27 10:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Michal Hocko, LKML, Linux-MM

On Wed, May 26, 2021 at 08:14:13PM +0200, Vlastimil Babka wrote:
> > @@ -6698,11 +6717,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
> >   */
> >  static void zone_set_pageset_high_and_batch(struct zone *zone)
> >  {
> > -	unsigned long new_high, new_batch;
> > +	int new_high, new_batch;
> >  
> > -	new_batch = zone_batchsize(zone);
> > -	new_high = 6 * new_batch;
> > -	new_batch = max(1UL, 1 * new_batch);
> > +	new_batch = max(1, zone_batchsize(zone));
> > +	new_high = zone_highsize(zone, new_batch);
> >  
> >  	if (zone->pageset_high == new_high &&
> >  	    zone->pageset_batch == new_batch)
> > @@ -8170,6 +8188,12 @@ static void __setup_per_zone_wmarks(void)
> >  		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
> >  		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
> >  
> > +		/*
> > +		 * The watermark size have changed so update the pcpu batch
> > +		 * and high limits or the limits may be inappropriate.
> > +		 */
> > +		zone_set_pageset_high_and_batch(zone);
> 
> Hm so this puts the call in the path of various watermark related sysctl
> handlers, but it's not protected by pcp_batch_high_lock. The zone lock won't
> help against zone_pcp_update() from a hotplug handler. On the other hand, since
> hotplug handlers also call __setup_per_zone_wmarks(), the zone_pcp_update()
> calls there are now redundant and could be removed, no?
> But later there will be a new sysctl in patch 6/6 using pcp_batch_high_lock,
> thus that one will not be protected against the watermark related sysctl
> handlers that reach here.
> 
> To solve all this, seems like the static lock in setup_per_zone_wmarks() could
> become a top-level visible lock and pcp high/batch updates could switch to that
> one instead of own pcp_batch_high_lock. And zone_pcp_update() calls from hotplug
> handlers could be removed.
> 

Hmm, the locking has very different hold times. The static lock in
setup_per_zone_wmarks is a spinlock that protects against parallel updates
of watermarks and is held for a short duration. The pcp_batch_high_lock
is a mutex that is held for a relatively long time while memory is being
offlined and can sleep. Memory hotplug updates the watermarks without
pcp_batch_high_lock held so overall, unifying the locking there should
be a separate series.

How about this as a fix for this patch?

---8<---
mm/page_alloc: Disassociate the pcp->high from pcp->batch -fix

Vlastimil Babka noted that __setup_per_zone_wmarks updating pcp->high
did not protect watermark-related sysctl handlers from a parallel
memory hotplug operations. This patch moves the PCP update to
setup_per_zone_wmarks and updates the PCP high value while protected
by the pcp_batch_high_lock mutex.

This is a fix to the mmotm patch mm-page_alloc-disassociate-the-pcp-high-from-pcp-batch.patch.
It'll cause a conflict with mm-page_alloc-adjust-pcp-high-after-cpu-hotplug-events.patch
but the resolution is simply to change the caller in setup_per_zone_wmarks
to zone_pcp_update(zone, 0)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 329b71e41db4..b1b3c66e9d88 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8199,12 +8199,6 @@ static void __setup_per_zone_wmarks(void)
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
-		/*
-		 * The watermark size have changed so update the pcpu batch
-		 * and high limits or the limits may be inappropriate.
-		 */
-		zone_set_pageset_high_and_batch(zone);
-
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
@@ -8221,11 +8215,19 @@ static void __setup_per_zone_wmarks(void)
  */
 void setup_per_zone_wmarks(void)
 {
+	struct zone *zone;
 	static DEFINE_SPINLOCK(lock);
 
 	spin_lock(&lock);
 	__setup_per_zone_wmarks();
 	spin_unlock(&lock);
+
+	/*
+	 * The watermark size have changed so update the pcpu batch
+	 * and high limits or the limits may be inappropriate.
+	 */
+	for_each_zone(zone)
+		zone_pcp_update(zone);
 }
 
 /*

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-25  8:01 [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Mel Gorman
                   ` (5 preceding siblings ...)
  2021-05-25  8:01 ` [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction Mel Gorman
@ 2021-05-27 19:36 ` Dave Hansen
  2021-05-28  8:55   ` Mel Gorman
  6 siblings, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2021-05-27 19:36 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Vlastimil Babka, Michal Hocko, LKML,
	Linux-MM, Tang, Feng

Hi Mel,

Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
use" mode and being managed via the buddy just like the normal RAM.

The PMEM zones are big ones:

        present  65011712 = 248 G
        high       134595 = 525 M

The PMEM nodes, of course, don't have any CPUs in them.

With your series, the pcp->high value per-cpu is 69584 pages or about
270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
worst-case memory in the pcps per zone, or roughly 10% of the size of
the zone.

I did see quite a few pcp->counts above 60,000, so it's definitely
possible in practice to see the pcps filled up.  This was not observed
to cause any actual problems in practice.  But, it's still a bit worrisome.

Maybe instead of pretending that cpu-less nodes have one cpu:

	nr_local_cpus = max(1U,
cpumask_weight(cpumask_of_node(zone_to_nid(zone))));

we should just consider them to have *all* the CPUs in the system.  Perhaps:

	nr_local_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone)));
	if (!nr_local_cpus)
		nr_local_cpus = num_online_cpus();

Even if a system has a silly number of CPUs, the 'high' sizes will still
hit a floor at 4*batch:

	high = max(high, batch << 2);

Which doesn't seem too bad, especially considering that CPU-less nodes
naturally have less contention.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-27 19:36 ` [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Dave Hansen
@ 2021-05-28  8:55   ` Mel Gorman
  2021-05-28  9:03     ` David Hildenbrand
                       ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Mel Gorman @ 2021-05-28  8:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM, Tang, Feng

On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
> Hi Mel,
> 
> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
> ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
> use" mode and being managed via the buddy just like the normal RAM.
> 
> The PMEM zones are big ones:
> 
>         present  65011712 = 248 G
>         high       134595 = 525 M
> 
> The PMEM nodes, of course, don't have any CPUs in them.
> 
> With your series, the pcp->high value per-cpu is 69584 pages or about
> 270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
> worst-case memory in the pcps per zone, or roughly 10% of the size of
> the zone.
> 
> I did see quite a few pcp->counts above 60,000, so it's definitely
> possible in practice to see the pcps filled up.  This was not observed
> to cause any actual problems in practice.  But, it's still a bit worrisome.
> 

Ok, it does have the potential to trigger early reclaim as pages are
stored on remote PCP lists. The problem would be transient because
vmstat would drain those pages over time but still, how about this patch
on top of the series?

--8<--
mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes

Dave Hansen reported the following about Feng Tang's tests on a machine
with persistent memory onlined as a DRAM-like device.

  Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
  ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
  use" mode and being managed via the buddy just like the normal RAM.

  The PMEM zones are big ones:

        present  65011712 = 248 G
        high       134595 = 525 M

  The PMEM nodes, of course, don't have any CPUs in them.

  With your series, the pcp->high value per-cpu is 69584 pages or about
  270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
  worst-case memory in the pcps per zone, or roughly 10% of the size of
  the zone.

This should not cause a problem as such although it could trigger reclaim
due to pages being stored on per-cpu lists for CPUs remote to a node. It
is not possible to treat cpuless nodes exactly the same as normal nodes
but the worst-case scenario can be mitigated by splitting pcp->high across
all online CPUs for cpuless memory nodes.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d708aa14f4ef..af566e97a0f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6687,7 +6687,7 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 {
 #ifdef CONFIG_MMU
 	int high;
-	int nr_local_cpus;
+	int nr_split_cpus;
 	unsigned long total_pages;
 
 	if (!percpu_pagelist_high_fraction) {
@@ -6710,10 +6710,14 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 	 * Split the high value across all online CPUs local to the zone. Note
 	 * that early in boot that CPUs may not be online yet and that during
 	 * CPU hotplug that the cpumask is not yet updated when a CPU is being
-	 * onlined.
-	 */
-	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
-	high = total_pages / nr_local_cpus;
+	 * onlined. For memory nodes that have no CPUs, split pcp->high across
+	 * all online CPUs to mitigate the risk that reclaim is triggered
+	 * prematurely due to pages stored on pcp lists.
+	 */
+	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
+	if (!nr_split_cpus)
+		nr_split_cpus = num_online_cpus();
+	high = total_pages / nr_split_cpus;
 
 	/*
 	 * Ensure high is at least batch*4. The multiple is based on the

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28  8:55   ` Mel Gorman
@ 2021-05-28  9:03     ` David Hildenbrand
  2021-05-28  9:08       ` David Hildenbrand
  2021-05-28 12:12     ` Vlastimil Babka
  2021-05-28 14:39     ` Dave Hansen
  2 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-05-28  9:03 UTC (permalink / raw)
  To: Mel Gorman, Dave Hansen
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM, Tang, Feng

On 28.05.21 10:55, Mel Gorman wrote:
> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>> Hi Mel,
>>
>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
>> ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
>> use" mode and being managed via the buddy just like the normal RAM.
>>
>> The PMEM zones are big ones:
>>
>>          present  65011712 = 248 G
>>          high       134595 = 525 M
>>
>> The PMEM nodes, of course, don't have any CPUs in them.
>>
>> With your series, the pcp->high value per-cpu is 69584 pages or about
>> 270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
>> worst-case memory in the pcps per zone, or roughly 10% of the size of
>> the zone.

When I read about having such big amounts of free memory theoretically 
stuck in PCP lists, I guess we really want to start draining the PCP in 
alloc_contig_range(), just as we do with memory hotunplug when offlining.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28  9:03     ` David Hildenbrand
@ 2021-05-28  9:08       ` David Hildenbrand
  2021-05-28  9:49         ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-05-28  9:08 UTC (permalink / raw)
  To: Mel Gorman, Dave Hansen
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM, Tang, Feng

On 28.05.21 11:03, David Hildenbrand wrote:
> On 28.05.21 10:55, Mel Gorman wrote:
>> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>>> Hi Mel,
>>>
>>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
>>> ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
>>> use" mode and being managed via the buddy just like the normal RAM.
>>>
>>> The PMEM zones are big ones:
>>>
>>>           present  65011712 = 248 G
>>>           high       134595 = 525 M
>>>
>>> The PMEM nodes, of course, don't have any CPUs in them.
>>>
>>> With your series, the pcp->high value per-cpu is 69584 pages or about
>>> 270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
>>> worst-case memory in the pcps per zone, or roughly 10% of the size of
>>> the zone.
> 
> When I read about having such big amounts of free memory theoretically
> stuck in PCP lists, I guess we really want to start draining the PCP in
> alloc_contig_range(), just as we do with memory hotunplug when offlining.
> 

Correction: we already drain the pcp, we just don't temporarily disable 
it, so a race as described in offline_pages() could apply:

"Disable pcplists so that page isolation cannot race with freeing
  in a way that pages from isolated pageblock are left on pcplists."

Guess we'd then want to move the draining before 
start_isolate_page_range() in alloc_contig_range().

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28  9:08       ` David Hildenbrand
@ 2021-05-28  9:49         ` Mel Gorman
  2021-05-28  9:52           ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-28  9:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Andrew Morton, Hillf Danton, Dave Hansen,
	Vlastimil Babka, Michal Hocko, LKML, Linux-MM, Tang, Feng

On Fri, May 28, 2021 at 11:08:01AM +0200, David Hildenbrand wrote:
> On 28.05.21 11:03, David Hildenbrand wrote:
> > On 28.05.21 10:55, Mel Gorman wrote:
> > > On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
> > > > Hi Mel,
> > > > 
> > > > Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
> > > > ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
> > > > use" mode and being managed via the buddy just like the normal RAM.
> > > > 
> > > > The PMEM zones are big ones:
> > > > 
> > > >           present  65011712 = 248 G
> > > >           high       134595 = 525 M
> > > > 
> > > > The PMEM nodes, of course, don't have any CPUs in them.
> > > > 
> > > > With your series, the pcp->high value per-cpu is 69584 pages or about
> > > > 270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
> > > > worst-case memory in the pcps per zone, or roughly 10% of the size of
> > > > the zone.
> > 
> > When I read about having such big amounts of free memory theoretically
> > stuck in PCP lists, I guess we really want to start draining the PCP in
> > alloc_contig_range(), just as we do with memory hotunplug when offlining.
> > 
> 
> Correction: we already drain the pcp, we just don't temporarily disable it,
> so a race as described in offline_pages() could apply:
> 
> "Disable pcplists so that page isolation cannot race with freeing
>  in a way that pages from isolated pageblock are left on pcplists."
> 
> Guess we'd then want to move the draining before start_isolate_page_range()
> in alloc_contig_range().
> 

Or instead of draining, validate the PFN range in alloc_contig_range
is within the same zone and if so, call zone_pcp_disable() before
start_isolate_page_range and enable after __alloc_contig_migrate_range.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28  9:49         ` Mel Gorman
@ 2021-05-28  9:52           ` David Hildenbrand
  2021-05-28 10:09             ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-05-28  9:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Hansen, Andrew Morton, Hillf Danton, Dave Hansen,
	Vlastimil Babka, Michal Hocko, LKML, Linux-MM, Tang, Feng

On 28.05.21 11:49, Mel Gorman wrote:
> On Fri, May 28, 2021 at 11:08:01AM +0200, David Hildenbrand wrote:
>> On 28.05.21 11:03, David Hildenbrand wrote:
>>> On 28.05.21 10:55, Mel Gorman wrote:
>>>> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>>>>> Hi Mel,
>>>>>
>>>>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
>>>>> ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
>>>>> use" mode and being managed via the buddy just like the normal RAM.
>>>>>
>>>>> The PMEM zones are big ones:
>>>>>
>>>>>            present  65011712 = 248 G
>>>>>            high       134595 = 525 M
>>>>>
>>>>> The PMEM nodes, of course, don't have any CPUs in them.
>>>>>
>>>>> With your series, the pcp->high value per-cpu is 69584 pages or about
>>>>> 270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
>>>>> worst-case memory in the pcps per zone, or roughly 10% of the size of
>>>>> the zone.
>>>
>>> When I read about having such big amounts of free memory theoretically
>>> stuck in PCP lists, I guess we really want to start draining the PCP in
>>> alloc_contig_range(), just as we do with memory hotunplug when offlining.
>>>
>>
>> Correction: we already drain the pcp, we just don't temporarily disable it,
>> so a race as described in offline_pages() could apply:
>>
>> "Disable pcplists so that page isolation cannot race with freeing
>>   in a way that pages from isolated pageblock are left on pcplists."
>>
>> Guess we'd then want to move the draining before start_isolate_page_range()
>> in alloc_contig_range().
>>
> 
> Or instead of draining, validate the PFN range in alloc_contig_range
> is within the same zone and if so, call zone_pcp_disable() before
> start_isolate_page_range and enable after __alloc_contig_migrate_range.
> 

We require the caller to only pass a range within a single zone, so that 
should be fine.

The only ugly thing about zone_pcp_disable() is 
mutex_lock(&pcp_batch_high_lock) which would serialize all 
alloc_contig_range() and even with offline_pages().

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28  9:52           ` David Hildenbrand
@ 2021-05-28 10:09             ` Mel Gorman
  2021-05-28 10:21               ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-28 10:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Andrew Morton, Hillf Danton, Dave Hansen,
	Vlastimil Babka, Michal Hocko, LKML, Linux-MM, Tang, Feng

On Fri, May 28, 2021 at 11:52:53AM +0200, David Hildenbrand wrote:
> > > "Disable pcplists so that page isolation cannot race with freeing
> > >   in a way that pages from isolated pageblock are left on pcplists."
> > > 
> > > Guess we'd then want to move the draining before start_isolate_page_range()
> > > in alloc_contig_range().
> > > 
> > 
> > Or instead of draining, validate the PFN range in alloc_contig_range
> > is within the same zone and if so, call zone_pcp_disable() before
> > start_isolate_page_range and enable after __alloc_contig_migrate_range.
> > 
> 
> We require the caller to only pass a range within a single zone, so that
> should be fine.
> 
> The only ugly thing about zone_pcp_disable() is
> mutex_lock(&pcp_batch_high_lock) which would serialize all
> alloc_contig_range() and even with offline_pages().
> 

True so it would have to be accessed if that is bad or not. If racing
against offline_pages, memory is potentially being offlined in the
target zone which may cause allocation failure. If racing with other
alloc_contig_range calls, the two callers are potentially racing to
isolate and allocate the same range. The arguement could be made that
alloc_contig_ranges should be serialised within one zone to improve the
allocation success rate at the potential cost of allocation latency.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28 10:09             ` Mel Gorman
@ 2021-05-28 10:21               ` David Hildenbrand
  0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-05-28 10:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Hansen, Andrew Morton, Hillf Danton, Dave Hansen,
	Vlastimil Babka, Michal Hocko, LKML, Linux-MM, Tang, Feng

On 28.05.21 12:09, Mel Gorman wrote:
> On Fri, May 28, 2021 at 11:52:53AM +0200, David Hildenbrand wrote:
>>>> "Disable pcplists so that page isolation cannot race with freeing
>>>>    in a way that pages from isolated pageblock are left on pcplists."
>>>>
>>>> Guess we'd then want to move the draining before start_isolate_page_range()
>>>> in alloc_contig_range().
>>>>
>>>
>>> Or instead of draining, validate the PFN range in alloc_contig_range
>>> is within the same zone and if so, call zone_pcp_disable() before
>>> start_isolate_page_range and enable after __alloc_contig_migrate_range.
>>>
>>
>> We require the caller to only pass a range within a single zone, so that
>> should be fine.
>>
>> The only ugly thing about zone_pcp_disable() is
>> mutex_lock(&pcp_batch_high_lock) which would serialize all
>> alloc_contig_range() and even with offline_pages().
>>
> 
> True so it would have to be accessed if that is bad or not. If racing
> against offline_pages, memory is potentially being offlined in the
> target zone which may cause allocation failure. If racing with other
> alloc_contig_range calls, the two callers are potentially racing to
> isolate and allocate the same range. The arguement could be made that
> alloc_contig_ranges should be serialised within one zone to improve the
> allocation success rate at the potential cost of allocation latency.

We have 3 main users of alloc_contig_range():

1. CMA

CMA synchronizes allocation within a CMA area via the allocation bitmap. 
So parallel CMA is perfectly possible and avoids races by design.

2. alloc_contig_pages() / Gigantic pages

Gigantic page allocation could race with virtio-mem. CMA does not apply. 
Possible but unlikely to matter in practice. virtio-mem will retry later 
again.

3. virito-mem

A virtio-mem device only operates on its assigned memory region, so we 
cannot have alloc_contig_range() from different devices racing, even 
when within a single zone. It could only race with gigantic pages as CMA 
does not apply.


So serializing would mostly harm parallel CMA (possible and likely) and 
parallel virtio-mem operation (e.g., unplug memory of two virtio-mem 
devices -- unlikely but possible).


Memory offlining racing with CMA is not an issue (impossible). 
virtio-mem synchronizes with memory offlining via memory notifiers, 
there is only a tiny little race window that usually doesn't matter as 
virtio-mem is expected to usually triggers offlining itself, and not 
user space rancomly. Memory offlining can race with dynamic gigantic 
page allocation, wich is highly unreliable already.

I wonder if we could optimize locking in zone_pcp_disable() instead.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch
  2021-05-27 10:52     ` Mel Gorman
@ 2021-05-28 10:27       ` Vlastimil Babka
  0 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2021-05-28 10:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Michal Hocko, LKML, Linux-MM

On 5/27/21 12:52 PM, Mel Gorman wrote:
> On Wed, May 26, 2021 at 08:14:13PM +0200, Vlastimil Babka wrote:
>> > @@ -6698,11 +6717,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
>> >   */
>> >  static void zone_set_pageset_high_and_batch(struct zone *zone)
>> >  {
>> > -	unsigned long new_high, new_batch;
>> > +	int new_high, new_batch;
>> >  
>> > -	new_batch = zone_batchsize(zone);
>> > -	new_high = 6 * new_batch;
>> > -	new_batch = max(1UL, 1 * new_batch);
>> > +	new_batch = max(1, zone_batchsize(zone));
>> > +	new_high = zone_highsize(zone, new_batch);
>> >  
>> >  	if (zone->pageset_high == new_high &&
>> >  	    zone->pageset_batch == new_batch)
>> > @@ -8170,6 +8188,12 @@ static void __setup_per_zone_wmarks(void)
>> >  		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
>> >  		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
>> >  
>> > +		/*
>> > +		 * The watermark size have changed so update the pcpu batch
>> > +		 * and high limits or the limits may be inappropriate.
>> > +		 */
>> > +		zone_set_pageset_high_and_batch(zone);
>> 
>> Hm so this puts the call in the path of various watermark related sysctl
>> handlers, but it's not protected by pcp_batch_high_lock. The zone lock won't
>> help against zone_pcp_update() from a hotplug handler. On the other hand, since
>> hotplug handlers also call __setup_per_zone_wmarks(), the zone_pcp_update()
>> calls there are now redundant and could be removed, no?
>> But later there will be a new sysctl in patch 6/6 using pcp_batch_high_lock,
>> thus that one will not be protected against the watermark related sysctl
>> handlers that reach here.
>> 
>> To solve all this, seems like the static lock in setup_per_zone_wmarks() could
>> become a top-level visible lock and pcp high/batch updates could switch to that
>> one instead of own pcp_batch_high_lock. And zone_pcp_update() calls from hotplug
>> handlers could be removed.
>> 
> 
> Hmm, the locking has very different hold times. The static lock in
> setup_per_zone_wmarks is a spinlock that protects against parallel updates
> of watermarks and is held for a short duration. The pcp_batch_high_lock
> is a mutex that is held for a relatively long time while memory is being
> offlined and can sleep. Memory hotplug updates the watermarks without
> pcp_batch_high_lock held so overall, unifying the locking there should
> be a separate series.
> 
> How about this as a fix for this patch?
> 
> ---8<---
> mm/page_alloc: Disassociate the pcp->high from pcp->batch -fix
> 
> Vlastimil Babka noted that __setup_per_zone_wmarks updating pcp->high
> did not protect watermark-related sysctl handlers from a parallel
> memory hotplug operations. This patch moves the PCP update to
> setup_per_zone_wmarks and updates the PCP high value while protected
> by the pcp_batch_high_lock mutex.
> 
> This is a fix to the mmotm patch mm-page_alloc-disassociate-the-pcp-high-from-pcp-batch.patch.
> It'll cause a conflict with mm-page_alloc-adjust-pcp-high-after-cpu-hotplug-events.patch
> but the resolution is simply to change the caller in setup_per_zone_wmarks
> to zone_pcp_update(zone, 0)
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Looks fine. But I would also remove the redudancy introduced by this patch+fix,
as part of the patch:

online_pages()
  zone_pcp_update(zone); <- this predates the patch
  init_per_zone_wmark_min()
    setup_per_zone_wmarks()
      for_each_zone(zone)
         zone_pcp_update(zone); <- new in this patch

offline_pages() similarly

In any case, for the fixed version:
Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 329b71e41db4..b1b3c66e9d88 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8199,12 +8199,6 @@ static void __setup_per_zone_wmarks(void)
>  		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
>  		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
>  
> -		/*
> -		 * The watermark size have changed so update the pcpu batch
> -		 * and high limits or the limits may be inappropriate.
> -		 */
> -		zone_set_pageset_high_and_batch(zone);
> -
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  	}
>  
> @@ -8221,11 +8215,19 @@ static void __setup_per_zone_wmarks(void)
>   */
>  void setup_per_zone_wmarks(void)
>  {
> +	struct zone *zone;
>  	static DEFINE_SPINLOCK(lock);
>  
>  	spin_lock(&lock);
>  	__setup_per_zone_wmarks();
>  	spin_unlock(&lock);
> +
> +	/*
> +	 * The watermark size have changed so update the pcpu batch
> +	 * and high limits or the limits may be inappropriate.
> +	 */
> +	for_each_zone(zone)
> +		zone_pcp_update(zone);
>  }
>  
>  /*
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events
  2021-05-25  8:01 ` [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events Mel Gorman
@ 2021-05-28 11:08   ` Vlastimil Babka
  0 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2021-05-28 11:08 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Michal Hocko, LKML, Linux-MM

On 5/25/21 10:01 AM, Mel Gorman wrote:
> The PCP high watermark is based on the number of online CPUs so the
> watermarks must be adjusted during CPU hotplug. At the time of
> hot-remove, the number of online CPUs is already adjusted but during
> hot-add, a delta needs to be applied to update PCP to the correct
> value. After this patch is applied, the high watermarks are adjusted
> correctly.
> 
>   # grep high: /proc/zoneinfo  | tail -1
>               high:  649
>   # echo 0 > /sys/devices/system/cpu/cpu4/online
>   # grep high: /proc/zoneinfo  | tail -1
>               high:  664
>   # echo 1 > /sys/devices/system/cpu/cpu4/online
>   # grep high: /proc/zoneinfo  | tail -1
>               high:  649
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/6] mm/page_alloc: Scale the number of pages that are batch freed
  2021-05-25  8:01 ` [PATCH 4/6] mm/page_alloc: Scale the number of pages that are batch freed Mel Gorman
@ 2021-05-28 11:19   ` Vlastimil Babka
  0 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2021-05-28 11:19 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Michal Hocko, LKML, Linux-MM

On 5/25/21 10:01 AM, Mel Gorman wrote:
> When a task is freeing a large number of order-0 pages, it may acquire
> the zone->lock multiple times freeing pages in batches. This may
> unnecessarily contend on the zone lock when freeing very large number
> of pages. This patch adapts the size of the batch based on the recent
> pattern to scale the batch size for subsequent frees.
> 
> As the machines I used were not large enough to test this are not large
> enough to illustrate a problem, a debugging patch shows patterns like
> the following (slightly editted for clarity)
> 
> Baseline vanilla kernel
>   time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
>   time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
>   time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
>   time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
>   time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
> 
> With patches
>   time-unmap-7724    [...] free_pcppages_bulk: free  126 count  814 high  814
>   time-unmap-7724    [...] free_pcppages_bulk: free  252 count  814 high  814
>   time-unmap-7724    [...] free_pcppages_bulk: free  504 count  814 high  814
>   time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
>   time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/6] mm/page_alloc: Limit the number of pages on PCP lists when reclaim is active
  2021-05-25  8:01 ` [PATCH 5/6] mm/page_alloc: Limit the number of pages on PCP lists when reclaim is active Mel Gorman
@ 2021-05-28 11:43   ` Vlastimil Babka
  0 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2021-05-28 11:43 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Michal Hocko, LKML, Linux-MM

On 5/25/21 10:01 AM, Mel Gorman wrote:
> When kswapd is active then direct reclaim is potentially active. In
> either case, it is possible that a zone would be balanced if pages were
> not trapped on PCP lists. Instead of draining remote pages, simply limit
> the size of the PCP lists while kswapd is active.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction
  2021-05-25  8:01 ` [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction Mel Gorman
@ 2021-05-28 11:59   ` Vlastimil Babka
  2021-05-28 12:53     ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Vlastimil Babka @ 2021-05-28 11:59 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Hillf Danton, Dave Hansen, Michal Hocko, LKML, Linux-MM

On 5/25/21 10:01 AM, Mel Gorman wrote:
> This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
> similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
> both pcp->batch and pcp->high with the higher pcp->high potentially
> reducing zone->lock contention. However, the higher pcp->batch value also
> potentially increased allocation latency while the PCP was refilled.
> This sysctl only adjusts pcp->high so that zone->lock contention is
> potentially reduced but allocation latency during a PCP refill remains
> the same.
> 
>   # grep -E "high:|batch" /proc/zoneinfo | tail -2
>               high:  649
>               batch: 63
> 
>   # sysctl vm.percpu_pagelist_high_fraction=8
>   # grep -E "high:|batch" /proc/zoneinfo | tail -2
>               high:  35071
>               batch: 63
> 
>   # sysctl vm.percpu_pagelist_high_fraction=64
>               high:  4383
>               batch: 63
> 
>   # sysctl vm.percpu_pagelist_high_fraction=0
>               high:  649
>               batch: 63
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Documentation nit below:

> @@ -789,6 +790,25 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
>  why oom happens. You can get snapshot.
>  
>  
> +percpu_pagelist_high_fraction
> +=============================
> +
> +This is the fraction of pages in each zone that are allocated for each
> +per cpu page list.  The min value for this is 8.  It means that we do
> +not allow more than 1/8th of pages in each zone to be allocated in any
> +single per_cpu_pagelist.

This, while technically correct (as an upper limit) is somewhat misleading as
the limit for a single per_cpu_pagelist also considers the number of local cpus.

>  This entry only changes the value of hot per
> +cpu pagelists. User can specify a number like 100 to allocate 1/100th
> +of each zone to each per cpu page list.

This is worse. Anyone trying to reproduce this example on a system with multiple
cpus per node and checking the result will be puzzled.
So I think the part about number of local cpus should be mentioned to avoid
confusion.

> +The batch value of each per cpu pagelist remains the same regardless of the
> +value of the high fraction so allocation latencies are unaffected.
> +
> +The initial value is zero. Kernel uses this value to set the high pcp->high
> +mark based on the low watermark for the zone and the number of local
> +online CPUs.  If the user writes '0' to this sysctl, it will revert to
> +this default behavior.
> +
> +

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28  8:55   ` Mel Gorman
  2021-05-28  9:03     ` David Hildenbrand
@ 2021-05-28 12:12     ` Vlastimil Babka
  2021-05-28 12:37       ` Mel Gorman
  2021-05-28 14:39     ` Dave Hansen
  2 siblings, 1 reply; 34+ messages in thread
From: Vlastimil Babka @ 2021-05-28 12:12 UTC (permalink / raw)
  To: Mel Gorman, Dave Hansen
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Michal Hocko, LKML,
	Linux-MM, Tang, Feng

On 5/28/21 10:55 AM, Mel Gorman wrote:
> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>> Hi Mel,
>> 
>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
>> ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
>> use" mode and being managed via the buddy just like the normal RAM.
>> 
>> The PMEM zones are big ones:
>> 
>>         present  65011712 = 248 G
>>         high       134595 = 525 M
>> 
>> The PMEM nodes, of course, don't have any CPUs in them.
>> 
>> With your series, the pcp->high value per-cpu is 69584 pages or about
>> 270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
>> worst-case memory in the pcps per zone, or roughly 10% of the size of
>> the zone.
>> 
>> I did see quite a few pcp->counts above 60,000, so it's definitely
>> possible in practice to see the pcps filled up.  This was not observed
>> to cause any actual problems in practice.  But, it's still a bit worrisome.
>> 
> 
> Ok, it does have the potential to trigger early reclaim as pages are
> stored on remote PCP lists. The problem would be transient because
> vmstat would drain those pages over time but still, how about this patch
> on top of the series?
> 
> --8<--
> mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes
> 
> Dave Hansen reported the following about Feng Tang's tests on a machine
> with persistent memory onlined as a DRAM-like device.
> 
>   Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
>   ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
>   use" mode and being managed via the buddy just like the normal RAM.
> 
>   The PMEM zones are big ones:
> 
>         present  65011712 = 248 G
>         high       134595 = 525 M
> 
>   The PMEM nodes, of course, don't have any CPUs in them.
> 
>   With your series, the pcp->high value per-cpu is 69584 pages or about
>   270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
>   worst-case memory in the pcps per zone, or roughly 10% of the size of
>   the zone.
> 
> This should not cause a problem as such although it could trigger reclaim
> due to pages being stored on per-cpu lists for CPUs remote to a node. It
> is not possible to treat cpuless nodes exactly the same as normal nodes
> but the worst-case scenario can be mitigated by splitting pcp->high across
> all online CPUs for cpuless memory nodes.
> 
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Maybe we should even consider distinguishing high limits for local-to-cpu zones
vs remote, for example for the local-to-cpu zones we would divide by the number
of local cpus, for remote-to-cpu zones we would divide by all cpus.

Because we can expect cpus to allocate mostly from local zones, so leaving more
pages on percpu for those zones can be beneficial.

But as the motivation here was to reduce lock contention on freeing, that's less
clear. We probably can't expect the cpu to be freeing mostly local pages (in
case of e.g. a large process exiting), because no mechanism works towards that,
or does it? In case of cpu freeing to remote zone, the lower high limit could hurt.

So that would have to be evaluated if that works in practice. Out of scope here,
just an idea to discuss.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28 12:12     ` Vlastimil Babka
@ 2021-05-28 12:37       ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2021-05-28 12:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Dave Hansen, Andrew Morton, Hillf Danton, Dave Hansen,
	Michal Hocko, LKML, Linux-MM, Tang, Feng

On Fri, May 28, 2021 at 02:12:09PM +0200, Vlastimil Babka wrote:
> > mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes
> > 
> > Dave Hansen reported the following about Feng Tang's tests on a machine
> > with persistent memory onlined as a DRAM-like device.
> > 
> >   Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
> >   ~512G of persistent memory and 128G of DRAM.  The PMEM is in "volatile
> >   use" mode and being managed via the buddy just like the normal RAM.
> > 
> >   The PMEM zones are big ones:
> > 
> >         present  65011712 = 248 G
> >         high       134595 = 525 M
> > 
> >   The PMEM nodes, of course, don't have any CPUs in them.
> > 
> >   With your series, the pcp->high value per-cpu is 69584 pages or about
> >   270MB per CPU.  Scaled up by the 96 CPU threads, that's ~26GB of
> >   worst-case memory in the pcps per zone, or roughly 10% of the size of
> >   the zone.
> > 
> > This should not cause a problem as such although it could trigger reclaim
> > due to pages being stored on per-cpu lists for CPUs remote to a node. It
> > is not possible to treat cpuless nodes exactly the same as normal nodes
> > but the worst-case scenario can be mitigated by splitting pcp->high across
> > all online CPUs for cpuless memory nodes.
> > 
> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 

Thanks.

> Maybe we should even consider distinguishing high limits for local-to-cpu zones
> vs remote, for example for the local-to-cpu zones we would divide by the number
> of local cpus, for remote-to-cpu zones we would divide by all cpus.
> 
> Because we can expect cpus to allocate mostly from local zones, so leaving more
> pages on percpu for those zones can be beneficial.
> 

I did think about whether the ratios should be different but failed to
conclude that it was necessary or useful so I kept it simple.

> But as the motivation here was to reduce lock contention on freeing, that's less
> clear. We probably can't expect the cpu to be freeing mostly local pages (in
> case of e.g. a large process exiting), because no mechanism works towards that,
> or does it? In case of cpu freeing to remote zone, the lower high limit could hurt.
> 

This is the major issue. Even if an application was NUMA aware and heavily
threaded, the process exiting is potentially freeing remote memory and
there is nothing wrong about that. The remote memory will be partially
drained by pcp->high being reached and the remaining memory will be
cleaned up by vmstat. It's a similar problem if a process is truncating
a large file with page cache allocated on a remote node.

Hence I decided to do nothing fancy with the ratios until a practical
problem was identified that could be alleviated by adjusting pcp->high
based on whether the CPU is remote or local to memory.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction
  2021-05-28 11:59   ` Vlastimil Babka
@ 2021-05-28 12:53     ` Mel Gorman
  2021-05-28 14:38       ` Vlastimil Babka
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-28 12:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Michal Hocko, LKML, Linux-MM

On Fri, May 28, 2021 at 01:59:37PM +0200, Vlastimil Babka wrote:
> On 5/25/21 10:01 AM, Mel Gorman wrote:
> > This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
> > similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
> > both pcp->batch and pcp->high with the higher pcp->high potentially
> > reducing zone->lock contention. However, the higher pcp->batch value also
> > potentially increased allocation latency while the PCP was refilled.
> > This sysctl only adjusts pcp->high so that zone->lock contention is
> > potentially reduced but allocation latency during a PCP refill remains
> > the same.
> > 
> >   # grep -E "high:|batch" /proc/zoneinfo | tail -2
> >               high:  649
> >               batch: 63
> > 
> >   # sysctl vm.percpu_pagelist_high_fraction=8
> >   # grep -E "high:|batch" /proc/zoneinfo | tail -2
> >               high:  35071
> >               batch: 63
> > 
> >   # sysctl vm.percpu_pagelist_high_fraction=64
> >               high:  4383
> >               batch: 63
> > 
> >   # sysctl vm.percpu_pagelist_high_fraction=0
> >               high:  649
> >               batch: 63
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 

Thanks.

> Documentation nit below:
> 
> > @@ -789,6 +790,25 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
> >  why oom happens. You can get snapshot.
> >  
> >  
> > +percpu_pagelist_high_fraction
> > +=============================
> > +
> > +This is the fraction of pages in each zone that are allocated for each
> > +per cpu page list.  The min value for this is 8.  It means that we do
> > +not allow more than 1/8th of pages in each zone to be allocated in any
> > +single per_cpu_pagelist.
> 
> This, while technically correct (as an upper limit) is somewhat misleading as
> the limit for a single per_cpu_pagelist also considers the number of local cpus.
> 
> >  This entry only changes the value of hot per
> > +cpu pagelists. User can specify a number like 100 to allocate 1/100th
> > +of each zone to each per cpu page list.
> 
> This is worse. Anyone trying to reproduce this example on a system with multiple
> cpus per node and checking the result will be puzzled.
> So I think the part about number of local cpus should be mentioned to avoid
> confusion.
> 

Is this any better?

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e85c2f21d209..2da25735a629 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -793,15 +793,16 @@ why oom happens. You can get snapshot.
 percpu_pagelist_high_fraction
 =============================
 
-This is the fraction of pages in each zone that are allocated for each
-per cpu page list.  The min value for this is 8.  It means that we do
-not allow more than 1/8th of pages in each zone to be allocated in any
-single per_cpu_pagelist.  This entry only changes the value of hot per
-cpu pagelists. User can specify a number like 100 to allocate 1/100th
-of each zone to each per cpu page list.
-
-The batch value of each per cpu pagelist remains the same regardless of the
-value of the high fraction so allocation latencies are unaffected.
+This is the fraction of pages in each zone that are can be stored to
+per-cpu page lists. It is an upper boundary that is divided depending
+on the number of online CPUs. The min value for this is 8 which means
+that we do not allow more than 1/8th of pages in each zone to be stored
+on per-cpu page lists. This entry only changes the value of hot per-cpu
+page lists. A user can specify a number like 100 to allocate 1/100th of
+each zone between per-cpu lists.
+
+The batch value of each per-cpu page list remains the same regardless of
+the value of the high fraction so allocation latencies are unaffected.
 
 The initial value is zero. Kernel uses this value to set the high pcp->high
 mark based on the low watermark for the zone and the number of local
-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction
  2021-05-28 12:53     ` Mel Gorman
@ 2021-05-28 14:38       ` Vlastimil Babka
  0 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2021-05-28 14:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Michal Hocko, LKML, Linux-MM

On 5/28/21 2:53 PM, Mel Gorman wrote:
> On Fri, May 28, 2021 at 01:59:37PM +0200, Vlastimil Babka wrote:
>> On 5/25/21 10:01 AM, Mel Gorman wrote:
>> > This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
>> > similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
>> > both pcp->batch and pcp->high with the higher pcp->high potentially
>> > reducing zone->lock contention. However, the higher pcp->batch value also
>> > potentially increased allocation latency while the PCP was refilled.
>> > This sysctl only adjusts pcp->high so that zone->lock contention is
>> > potentially reduced but allocation latency during a PCP refill remains
>> > the same.
>> > 
>> >   # grep -E "high:|batch" /proc/zoneinfo | tail -2
>> >               high:  649
>> >               batch: 63
>> > 
>> >   # sysctl vm.percpu_pagelist_high_fraction=8
>> >   # grep -E "high:|batch" /proc/zoneinfo | tail -2
>> >               high:  35071
>> >               batch: 63
>> > 
>> >   # sysctl vm.percpu_pagelist_high_fraction=64
>> >               high:  4383
>> >               batch: 63
>> > 
>> >   # sysctl vm.percpu_pagelist_high_fraction=0
>> >               high:  649
>> >               batch: 63
>> > 
>> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
>> 
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>> 
> 
> Thanks.
> 
>> Documentation nit below:
>> 
>> > @@ -789,6 +790,25 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
>> >  why oom happens. You can get snapshot.
>> >  
>> >  
>> > +percpu_pagelist_high_fraction
>> > +=============================
>> > +
>> > +This is the fraction of pages in each zone that are allocated for each
>> > +per cpu page list.  The min value for this is 8.  It means that we do
>> > +not allow more than 1/8th of pages in each zone to be allocated in any
>> > +single per_cpu_pagelist.
>> 
>> This, while technically correct (as an upper limit) is somewhat misleading as
>> the limit for a single per_cpu_pagelist also considers the number of local cpus.
>> 
>> >  This entry only changes the value of hot per
>> > +cpu pagelists. User can specify a number like 100 to allocate 1/100th
>> > +of each zone to each per cpu page list.
>> 
>> This is worse. Anyone trying to reproduce this example on a system with multiple
>> cpus per node and checking the result will be puzzled.
>> So I think the part about number of local cpus should be mentioned to avoid
>> confusion.
>> 
> 
> Is this any better?

Ack, thanks

> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index e85c2f21d209..2da25735a629 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -793,15 +793,16 @@ why oom happens. You can get snapshot.
>  percpu_pagelist_high_fraction
>  =============================
>  
> -This is the fraction of pages in each zone that are allocated for each
> -per cpu page list.  The min value for this is 8.  It means that we do
> -not allow more than 1/8th of pages in each zone to be allocated in any
> -single per_cpu_pagelist.  This entry only changes the value of hot per
> -cpu pagelists. User can specify a number like 100 to allocate 1/100th
> -of each zone to each per cpu page list.
> -
> -The batch value of each per cpu pagelist remains the same regardless of the
> -value of the high fraction so allocation latencies are unaffected.
> +This is the fraction of pages in each zone that are can be stored to
> +per-cpu page lists. It is an upper boundary that is divided depending
> +on the number of online CPUs. The min value for this is 8 which means
> +that we do not allow more than 1/8th of pages in each zone to be stored
> +on per-cpu page lists. This entry only changes the value of hot per-cpu
> +page lists. A user can specify a number like 100 to allocate 1/100th of
> +each zone between per-cpu lists.
> +
> +The batch value of each per-cpu page list remains the same regardless of
> +the value of the high fraction so allocation latencies are unaffected.
>  
>  The initial value is zero. Kernel uses this value to set the high pcp->high
>  mark based on the low watermark for the zone and the number of local
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28  8:55   ` Mel Gorman
  2021-05-28  9:03     ` David Hildenbrand
  2021-05-28 12:12     ` Vlastimil Babka
@ 2021-05-28 14:39     ` Dave Hansen
  2021-05-28 15:18       ` Mel Gorman
  2 siblings, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2021-05-28 14:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM, Tang, Feng

On 5/28/21 1:55 AM, Mel Gorman wrote:
> -	 * onlined.
> -	 */
> -	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
> -	high = total_pages / nr_local_cpus;
> +	 * onlined. For memory nodes that have no CPUs, split pcp->high across
> +	 * all online CPUs to mitigate the risk that reclaim is triggered
> +	 * prematurely due to pages stored on pcp lists.
> +	 */
> +	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
> +	if (!nr_split_cpus)
> +		nr_split_cpus = num_online_cpus();
> +	high = total_pages / nr_split_cpus;

Updated version looks fine to me, thanks!

BTW, to do some of this testing, Feng was doing a plain old kernel
build.  On the one system where this got run, he noted a ~2% regression
in build times.  Nothing major, but you might want to be on the lookout
in case 0day or the other test harnesses find something similar once
this series gets to them.

Acked-by: Dave Hansen <dave.hansen@intel.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28 14:39     ` Dave Hansen
@ 2021-05-28 15:18       ` Mel Gorman
  2021-05-28 16:17         ` Dave Hansen
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-28 15:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM, Tang, Feng

On Fri, May 28, 2021 at 07:39:29AM -0700, Dave Hansen wrote:
> On 5/28/21 1:55 AM, Mel Gorman wrote:
> > -	 * onlined.
> > -	 */
> > -	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
> > -	high = total_pages / nr_local_cpus;
> > +	 * onlined. For memory nodes that have no CPUs, split pcp->high across
> > +	 * all online CPUs to mitigate the risk that reclaim is triggered
> > +	 * prematurely due to pages stored on pcp lists.
> > +	 */
> > +	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
> > +	if (!nr_split_cpus)
> > +		nr_split_cpus = num_online_cpus();
> > +	high = total_pages / nr_split_cpus;
> 
> Updated version looks fine to me, thanks!
> 
> BTW, to do some of this testing, Feng was doing a plain old kernel
> build.  On the one system where this got run, he noted a ~2% regression
> in build times.  Nothing major, but you might want to be on the lookout
> in case 0day or the other test harnesses find something similar once
> this series gets to them.
> 

What type of system was it?

I noticed minor differences for some thread counts on kernel compilations
but for CascadeLake at least, it was mostly neutral. Below is an old test
result based on a previous revision.

kernbench
                               5.13.0-rc2             5.13.0-rc2
                                  vanilla       mm-pcpburst-v2r3
Amean     elsp-2        469.22 (   0.00%)      470.03 *  -0.17%*
Amean     elsp-4        251.03 (   0.00%)      250.83 (   0.08%)
Amean     elsp-8        131.39 (   0.00%)      130.89 (   0.38%)
Amean     elsp-16        74.37 (   0.00%)       75.11 (  -0.99%)
Amean     elsp-32        42.10 (   0.00%)       42.20 (  -0.24%)
Amean     elsp-64        32.21 (   0.00%)       32.14 (   0.23%)
Amean     elsp-128       31.59 (   0.00%)       31.68 (  -0.27%)
Amean     elsp-160       31.76 (   0.00%)       31.69 (   0.21%)

A Haswell machine showed the worst results for kernbench

Amean     elsp-2        459.99 (   0.00%)      465.27 *  -1.15%*
Amean     elsp-4        250.76 (   0.00%)      253.17 *  -0.96%*
Amean     elsp-8        141.28 (   0.00%)      141.78 (  -0.36%)
Amean     elsp-16        77.71 (   0.00%)       77.88 (  -0.22%)
Amean     elsp-32        44.09 (   0.00%)       44.40 (  -0.69%)
Amean     elsp-64        33.79 (   0.00%)       33.46 (   0.96%)
Amean     elsp-128       33.14 (   0.00%)       33.26 (  -0.37%)
Amean     elsp-160       33.26 (   0.00%)       33.36 *  -0.30%*

The series with review feedback and dealing with cpuless nodes is queued
and should complete over the weekend.

> Acked-by: Dave Hansen <dave.hansen@intel.com>

Thanks!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28 15:18       ` Mel Gorman
@ 2021-05-28 16:17         ` Dave Hansen
  2021-05-31 12:00           ` Feng Tang
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2021-05-28 16:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Hillf Danton, Dave Hansen, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM, Tang, Feng

On 5/28/21 8:18 AM, Mel Gorman wrote:
>> BTW, to do some of this testing, Feng was doing a plain old kernel
>> build.  On the one system where this got run, he noted a ~2% regression
>> in build times.  Nothing major, but you might want to be on the lookout
>> in case 0day or the other test harnesses find something similar once
>> this series gets to them.
>>
> What type of system was it?
> 
> I noticed minor differences for some thread counts on kernel compilations
> but for CascadeLake at least, it was mostly neutral. Below is an old test
> result based on a previous revision.

It's a Cascade Lake as well.  But, I never trust hardware at a hardware
company.  These could be preproduction CPUs or BIOS or both, or have
some bonkers configuration knob flipped.

It's also got a bunch of PMEM plugged and onlined, including the
_possibility_ of kernel data structures ended up on PMEM.  They *mostly*
don't end up there, but it does happen on occasion.

Anyway, I'll see if we can do some more runs with your latest version.
It looks like it's been picked up for -mm so 0day should be pounding on
it soon enough.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
  2021-05-28 16:17         ` Dave Hansen
@ 2021-05-31 12:00           ` Feng Tang
  0 siblings, 0 replies; 34+ messages in thread
From: Feng Tang @ 2021-05-31 12:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Andrew Morton, Hillf Danton, Dave Hansen,
	Vlastimil Babka, Michal Hocko, LKML, Linux-MM

On Fri, May 28, 2021 at 09:17:41AM -0700, Dave Hansen wrote:
> On 5/28/21 8:18 AM, Mel Gorman wrote:
> >> BTW, to do some of this testing, Feng was doing a plain old kernel
> >> build.  On the one system where this got run, he noted a ~2% regression
> >> in build times.  Nothing major, but you might want to be on the lookout
> >> in case 0day or the other test harnesses find something similar once
> >> this series gets to them.
> >>
> > What type of system was it?
> > 
> > I noticed minor differences for some thread counts on kernel compilations
> > but for CascadeLake at least, it was mostly neutral. Below is an old test
> > result based on a previous revision.
> 
> It's a Cascade Lake as well.  But, I never trust hardware at a hardware
> company.  These could be preproduction CPUs or BIOS or both, or have
> some bonkers configuration knob flipped.
> 
> It's also got a bunch of PMEM plugged and onlined, including the
> _possibility_ of kernel data structures ended up on PMEM.  They *mostly*
> don't end up there, but it does happen on occasion.
> 
> Anyway, I'll see if we can do some more runs with your latest version.
> It looks like it's been picked up for -mm so 0day should be pounding on
> it soon enough.

Yes, usually 0day has more benchmark test covering -mm tree.

As for the kbuild test run for v2, after more runs, the previous 2%
longer kbuild time turns to 1% shorter time, seems to be in normal
deviation range.

Also I checked Mel's v3 branch which has the fix for cpuless node,
the pcp 'high' looks normal on PMEM node:

  pagesets
    cpu: 0
              count: 67
              high:  724
              batch: 63
  vm stats threshold: 125

Thanks,
Feng




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch
  2021-05-21 21:52   ` Dave Hansen
@ 2021-05-24  8:32     ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2021-05-24  8:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux-MM, Dave Hansen, Matthew Wilcox, Vlastimil Babka,
	Michal Hocko, Nicholas Piggin, LKML

On Fri, May 21, 2021 at 02:52:39PM -0700, Dave Hansen wrote:
> On 5/21/21 3:28 AM, Mel Gorman wrote:
> > Note that in this patch the pcp->high values are adjusted after memory
> > hotplug events, min_free_kbytes adjustments and watermark scale factor
> > adjustments but not CPU hotplug events.
> 
> Not that it was a long wait to figure it out, but I'd probably say:
> 
> 	"CPU hotplug events are handled later in the series".
> 
> instead of just saying they're not handled.
> 
> > Before grep -E "high:|batch" /proc/zoneinfo | tail -2
> >               high:  378
> >               batch: 63
> > 
> > After grep -E "high:|batch" /proc/zoneinfo | tail -2
> >               high:  649
> >               batch: 63
> 
> You noted the relationship between pcp->high and zone lock contention.
> Larger ->high values mean less contention.  It's probably also worth
> noting the trend of having more logical CPUs per NUMA node.
> 

It's noted in the leader with "neither takes the zone size into account
or the number of CPUs local to a zone". Your point is valid but I'm not
doing much about the number of CPUs sharing a lock.

> I have the feeling when this was put in place it wasn't uncommon to have
> somewhere between 1 and 8 CPUs in a node pounding on a zone.
> 

True.

> Today, having ~60 is common.  I've occasionally resorted to recommending
> that folks enable hardware features like Sub-NUMA-Clustering [1] since
> it increases the number of zones and decreases the number of CPUs
> pounding on each zone lock.
> 

Enabling SNC to reduce concention is very unfortunate. It potentially
causes page age inversion issue as it's similar to specifying numa=fake=N

> 1.
> https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-processor-scalable-family-technical-overview.html
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a48f305f0381..bf5cdc466e6c 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void)
> >  	/* Block until all are initialised */
> >  	wait_for_completion(&pgdat_init_all_done_comp);
> >  
> > -	/*
> > -	 * The number of managed pages has changed due to the initialisation
> > -	 * so the pcpu batch and high limits needs to be updated or the limits
> > -	 * will be artificially small.
> > -	 */
> > -	for_each_populated_zone(zone)
> > -		zone_pcp_update(zone);
> > -
> >  	/*
> >  	 * We initialized the rest of the deferred pages.  Permanently disable
> >  	 * on-demand struct page initialization.
> > @@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone)
> >  	int batch;
> >  
> >  	/*
> > -	 * The per-cpu-pages pools are set to around 1000th of the
> > -	 * size of the zone.
> > +	 * The number of pages to batch allocate is either 0.1%
> 
> Probably worth making that "~0.1%" just in case someone goes looking for
> the /1000 and can't find it.
> 

Done

> > +	 * of the zone or 1MB, whichever is smaller. The batch
> > +	 * size is striking a balance between allocation latency
> > +	 * and zone lock contention.
> >  	 */
> > -	batch = zone_managed_pages(zone) / 1024;
> > -	/* But no more than a meg. */
> > -	if (batch * PAGE_SIZE > 1024 * 1024)
> > -		batch = (1024 * 1024) / PAGE_SIZE;
> > +	batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
> >  	batch /= 4;		/* We effectively *= 4 below */
> >  	if (batch < 1)
> >  		batch = 1;
> > @@ -6637,6 +6628,27 @@ static int zone_batchsize(struct zone *zone)
> >  #endif
> >  }
> >  
> > +static int zone_highsize(struct zone *zone)
> > +{
> > +#ifdef CONFIG_MMU
> > +	int high;
> > +	int nr_local_cpus;
> > +
> > +	/*
> > +	 * The high value of the pcp is based on the zone low watermark
> > +	 * when reclaim is potentially active spread across the online
> > +	 * CPUs local to a zone. Note that early in boot that CPUs may
> > +	 * not be online yet.
> > +	 */
> 
> FWIW, I like the way the changelog talked about this a bit better, with
> the goal of avoiding background reclaim even in the face of a bunch of
> full pcp's.
> 
> > +	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));

Done.

> > +	high = low_wmark_pages(zone) / nr_local_cpus;
> 
> I'm a little concerned that this might get out of hand on really big
> nodes with no CPUs.  For persistent memory (which we *do* toss into the
> page allocator for volatile use), we can have multi-terabyte zones with
> no CPUs in the node.
> 

It should not get out of hand given that it's based on the low watermark,
at least for local CPus.

> Also, while the CPUs which are on the node are the ones *most* likely to
> be hitting the ->high limit, we do *keep* a pcp for each possible CPU.
> So, the amount of memory which can actually be sequestered is
> num_online_cpus()*high.  Right?
> 

Potentially yes for short durations but remote CPUs are drained every
few seconds by refresh_cpu_vm_stats so it's a transient problem.

> *That* might really get out of hand if we have nr_local_cpus=1.
> 
> We might want some overall cap on 'high', or even to scale it
> differently for the zone-local cpus' pcps versus remote.

I'm reluctant to prematurely set this because I don't have a test case
and machine where this has been demonstrated to be a problem but I would
not be opposed to a patch added on top which demonstrated a reasonable
case where too many pages are pinned on remote CPUs for too long. I would
imagine this involves a machine with large amounts of persistent memory
onlined as a memory-like device and using an interleave policy showing
that reclaim on the persistent memory node is triggered prematurely.

My initial thinking is that remote CPUs should simply clamp the remote
cpus on a static value such as pcp->batch << 1 but I would prefer this
was based on a real test.  I expect the check for a CPU being local to
a zone would be done in __zone_set_pageset_high_and_batch.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch
  2021-05-21 10:28 ` [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch Mel Gorman
@ 2021-05-21 21:52   ` Dave Hansen
  2021-05-24  8:32     ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2021-05-21 21:52 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Dave Hansen, Matthew Wilcox, Vlastimil Babka, Michal Hocko,
	Nicholas Piggin, LKML

On 5/21/21 3:28 AM, Mel Gorman wrote:
> Note that in this patch the pcp->high values are adjusted after memory
> hotplug events, min_free_kbytes adjustments and watermark scale factor
> adjustments but not CPU hotplug events.

Not that it was a long wait to figure it out, but I'd probably say:

	"CPU hotplug events are handled later in the series".

instead of just saying they're not handled.

> Before grep -E "high:|batch" /proc/zoneinfo | tail -2
>               high:  378
>               batch: 63
> 
> After grep -E "high:|batch" /proc/zoneinfo | tail -2
>               high:  649
>               batch: 63

You noted the relationship between pcp->high and zone lock contention.
Larger ->high values mean less contention.  It's probably also worth
noting the trend of having more logical CPUs per NUMA node.

I have the feeling when this was put in place it wasn't uncommon to have
somewhere between 1 and 8 CPUs in a node pounding on a zone.

Today, having ~60 is common.  I've occasionally resorted to recommending
that folks enable hardware features like Sub-NUMA-Clustering [1] since
it increases the number of zones and decreases the number of CPUs
pounding on each zone lock.

1.
https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-processor-scalable-family-technical-overview.html

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a48f305f0381..bf5cdc466e6c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void)
>  	/* Block until all are initialised */
>  	wait_for_completion(&pgdat_init_all_done_comp);
>  
> -	/*
> -	 * The number of managed pages has changed due to the initialisation
> -	 * so the pcpu batch and high limits needs to be updated or the limits
> -	 * will be artificially small.
> -	 */
> -	for_each_populated_zone(zone)
> -		zone_pcp_update(zone);
> -
>  	/*
>  	 * We initialized the rest of the deferred pages.  Permanently disable
>  	 * on-demand struct page initialization.
> @@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone)
>  	int batch;
>  
>  	/*
> -	 * The per-cpu-pages pools are set to around 1000th of the
> -	 * size of the zone.
> +	 * The number of pages to batch allocate is either 0.1%

Probably worth making that "~0.1%" just in case someone goes looking for
the /1000 and can't find it.

> +	 * of the zone or 1MB, whichever is smaller. The batch
> +	 * size is striking a balance between allocation latency
> +	 * and zone lock contention.
>  	 */
> -	batch = zone_managed_pages(zone) / 1024;
> -	/* But no more than a meg. */
> -	if (batch * PAGE_SIZE > 1024 * 1024)
> -		batch = (1024 * 1024) / PAGE_SIZE;
> +	batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
>  	batch /= 4;		/* We effectively *= 4 below */
>  	if (batch < 1)
>  		batch = 1;
> @@ -6637,6 +6628,27 @@ static int zone_batchsize(struct zone *zone)
>  #endif
>  }
>  
> +static int zone_highsize(struct zone *zone)
> +{
> +#ifdef CONFIG_MMU
> +	int high;
> +	int nr_local_cpus;
> +
> +	/*
> +	 * The high value of the pcp is based on the zone low watermark
> +	 * when reclaim is potentially active spread across the online
> +	 * CPUs local to a zone. Note that early in boot that CPUs may
> +	 * not be online yet.
> +	 */

FWIW, I like the way the changelog talked about this a bit better, with
the goal of avoiding background reclaim even in the face of a bunch of
full pcp's.

> +	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> +	high = low_wmark_pages(zone) / nr_local_cpus;

I'm a little concerned that this might get out of hand on really big
nodes with no CPUs.  For persistent memory (which we *do* toss into the
page allocator for volatile use), we can have multi-terabyte zones with
no CPUs in the node.

Also, while the CPUs which are on the node are the ones *most* likely to
be hitting the ->high limit, we do *keep* a pcp for each possible CPU.
So, the amount of memory which can actually be sequestered is
num_online_cpus()*high.  Right?

*That* might really get out of hand if we have nr_local_cpus=1.

We might want some overall cap on 'high', or even to scale it
differently for the zone-local cpus' pcps versus remote.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch
  2021-05-21 10:28 [RFC PATCH 0/6] " Mel Gorman
@ 2021-05-21 10:28 ` Mel Gorman
  2021-05-21 21:52   ` Dave Hansen
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-05-21 10:28 UTC (permalink / raw)
  To: Linux-MM
  Cc: Dave Hansen, Matthew Wilcox, Vlastimil Babka, Michal Hocko,
	Nicholas Piggin, LKML, Mel Gorman

The pcp high watermark is based on the batch size but there is no
relationship between them other than it is convenient to use early in
boot.

This patch takes the first step and bases pcp->high on the zone low
watermark split across the number of CPUs local to a zone while the batch
size remains the same to avoid increasing allocation latencies. The intent
behind the default pcp->high is "set the number of PCP pages such that
if they are all full that background reclaim is not started prematurely".

Note that in this patch the pcp->high values are adjusted after memory
hotplug events, min_free_kbytes adjustments and watermark scale factor
adjustments but not CPU hotplug events.

On a test KVM instance;

Before grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  378
              batch: 63

After grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 18 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a48f305f0381..bf5cdc466e6c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void)
 	/* Block until all are initialised */
 	wait_for_completion(&pgdat_init_all_done_comp);
 
-	/*
-	 * The number of managed pages has changed due to the initialisation
-	 * so the pcpu batch and high limits needs to be updated or the limits
-	 * will be artificially small.
-	 */
-	for_each_populated_zone(zone)
-		zone_pcp_update(zone);
-
 	/*
 	 * We initialized the rest of the deferred pages.  Permanently disable
 	 * on-demand struct page initialization.
@@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone)
 	int batch;
 
 	/*
-	 * The per-cpu-pages pools are set to around 1000th of the
-	 * size of the zone.
+	 * The number of pages to batch allocate is either 0.1%
+	 * of the zone or 1MB, whichever is smaller. The batch
+	 * size is striking a balance between allocation latency
+	 * and zone lock contention.
 	 */
-	batch = zone_managed_pages(zone) / 1024;
-	/* But no more than a meg. */
-	if (batch * PAGE_SIZE > 1024 * 1024)
-		batch = (1024 * 1024) / PAGE_SIZE;
+	batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
 	batch /= 4;		/* We effectively *= 4 below */
 	if (batch < 1)
 		batch = 1;
@@ -6637,6 +6628,27 @@ static int zone_batchsize(struct zone *zone)
 #endif
 }
 
+static int zone_highsize(struct zone *zone)
+{
+#ifdef CONFIG_MMU
+	int high;
+	int nr_local_cpus;
+
+	/*
+	 * The high value of the pcp is based on the zone low watermark
+	 * when reclaim is potentially active spread across the online
+	 * CPUs local to a zone. Note that early in boot that CPUs may
+	 * not be online yet.
+	 */
+	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
+	high = low_wmark_pages(zone) / nr_local_cpus;
+
+	return high;
+#else
+	return 0;
+#endif
+}
+
 /*
  * pcp->high and pcp->batch values are related and generally batch is lower
  * than high. They are also related to pcp->count such that count is lower
@@ -6698,11 +6710,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
  */
 static void zone_set_pageset_high_and_batch(struct zone *zone)
 {
-	unsigned long new_high, new_batch;
+	int new_high, new_batch;
 
-	new_batch = zone_batchsize(zone);
-	new_high = 6 * new_batch;
-	new_batch = max(1UL, 1 * new_batch);
+	new_batch = max(1, zone_batchsize(zone));
+	new_high = zone_highsize(zone);
 
 	if (zone->pageset_high == new_high &&
 	    zone->pageset_batch == new_batch)
@@ -8170,6 +8181,12 @@ static void __setup_per_zone_wmarks(void)
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
+		/*
+		 * The watermark size have changed so update the pcpu batch
+		 * and high limits or the limits may be inappropriate.
+		 */
+		zone_set_pageset_high_and_batch(zone);
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-05-31 12:00 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-25  8:01 [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Mel Gorman
2021-05-25  8:01 ` [PATCH 1/6] mm/page_alloc: Delete vm.percpu_pagelist_fraction Mel Gorman
2021-05-26 17:41   ` Vlastimil Babka
2021-05-25  8:01 ` [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch Mel Gorman
2021-05-26 18:14   ` Vlastimil Babka
2021-05-27 10:52     ` Mel Gorman
2021-05-28 10:27       ` Vlastimil Babka
2021-05-25  8:01 ` [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events Mel Gorman
2021-05-28 11:08   ` Vlastimil Babka
2021-05-25  8:01 ` [PATCH 4/6] mm/page_alloc: Scale the number of pages that are batch freed Mel Gorman
2021-05-28 11:19   ` Vlastimil Babka
2021-05-25  8:01 ` [PATCH 5/6] mm/page_alloc: Limit the number of pages on PCP lists when reclaim is active Mel Gorman
2021-05-28 11:43   ` Vlastimil Babka
2021-05-25  8:01 ` [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction Mel Gorman
2021-05-28 11:59   ` Vlastimil Babka
2021-05-28 12:53     ` Mel Gorman
2021-05-28 14:38       ` Vlastimil Babka
2021-05-27 19:36 ` [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs Dave Hansen
2021-05-28  8:55   ` Mel Gorman
2021-05-28  9:03     ` David Hildenbrand
2021-05-28  9:08       ` David Hildenbrand
2021-05-28  9:49         ` Mel Gorman
2021-05-28  9:52           ` David Hildenbrand
2021-05-28 10:09             ` Mel Gorman
2021-05-28 10:21               ` David Hildenbrand
2021-05-28 12:12     ` Vlastimil Babka
2021-05-28 12:37       ` Mel Gorman
2021-05-28 14:39     ` Dave Hansen
2021-05-28 15:18       ` Mel Gorman
2021-05-28 16:17         ` Dave Hansen
2021-05-31 12:00           ` Feng Tang
  -- strict thread matches above, loose matches on Subject: below --
2021-05-21 10:28 [RFC PATCH 0/6] " Mel Gorman
2021-05-21 10:28 ` [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch Mel Gorman
2021-05-21 21:52   ` Dave Hansen
2021-05-24  8:32     ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).