From: Mel Gorman <mgorman@techsingularity.net>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Hillf Danton <hdanton@sina.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
Vlastimil Babka <vbabka@suse.cz>,
Michal Hocko <mhocko@kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>, "Tang, Feng" <feng.tang@intel.com>
Subject: Re: [PATCH 0/6 v2] Calculate pcp->high based on zone sizes and active CPUs
Date: Fri, 28 May 2021 09:55:45 +0100
Message-ID: <20210528085545.GJ30378@techsingularity.net>
In-Reply-To: <7177f59b-dc05-daff-7dc6-5815b539a790@intel.com>
On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
> Hi Mel,
>
> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
> ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
> use" mode and being managed via the buddy just like the normal RAM.
>
> The PMEM zones are big ones:
>
> present 65011712 = 248 G
> high 134595 = 525 M
>
> The PMEM nodes, of course, don't have any CPUs in them.
>
> With your series, the pcp->high value per-cpu is 69584 pages or about
> 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
> worst-case memory in the pcps per zone, or roughly 10% of the size of
> the zone.
>
> I did see quite a few pcp->counts above 60,000, so it's definitely
> possible in practice to see the pcps filled up. This was not observed
> to cause any actual problems in practice. But, it's still a bit worrisome.
>
Ok, it does have the potential to trigger early reclaim because pages
are stored on the PCP lists of CPUs remote to the node. The problem would
be transient because vmstat would drain those pages over time, but still,
how about this patch on top of the series?
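
For reference, the worst case above works out as follows, assuming 4K
pages: 69584 pages * 4096 bytes is ~272M per CPU, 272M * 96 CPUs is
~26G, and 26G of a 248G zone is the ~10% figure in the report.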
--8<--
mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes

Dave Hansen reported the following about Feng Tang's tests on a machine
with persistent memory onlined as a DRAM-like device:
    Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
    ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
    use" mode and being managed via the buddy just like the normal RAM.

    The PMEM zones are big ones:

        present 65011712 = 248 G
        high      134595 = 525 M

    The PMEM nodes, of course, don't have any CPUs in them.

    With your series, the pcp->high value per-cpu is 69584 pages or about
    270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
    worst-case memory in the pcps per zone, or roughly 10% of the size of
    the zone.

This should not cause a problem as such, although it could trigger reclaim
due to pages being stored on per-cpu lists for CPUs remote to a node. It
is not possible to treat cpuless nodes exactly the same as normal nodes,
but the worst-case scenario can be mitigated by splitting pcp->high across
all online CPUs for cpuless memory nodes.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
mm/page_alloc.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d708aa14f4ef..af566e97a0f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6687,7 +6687,7 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 {
 #ifdef CONFIG_MMU
 	int high;
-	int nr_local_cpus;
+	int nr_split_cpus;
 	unsigned long total_pages;
 
 	if (!percpu_pagelist_high_fraction) {
@@ -6710,10 +6710,14 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 	 * Split the high value across all online CPUs local to the zone. Note
 	 * that early in boot that CPUs may not be online yet and that during
 	 * CPU hotplug that the cpumask is not yet updated when a CPU is being
-	 * onlined.
-	 */
-	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
-	high = total_pages / nr_local_cpus;
+	 * onlined. For memory nodes that have no CPUs, split pcp->high across
+	 * all online CPUs to mitigate the risk that reclaim is triggered
+	 * prematurely due to pages stored on pcp lists.
+	 */
+	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
+	if (!nr_split_cpus)
+		nr_split_cpus = num_online_cpus();
+	high = total_pages / nr_split_cpus;
 
 	/*
 	 * Ensure high is at least batch*4. The multiple is based on the
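
To make the before/after concrete, here is a minimal userspace sketch of
the split (not kernel code: high_old() and high_new() are illustrative
stand-ins for the two versions of zone_highsize(), and the cpu_online
hotplug adjustment and the batch*4 floor are omitted):

	#include <stdio.h>

	/* Old behaviour: the divisor was clamped to at least 1, so on a
	 * cpuless node every CPU received the full, undivided value.
	 */
	static long high_old(long total_pages, int node_cpus)
	{
		int nr_local_cpus = node_cpus > 0 ? node_cpus : 1;

		return total_pages / nr_local_cpus;
	}

	/* New behaviour: a cpuless node splits across all online CPUs. */
	static long high_new(long total_pages, int node_cpus, int online_cpus)
	{
		int nr_split_cpus = node_cpus ? node_cpus : online_cpus;

		return total_pages / nr_split_cpus;
	}

	int main(void)
	{
		/* 69584 is the per-CPU pcp->high Dave reported; with the
		 * old divisor of 1 it is also the undivided total, so it
		 * serves as the input here.
		 */
		long total_pages = 69584;

		printf("cpuless node, old: %ld pages per CPU\n",
		       high_old(total_pages, 0));
		printf("cpuless node, new: %ld pages per CPU\n",
		       high_new(total_pages, 0, 96));
		return 0;
	}

With 96 online CPUs, the cpuless node's pcp->high drops from 69584 pages
per CPU to roughly 724, which the real function would then clamp against
the batch*4 floor.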