From: Dave Hansen <dave.hansen@intel.com>
To: Mel Gorman <mgorman@techsingularity.net>, Linux-MM <linux-mm@kvack.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
Matthew Wilcox <willy@infradead.org>,
Vlastimil Babka <vbabka@suse.cz>,
Michal Hocko <mhocko@kernel.org>,
Nicholas Piggin <npiggin@gmail.com>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch
Date: Fri, 21 May 2021 14:52:39 -0700
Message-ID: <83ddf311-cdfb-34cf-d08f-70590420beff@intel.com>
In-Reply-To: <20210521102826.28552-3-mgorman@techsingularity.net>
On 5/21/21 3:28 AM, Mel Gorman wrote:
> Note that in this patch the pcp->high values are adjusted after memory
> hotplug events, min_free_kbytes adjustments and watermark scale factor
> adjustments but not CPU hotplug events.
Not that it was a long wait to figure it out, but I'd probably say:

	"CPU hotplug events are handled later in the series".

instead of just saying they're not handled.
> Before grep -E "high:|batch" /proc/zoneinfo | tail -2
> high: 378
> batch: 63
>
> After grep -E "high:|batch" /proc/zoneinfo | tail -2
> high: 649
> batch: 63
You noted the relationship between pcp->high and zone lock contention.
Larger ->high values mean less contention. It's probably also worth
noting the trend of having more logical CPUs per NUMA node.

I have the feeling when this was put in place it wasn't uncommon to have
somewhere between 1 and 8 CPUs in a node pounding on a zone.

Today, having ~60 is common. I've occasionally resorted to recommending
that folks enable hardware features like Sub-NUMA-Clustering [1] since
it increases the number of zones and decreases the number of CPUs
pounding on each zone lock.
[1] https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-processor-scalable-family-technical-overview.html
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a48f305f0381..bf5cdc466e6c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void)
>  	/* Block until all are initialised */
>  	wait_for_completion(&pgdat_init_all_done_comp);
>
> -	/*
> -	 * The number of managed pages has changed due to the initialisation
> -	 * so the pcpu batch and high limits needs to be updated or the limits
> -	 * will be artificially small.
> -	 */
> -	for_each_populated_zone(zone)
> -		zone_pcp_update(zone);
> -
>  	/*
>  	 * We initialized the rest of the deferred pages. Permanently disable
>  	 * on-demand struct page initialization.
> @@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone)
>  	int batch;
>
>  	/*
> -	 * The per-cpu-pages pools are set to around 1000th of the
> -	 * size of the zone.
> +	 * The number of pages to batch allocate is either 0.1%
Probably worth making that "~0.1%" just in case someone goes looking for
the /1000 and can't find it.
> +	 * of the zone or 1MB, whichever is smaller. The batch
> +	 * size is striking a balance between allocation latency
> +	 * and zone lock contention.
>  	 */
> -	batch = zone_managed_pages(zone) / 1024;
> -	/* But no more than a meg. */
> -	if (batch * PAGE_SIZE > 1024 * 1024)
> -		batch = (1024 * 1024) / PAGE_SIZE;
> +	batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
>  	batch /= 4;		/* We effectively *= 4 below */
>  	if (batch < 1)
>  		batch = 1;
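
For anyone checking the "~0.1%" arithmetic: the shift by 10 divides by 1024, so the fraction is 1/1024, or about 0.098% of the zone. Below is a minimal user-space sketch (not kernel code) of the portion of zone_batchsize() visible in this hunk, before and after the change. The names batch_old(), batch_new() and SKETCH_PAGE_SIZE are made up for illustration, and 4 KiB pages are an assumption.

```c
/* Illustrative only: mirrors the quoted hunk, not the full kernel function. */
#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */
#define SKETCH_ONE_MEG   (1024UL * 1024UL)

/* Pre-patch logic: 1/1024th of the zone, then capped at 1MB worth of pages. */
static unsigned long batch_old(unsigned long managed_pages)
{
	unsigned long batch = managed_pages / 1024;

	if (batch * SKETCH_PAGE_SIZE > SKETCH_ONE_MEG)
		batch = SKETCH_ONE_MEG / SKETCH_PAGE_SIZE;
	batch /= 4;
	if (batch < 1)
		batch = 1;
	return batch;
}

/* Post-patch logic: the same cap, written as a single min(). */
static unsigned long batch_new(unsigned long managed_pages)
{
	unsigned long cap = SKETCH_ONE_MEG / SKETCH_PAGE_SIZE;
	unsigned long batch = managed_pages >> 10;

	if (batch > cap)
		batch = cap;
	batch /= 4;
	if (batch < 1)
		batch = 1;
	return batch;
}
```

Under these assumptions the two versions agree, since the 1MB cap (256 pages) is a whole number of pages: any zone larger than 1GB hits the cap, and only the later steps of the function (not shown in this hunk) differ from this intermediate value.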
> @@ -6637,6 +6628,27 @@ static int zone_batchsize(struct zone *zone)
> #endif
> }
>
> +static int zone_highsize(struct zone *zone)
> +{
> +#ifdef CONFIG_MMU
> +	int high;
> +	int nr_local_cpus;
> +
> +	/*
> +	 * The high value of the pcp is based on the zone low watermark
> +	 * when reclaim is potentially active spread across the online
> +	 * CPUs local to a zone. Note that early in boot that CPUs may
> +	 * not be online yet.
> +	 */
FWIW, I like the way the changelog talked about this a bit better, with
the goal of avoiding background reclaim even in the face of a bunch of
full pcp's.
> +	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> +	high = low_wmark_pages(zone) / nr_local_cpus;
I'm a little concerned that this might get out of hand on really big
nodes with no CPUs. For persistent memory (which we *do* toss into the
page allocator for volatile use), we can have multi-terabyte zones with
no CPUs in the node.

Also, while the CPUs which are on the node are the ones *most* likely to
be hitting the ->high limit, we do *keep* a pcp for each possible CPU.
So, the amount of memory which can actually be sequestered is
num_online_cpus()*high. Right?

*That* might really get out of hand if we have nr_local_cpus=1.

We might want some overall cap on 'high', or even to scale it
differently for the zone-local cpus' pcps versus remote.
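
To put rough numbers on that concern, here is a small user-space sketch of the arithmetic. It assumes the worst case raised above, where every online CPU's pcp for the zone fills to 'high'; whether that happens in practice is exactly the open question. The helper names and the inputs in the note below are hypothetical.

```c
#include <stdint.h>

/*
 * Mirrors the quoted calculation: the zone low watermark divided by the
 * number of CPUs local to the zone's node, treating 0 local CPUs as 1
 * (the max(1U, ...) in the patch).
 */
static uint64_t sketch_pcp_high(uint64_t low_wmark_pages,
				unsigned int nr_local_cpus)
{
	if (nr_local_cpus < 1)
		nr_local_cpus = 1;
	return low_wmark_pages / nr_local_cpus;
}

/*
 * Assumed worst case: every online CPU, local or remote, holds 'high'
 * pages of this zone on its pcp list at the same time.
 */
static uint64_t sketch_worst_case_pages(uint64_t high,
					unsigned int nr_online_cpus)
{
	return high * nr_online_cpus;
}
```

With a made-up low watermark of 1,000,000 pages, a CPU-less node and 128 online CPUs, 'high' stays at 1,000,000 and the worst case is 128,000,000 pages, i.e. 128 times the low watermark. With 64 local CPUs instead, 'high' drops to 15,625 and the same 128 CPUs could hold at most 2,000,000 pages.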