* [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit
@ 2019-10-18 10:56 Mel Gorman
  2019-10-18 10:56 ` [PATCH 1/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Mel Gorman @ 2019-10-18 10:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List, Mel Gorman

A private report stated that system CPU usage was excessive on an AMD
EPYC 2 machine while building kernels with much longer build times than
expected. The issue is partially explained by high zone lock contention
due to the per-cpu page allocator batch and high limits being calculated
incorrectly. This series addresses a large chunk of the problem. Patch 1
is mostly cosmetic but prepares for patch 2 which is the real fix. Patch
3 is definitely cosmetic but was noticed while implementing the fix. Proper
details are in the changelog for patch 2.

 include/linux/mm.h |  3 ---
 mm/internal.h      |  3 +++
 mm/page_alloc.c    | 33 ++++++++++++++++++++-------------
 3 files changed, 23 insertions(+), 16 deletions(-)

-- 
2.16.4



* [PATCH 1/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler
  2019-10-18 10:56 [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit Mel Gorman
@ 2019-10-18 10:56 ` Mel Gorman
  2019-10-18 11:57   ` Matt Fleming
  2019-10-18 12:51   ` Michal Hocko
  2019-10-18 10:56 ` [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 16+ messages in thread
From: Mel Gorman @ 2019-10-18 10:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List, Mel Gorman

Both the percpu_pagelist_fraction sysctl handler and memory hotplug
have a common requirement of updating the pcpu page allocation batch
and high values. Split the relevant helper to share common code.

No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0b2e0306720..cafe568d36f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7983,6 +7983,15 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+static void __zone_pcp_update(struct zone *zone)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu)
+		pageset_set_high_and_batch(zone,
+				per_cpu_ptr(zone->pageset, cpu));
+}
+
 /*
  * percpu_pagelist_fraction - changes the pcp->high for each zone on each
  * cpu.  It is the fraction of total pages in each zone that a hot per cpu
@@ -8014,13 +8023,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 	if (percpu_pagelist_fraction == old_percpu_pagelist_fraction)
 		goto out;
 
-	for_each_populated_zone(zone) {
-		unsigned int cpu;
-
-		for_each_possible_cpu(cpu)
-			pageset_set_high_and_batch(zone,
-					per_cpu_ptr(zone->pageset, cpu));
-	}
+	for_each_populated_zone(zone)
+		__zone_pcp_update(zone);
 out:
 	mutex_unlock(&pcp_batch_high_lock);
 	return ret;
@@ -8519,11 +8523,8 @@ void free_contig_range(unsigned long pfn, unsigned int nr_pages)
  */
 void __meminit zone_pcp_update(struct zone *zone)
 {
-	unsigned cpu;
 	mutex_lock(&pcp_batch_high_lock);
-	for_each_possible_cpu(cpu)
-		pageset_set_high_and_batch(zone,
-				per_cpu_ptr(zone->pageset, cpu));
+	__zone_pcp_update(zone);
 	mutex_unlock(&pcp_batch_high_lock);
 }
 #endif
-- 
2.16.4



* [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-18 10:56 [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit Mel Gorman
  2019-10-18 10:56 ` [PATCH 1/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler Mel Gorman
@ 2019-10-18 10:56 ` Mel Gorman
  2019-10-18 11:57   ` Matt Fleming
  2019-10-18 13:01   ` Michal Hocko
  2019-10-18 10:56 ` [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm Mel Gorman
  2019-10-18 11:58 ` [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit Matt Fleming
  3 siblings, 2 replies; 16+ messages in thread
From: Mel Gorman @ 2019-10-18 10:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List, Mel Gorman

Deferred memory initialisation updates zone->managed_pages during
the initialisation phase but before that finishes, the per-cpu page
allocator (pcpu) calculates the number of pages allocated/freed in
batches as well as the maximum number of pages allowed on a per-cpu list.
As zone->managed_pages is not up to date yet, the pcpu initialisation
calculates inappropriately low batch and high values.
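
To make the sizing concrete: batch is derived from the zone's managed
pages (roughly 1/1024th of the zone, capped at one megabyte worth of
pages and then quartered), and high is a small multiple of batch. The
standalone model below is only an illustrative sketch of that shape,
not the kernel code; model_batch() is a made-up name, 4K pages are
assumed, and the final clamp to one below a power of two is omitted,
which is why it prints 64/384 where a real large zone gets 63/378.

#include <stdio.h>

#define PAGE_SIZE 4096UL	/* assumption: 4K pages */

/* Sketch of the default pcp sizing, not the kernel implementation. */
static unsigned long model_batch(unsigned long managed_pages)
{
	unsigned long batch = managed_pages / 1024;

	if (batch * PAGE_SIZE > 1024 * 1024)	/* cap at 1MB worth of pages */
		batch = (1024 * 1024) / PAGE_SIZE;
	batch /= 4;
	return batch ? batch : 1;
}

int main(void)
{
	/* Only a sliver of the zone is managed while deferred init runs */
	unsigned long early = 28 * 1024;		/* ~112MB of managed pages */
	unsigned long late  = 32UL * 1024 * 1024;	/* ~128GB zone */

	printf("early: batch=%lu high=%lu\n",
	       model_batch(early), 6 * model_batch(early));
	printf("late:  batch=%lu high=%lu\n",
	       model_batch(late), 6 * model_batch(late));
	return 0;
}

With the clamp applied, these correspond to the 7/42 and 63/378 pairs
reported later in this thread.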

This increases zone lock contention quite severely in some cases with the
degree of severity depending on how many CPUs share a local zone and the
size of the zone. A private report indicated that kernel build times were
excessive with extremely high system CPU usage. A perf profile indicated
that a large chunk of time was lost on zone->lock contention.

This patch recalculates the pcpu batch and high values after deferred
initialisation completes on each node. It was tested on a 2-socket AMD
EPYC 2 machine using a kernel compilation workload -- allmodconfig and
all available CPUs.

mmtests configuration: config-workload-kernbench-max
Configuration was modified to build on a fresh XFS partition.

kernbench
                                5.4.0-rc3              5.4.0-rc3
                                  vanilla         resetpcpu-v1r1
Amean     user-256    13249.50 (   0.00%)    15928.40 * -20.22%*
Amean     syst-256    14760.30 (   0.00%)     4551.77 *  69.16%*
Amean     elsp-256      162.42 (   0.00%)      118.46 *  27.06%*
Stddev    user-256       42.97 (   0.00%)       50.83 ( -18.30%)
Stddev    syst-256      336.87 (   0.00%)       33.70 (  90.00%)
Stddev    elsp-256        2.46 (   0.00%)        0.81 (  67.01%)

                   5.4.0-rc3   5.4.0-rc3
                     vanilla  resetpcpu-v1r1
Duration User       39766.24    47802.92
Duration System     44298.10    13671.93
Duration Elapsed      519.11      387.65

The patch reduces system CPU usage by 69.16% and total build time by
27.06%. The variance of system CPU usage is also much reduced.

Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cafe568d36f6..0a0dd74edc83 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1818,6 +1818,14 @@ static int __init deferred_init_memmap(void *data)
 	 */
 	while (spfn < epfn)
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+
+	/*
+	 * The number of managed pages has changed due to the initialisation
+	 * so the pcpu batch and high limits need to be updated or the limits
+	 * will be artificially small.
+	 */
+	zone_pcp_update(zone);
+
 zone_empty:
 	pgdat_resize_unlock(pgdat, &flags);
 
@@ -8516,7 +8524,6 @@ void free_contig_range(unsigned long pfn, unsigned int nr_pages)
 	WARN(count != 0, "%d pages are still in use!\n", count);
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
 /*
  * The zone indicated has a new number of managed_pages; batch sizes and percpu
  * page high values need to be recalculated.
@@ -8527,7 +8534,6 @@ void __meminit zone_pcp_update(struct zone *zone)
 	__zone_pcp_update(zone);
 	mutex_unlock(&pcp_batch_high_lock);
 }
-#endif
 
 void zone_pcp_reset(struct zone *zone)
 {
-- 
2.16.4



* [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm
  2019-10-18 10:56 [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit Mel Gorman
  2019-10-18 10:56 ` [PATCH 1/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler Mel Gorman
  2019-10-18 10:56 ` [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
@ 2019-10-18 10:56 ` Mel Gorman
  2019-10-18 11:57   ` Matt Fleming
  2019-10-18 13:02   ` Michal Hocko
  2019-10-18 11:58 ` [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit Matt Fleming
  3 siblings, 2 replies; 16+ messages in thread
From: Mel Gorman @ 2019-10-18 10:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List, Mel Gorman

Memory hotplug needs to be able to reset and reinit the pcpu allocator
batch and high limits but this action is internal to the VM. Move
the declarations to internal.h.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm.h | 3 ---
 mm/internal.h      | 3 +++
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc292273e6ba..22d6104f2341 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2219,9 +2219,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
-extern void zone_pcp_update(struct zone *zone);
-extern void zone_pcp_reset(struct zone *zone);
-
 /* page_alloc.c */
 extern int min_free_kbytes;
 extern int watermark_boost_factor;
diff --git a/mm/internal.h b/mm/internal.h
index 0d5f720c75ab..0a3d41c7b3c5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -165,6 +165,9 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
 					gfp_t gfp_flags);
 extern int user_min_free_kbytes;
 
+extern void zone_pcp_update(struct zone *zone);
+extern void zone_pcp_reset(struct zone *zone);
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
-- 
2.16.4



* Re: [PATCH 1/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler
  2019-10-18 10:56 ` [PATCH 1/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler Mel Gorman
@ 2019-10-18 11:57   ` Matt Fleming
  2019-10-18 12:51   ` Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: Matt Fleming @ 2019-10-18 11:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Thomas Gleixner,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri, 18 Oct, at 11:56:04AM, Mel Gorman wrote:
> Both the percpu_pagelist_fraction sysctl handler and memory hotplug
> have a common requirement of updating the pcpu page allocation batch
> and high values. Split the relevant helper to share common code.
> 
> No functional change.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/page_alloc.c | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
 
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>


* Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-18 10:56 ` [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
@ 2019-10-18 11:57   ` Matt Fleming
  2019-10-18 13:01   ` Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: Matt Fleming @ 2019-10-18 11:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Thomas Gleixner,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri, 18 Oct, at 11:56:05AM, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes on each node. It was tested on a 2-socket AMD
> EPYC 2 machine using a kernel compilation workload -- allmodconfig and
> all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                 5.4.0-rc3              5.4.0-rc3
>                                   vanilla         resetpcpu-v1r1
> Amean     user-256    13249.50 (   0.00%)    15928.40 * -20.22%*
> Amean     syst-256    14760.30 (   0.00%)     4551.77 *  69.16%*
> Amean     elsp-256      162.42 (   0.00%)      118.46 *  27.06%*
> Stddev    user-256       42.97 (   0.00%)       50.83 ( -18.30%)
> Stddev    syst-256      336.87 (   0.00%)       33.70 (  90.00%)
> Stddev    elsp-256        2.46 (   0.00%)        0.81 (  67.01%)
> 
>                    5.4.0-rc3   5.4.0-rc3
>                       vanilla  resetpcpu-v1r1
> Duration User       39766.24    47802.92
> Duration System     44298.10    13671.93
> Duration Elapsed      519.11      387.65
> 
> The patch reduces system CPU usage by 69.16% and total build time by
> 27.06%. The variance of system CPU usage is also much reduced.
> 
> Cc: stable@vger.kernel.org # v4.15+
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/page_alloc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>


* Re: [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm
  2019-10-18 10:56 ` [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm Mel Gorman
@ 2019-10-18 11:57   ` Matt Fleming
  2019-10-18 13:02   ` Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: Matt Fleming @ 2019-10-18 11:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Thomas Gleixner,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri, 18 Oct, at 11:56:06AM, Mel Gorman wrote:
> Memory hotplug needs to be able to reset and reinit the pcpu allocator
> batch and high limits but this action is internal to the VM. Move
> the declarations to internal.h.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/mm.h | 3 ---
>  mm/internal.h      | 3 +++
>  2 files changed, 3 insertions(+), 3 deletions(-)

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>


* Re: [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit
  2019-10-18 10:56 [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit Mel Gorman
                   ` (2 preceding siblings ...)
  2019-10-18 10:56 ` [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm Mel Gorman
@ 2019-10-18 11:58 ` Matt Fleming
  2019-10-18 12:54   ` Mel Gorman
  3 siblings, 1 reply; 16+ messages in thread
From: Matt Fleming @ 2019-10-18 11:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Thomas Gleixner,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri, 18 Oct, at 11:56:03AM, Mel Gorman wrote:
> A private report stated that system CPU usage was excessive on an AMD
> EPYC 2 machine while building kernels with much longer build times than
> expected. The issue is partially explained by high zone lock contention
> due to the per-cpu page allocator batch and high limits being calculated
> incorrectly. This series addresses a large chunk of the problem. Patch 1
> is mostly cosmetic but prepares for patch 2 which is the real fix. Patch
> 3 is definitely cosmetic but was noticed while implementing the fix. Proper
> details are in the changelog for patch 2.
> 
>  include/linux/mm.h |  3 ---
>  mm/internal.h      |  3 +++
>  mm/page_alloc.c    | 33 ++++++++++++++++++++-------------
>  3 files changed, 23 insertions(+), 16 deletions(-)

Just to confirm, these patches don't fix the issue we're seeing on the
EPYC 2 machines, but they do return the batch sizes to sensible values.


* Re: [PATCH 1/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler
  2019-10-18 10:56 ` [PATCH 1/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler Mel Gorman
  2019-10-18 11:57   ` Matt Fleming
@ 2019-10-18 12:51   ` Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2019-10-18 12:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri 18-10-19 11:56:04, Mel Gorman wrote:
> Both the percpu_pagelist_fraction sysctl handler and memory hotplug
> have a common requirement of updating the pcpu page allocation batch
> and high values. Split the relevant helper to share common code.
> 
> No functional change.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..cafe568d36f6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7983,6 +7983,15 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
>  	return 0;
>  }
>  
> +static void __zone_pcp_update(struct zone *zone)
> +{
> +	unsigned int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		pageset_set_high_and_batch(zone,
> +				per_cpu_ptr(zone->pageset, cpu));
> +}
> +
>  /*
>   * percpu_pagelist_fraction - changes the pcp->high for each zone on each
>   * cpu.  It is the fraction of total pages in each zone that a hot per cpu
> @@ -8014,13 +8023,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
>  	if (percpu_pagelist_fraction == old_percpu_pagelist_fraction)
>  		goto out;
>  
> -	for_each_populated_zone(zone) {
> -		unsigned int cpu;
> -
> -		for_each_possible_cpu(cpu)
> -			pageset_set_high_and_batch(zone,
> -					per_cpu_ptr(zone->pageset, cpu));
> -	}
> +	for_each_populated_zone(zone)
> +		__zone_pcp_update(zone);
>  out:
>  	mutex_unlock(&pcp_batch_high_lock);
>  	return ret;
> @@ -8519,11 +8523,8 @@ void free_contig_range(unsigned long pfn, unsigned int nr_pages)
>   */
>  void __meminit zone_pcp_update(struct zone *zone)
>  {
> -	unsigned cpu;
>  	mutex_lock(&pcp_batch_high_lock);
> -	for_each_possible_cpu(cpu)
> -		pageset_set_high_and_batch(zone,
> -				per_cpu_ptr(zone->pageset, cpu));
> +	__zone_pcp_update(zone);
>  	mutex_unlock(&pcp_batch_high_lock);
>  }
>  #endif
> -- 
> 2.16.4
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit
  2019-10-18 11:58 ` [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit Matt Fleming
@ 2019-10-18 12:54   ` Mel Gorman
  2019-10-18 14:48     ` Matt Fleming
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2019-10-18 12:54 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Thomas Gleixner,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri, Oct 18, 2019 at 12:58:49PM +0100, Matt Fleming wrote:
> On Fri, 18 Oct, at 11:56:03AM, Mel Gorman wrote:
> > A private report stated that system CPU usage was excessive on an AMD
> > EPYC 2 machine while building kernels with much longer build times than
> > expected. The issue is partially explained by high zone lock contention
> > due to the per-cpu page allocator batch and high limits being calculated
> > incorrectly. This series addresses a large chunk of the problem. Patch 1
> > is mostly cosmetic but prepares for patch 2 which is the real fix. Patch
> > 3 is definitely cosmetic but was noticed while implementing the fix. Proper
> > details are in the changelog for patch 2.
> > 
> >  include/linux/mm.h |  3 ---
> >  mm/internal.h      |  3 +++
> >  mm/page_alloc.c    | 33 ++++++++++++++++++++-------------
> >  3 files changed, 23 insertions(+), 16 deletions(-)
> 
> Just to confirm, these patches don't fix the issue we're seeing on the
> EPYC 2 machines, but they do return the batch sizes to sensible values.

To be clear, does the patch a) fix *some* of the issue, with something
else also going on that needs to be chased down, or b) have no impact
on build time or system CPU usage on your machine?

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-18 10:56 ` [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
  2019-10-18 11:57   ` Matt Fleming
@ 2019-10-18 13:01   ` Michal Hocko
  2019-10-18 14:09     ` Mel Gorman
  1 sibling, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2019-10-18 13:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri 18-10-19 11:56:05, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes on each node. It was tested on a 2-socket AMD
> EPYC 2 machine using a kernel compilation workload -- allmodconfig and
> all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                 5.4.0-rc3              5.4.0-rc3
>                                   vanilla         resetpcpu-v1r1
> Amean     user-256    13249.50 (   0.00%)    15928.40 * -20.22%*
> Amean     syst-256    14760.30 (   0.00%)     4551.77 *  69.16%*
> Amean     elsp-256      162.42 (   0.00%)      118.46 *  27.06%*
> Stddev    user-256       42.97 (   0.00%)       50.83 ( -18.30%)
> Stddev    syst-256      336.87 (   0.00%)       33.70 (  90.00%)
> Stddev    elsp-256        2.46 (   0.00%)        0.81 (  67.01%)
> 
>                    5.4.0-rc3   5.4.0-rc3
>                      vanillaresetpcpu-v1r1
> Duration User       39766.24    47802.92
> Duration System     44298.10    13671.93
> Duration Elapsed      519.11      387.65
> 
> The patch reduces system CPU usage by 69.16% and total build time by
> 27.06%. The variance of system CPU usage is also much reduced.

The fix makes sense. It would be nice to see the difference in the batch
sizes from the initial setup compared to the one after the deferred
initialization is done.

> Cc: stable@vger.kernel.org # v4.15+

Hmm, are you sure about 4.15? Doesn't this go all the way down to
deferred initialization? I do not see any recent changes on when
setup_per_cpu_pageset is called.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cafe568d36f6..0a0dd74edc83 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1818,6 +1818,14 @@ static int __init deferred_init_memmap(void *data)
>  	 */
>  	while (spfn < epfn)
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +
> +	/*
> +	 * The number of managed pages has changed due to the initialisation
> +	 * so the pcpu batch and high limits need to be updated or the limits
> +	 * will be artificially small.
> +	 */
> +	zone_pcp_update(zone);
> +
>  zone_empty:
>  	pgdat_resize_unlock(pgdat, &flags);
>  
> @@ -8516,7 +8524,6 @@ void free_contig_range(unsigned long pfn, unsigned int nr_pages)
>  	WARN(count != 0, "%d pages are still in use!\n", count);
>  }
>  
> -#ifdef CONFIG_MEMORY_HOTPLUG
>  /*
>   * The zone indicated has a new number of managed_pages; batch sizes and percpu
>   * page high values need to be recalculated.
> @@ -8527,7 +8534,6 @@ void __meminit zone_pcp_update(struct zone *zone)
>  	__zone_pcp_update(zone);
>  	mutex_unlock(&pcp_batch_high_lock);
>  }
> -#endif
>  
>  void zone_pcp_reset(struct zone *zone)
>  {
> -- 
> 2.16.4
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm
  2019-10-18 10:56 ` [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm Mel Gorman
  2019-10-18 11:57   ` Matt Fleming
@ 2019-10-18 13:02   ` Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2019-10-18 13:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri 18-10-19 11:56:06, Mel Gorman wrote:
> Memory hotplug needs to be able to reset and reinit the pcpu allocator
> batch and high limits but this action is internal to the VM. Move
> the declarations to internal.h.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mm.h | 3 ---
>  mm/internal.h      | 3 +++
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cc292273e6ba..22d6104f2341 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2219,9 +2219,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
>  
>  extern void setup_per_cpu_pageset(void);
>  
> -extern void zone_pcp_update(struct zone *zone);
> -extern void zone_pcp_reset(struct zone *zone);
> -
>  /* page_alloc.c */
>  extern int min_free_kbytes;
>  extern int watermark_boost_factor;
> diff --git a/mm/internal.h b/mm/internal.h
> index 0d5f720c75ab..0a3d41c7b3c5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -165,6 +165,9 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
>  					gfp_t gfp_flags);
>  extern int user_min_free_kbytes;
>  
> +extern void zone_pcp_update(struct zone *zone);
> +extern void zone_pcp_reset(struct zone *zone);
> +
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  
>  /*
> -- 
> 2.16.4
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-18 13:01   ` Michal Hocko
@ 2019-10-18 14:09     ` Mel Gorman
  2019-10-19  1:40       ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2019-10-18 14:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri, Oct 18, 2019 at 03:01:27PM +0200, Michal Hocko wrote:
> On Fri 18-10-19 11:56:05, Mel Gorman wrote:
> > Deferred memory initialisation updates zone->managed_pages during
> > the initialisation phase but before that finishes, the per-cpu page
> > allocator (pcpu) calculates the number of pages allocated/freed in
> > batches as well as the maximum number of pages allowed on a per-cpu list.
> > As zone->managed_pages is not up to date yet, the pcpu initialisation
> > calculates inappropriately low batch and high values.
> > 
> > This increases zone lock contention quite severely in some cases with the
> > degree of severity depending on how many CPUs share a local zone and the
> > size of the zone. A private report indicated that kernel build times were
> > excessive with extremely high system CPU usage. A perf profile indicated
> > that a large chunk of time was lost on zone->lock contention.
> > 
> > This patch recalculates the pcpu batch and high values after deferred
> > initialisation completes on each node. It was tested on a 2-socket AMD
> > EPYC 2 machine using a kernel compilation workload -- allmodconfig and
> > all available CPUs.
> > 
> > mmtests configuration: config-workload-kernbench-max
> > Configuration was modified to build on a fresh XFS partition.
> > 
> > kernbench
> >                                 5.4.0-rc3              5.4.0-rc3
> >                                   vanilla         resetpcpu-v1r1
> > Amean     user-256    13249.50 (   0.00%)    15928.40 * -20.22%*
> > Amean     syst-256    14760.30 (   0.00%)     4551.77 *  69.16%*
> > Amean     elsp-256      162.42 (   0.00%)      118.46 *  27.06%*
> > Stddev    user-256       42.97 (   0.00%)       50.83 ( -18.30%)
> > Stddev    syst-256      336.87 (   0.00%)       33.70 (  90.00%)
> > Stddev    elsp-256        2.46 (   0.00%)        0.81 (  67.01%)
> > 
> >                    5.4.0-rc3   5.4.0-rc3
> >                      vanilla  resetpcpu-v1r1
> > Duration User       39766.24    47802.92
> > Duration System     44298.10    13671.93
> > Duration Elapsed      519.11      387.65
> > 
> > The patch reduces system CPU usage by 69.16% and total build time by
> > 27.06%. The variance of system CPU usage is also much reduced.
> 
> The fix makes sense. It would be nice to see the difference in the batch
> sizes from the initial setup compared to the one after the deferred
> intialization is done
> 

Before the patch, this was the breakdown of batch and high values over
all zones:

    256               batch: 1
    256               batch: 63
    512               batch: 7

    256               high:  0
    256               high:  378
    512               high:  42

i.e. 512 pcpu pagesets had a batch limit of 7 and a high limit of 42.
These were for the NORMAL zones on the system. After the patch

    256               batch: 1
    768               batch: 63

    256               high:  0
    768               high:  378
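
For reference, the per-pageset values above can be read straight from
the per-cpu pageset section of /proc/zoneinfo. The reader below is only
a convenience sketch; it assumes the usual layout where the pageset
lines carry "high:" and "batch:" with a colon, unlike the zone
watermark lines:

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/zoneinfo", "r");
	char line[256];

	if (!f) {
		perror("/proc/zoneinfo");
		return 1;
	}
	/* Print zone headers plus the per-cpu pageset high/batch lines */
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "Node", 4) == 0 ||
		    strstr(line, "high:") || strstr(line, "batch:"))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}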

> > Cc: stable@vger.kernel.org # v4.15+
> 
> Hmm, are you sure about 4.15? Doesn't this go all the way down to
> deferred initialization? I do not see any recent changes on when
> setup_per_cpu_pageset is called.
> 

No, I'm not 100% sure. It looks like this was always an issue from the
code but did not happen on at least one 4.12-based distribution kernel for
reasons that are non-obvious. Either way, the tag should have been "v4.1+"

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit
  2019-10-18 12:54   ` Mel Gorman
@ 2019-10-18 14:48     ` Matt Fleming
  0 siblings, 0 replies; 16+ messages in thread
From: Matt Fleming @ 2019-10-18 14:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Thomas Gleixner,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri, 18 Oct, at 01:54:49PM, Mel Gorman wrote:
> On Fri, Oct 18, 2019 at 12:58:49PM +0100, Matt Fleming wrote:
> > On Fri, 18 Oct, at 11:56:03AM, Mel Gorman wrote:
> > > A private report stated that system CPU usage was excessive on an AMD
> > > EPYC 2 machine while building kernels with much longer build times than
> > > expected. The issue is partially explained by high zone lock contention
> > > due to the per-cpu page allocator batch and high limits being calculated
> > > incorrectly. This series addresses a large chunk of the problem. Patch 1
> > > is mostly cosmetic but prepares for patch 2 which is the real fix. Patch
> > > 3 is definitely cosmetic but was noticed while implementing the fix. Proper
> > > details are in the changelog for patch 2.
> > > 
> > >  include/linux/mm.h |  3 ---
> > >  mm/internal.h      |  3 +++
> > >  mm/page_alloc.c    | 33 ++++++++++++++++++++-------------
> > >  3 files changed, 23 insertions(+), 16 deletions(-)
> > 
> > Just to confirm, these patches don't fix the issue we're seeing on the
> > EPYC 2 machines, but they do return the batch sizes to sensible values.
> 
> To be clear, does the patch a) fix *some* of the issue, with something
> else also going on that needs to be chased down, or b) have no impact
> on build time or system CPU usage on your machine?

Sorry, I realise my email was pretty unclear.

These patches *do* fix some of the issue because I no longer see as
much contention on the zone locks with the patches applied.


* Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-18 14:09     ` Mel Gorman
@ 2019-10-19  1:40       ` Andrew Morton
  2019-10-20  9:32         ` Mel Gorman
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2019-10-19  1:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri, 18 Oct 2019 15:09:59 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> > > Cc: stable@vger.kernel.org # v4.15+
> > 
> > Hmm, are you sure about 4.15? Doesn't this go all the way down to
> > deferred initialization? I do not see any recent changes on when
> > setup_per_cpu_pageset is called.
> > 
> 
> No, I'm not 100% sure. It looks like this was always an issue from the
> code but did not happen on at least one 4.12-based distribution kernel for
> reasons that are non-obvious. Either way, the tag should have been "v4.1+"

I could mark

mm-pcp-share-common-code-between-memory-hotplug-and-percpu-sysctl-handler.patch
mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes.patch

as Cc: <stable@vger.kernel.org>	[4.1+]

But for backporting purposes it's a bit cumbersome that [2/3] is the
important patch.  I think I'll switch the ordering so that
mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes.patch
is the first patch and the other two can be queued for 5.5-rc1, OK?

Also, is a Reported-by: Matt appropriate here?


From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm, meminit: recalculate pcpu batch and high limits after init completes

Deferred memory initialisation updates zone->managed_pages during the
initialisation phase but before that finishes, the per-cpu page allocator
(pcpu) calculates the number of pages allocated/freed in batches as well
as the maximum number of pages allowed on a per-cpu list.  As
zone->managed_pages is not up to date yet, the pcpu initialisation
calculates inappropriately low batch and high values.

This increases zone lock contention quite severely in some cases with the
degree of severity depending on how many CPUs share a local zone and the
size of the zone.  A private report indicated that kernel build times were
excessive with extremely high system CPU usage.  A perf profile indicated
that a large chunk of time was lost on zone->lock contention.

This patch recalculates the pcpu batch and high values after deferred
initialisation completes on each node.  It was tested on a 2-socket AMD
EPYC 2 machine using a kernel compilation workload -- allmodconfig and all
available CPUs.

mmtests configuration: config-workload-kernbench-max
Configuration was modified to build on a fresh XFS partition.

kernbench
                                5.4.0-rc3              5.4.0-rc3
                                  vanilla         resetpcpu-v1r1
Amean     user-256    13249.50 (   0.00%)    15928.40 * -20.22%*
Amean     syst-256    14760.30 (   0.00%)     4551.77 *  69.16%*
Amean     elsp-256      162.42 (   0.00%)      118.46 *  27.06%*
Stddev    user-256       42.97 (   0.00%)       50.83 ( -18.30%)
Stddev    syst-256      336.87 (   0.00%)       33.70 (  90.00%)
Stddev    elsp-256        2.46 (   0.00%)        0.81 (  67.01%)

                   5.4.0-rc3   5.4.0-rc3
                     vanilla  resetpcpu-v1r1
Duration User       39766.24    47802.92
Duration System     44298.10    13671.93
Duration Elapsed      519.11      387.65

The patch reduces system CPU usage by 69.16% and total build time by
27.06%.  The variance of system CPU usage is also much reduced.

Link: http://lkml.kernel.org/r/20191018105606.3249-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <stable@vger.kernel.org>	[4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- a/mm/page_alloc.c~mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes
+++ a/mm/page_alloc.c
@@ -1818,6 +1818,14 @@ static int __init deferred_init_memmap(v
 	 */
 	while (spfn < epfn)
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+
+	/*
+	 * The number of managed pages has changed due to the initialisation
+	 * so the pcpu batch and high limits need to be updated or the limits
+	 * will be artificially small.
+	 */
+	zone_pcp_update(zone);
+
 zone_empty:
 	pgdat_resize_unlock(pgdat, &flags);
 
@@ -8514,7 +8522,6 @@ void free_contig_range(unsigned long pfn
 	WARN(count != 0, "%d pages are still in use!\n", count);
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
 /*
  * The zone indicated has a new number of managed_pages; batch sizes and percpu
  * page high values need to be recalculated.
@@ -8528,7 +8535,6 @@ void __meminit zone_pcp_update(struct zo
 				per_cpu_ptr(zone->pageset, cpu));
 	mutex_unlock(&pcp_batch_high_lock);
 }
-#endif
 
 void zone_pcp_reset(struct zone *zone)
 {
_



* Re: [PATCH 2/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-19  1:40       ` Andrew Morton
@ 2019-10-20  9:32         ` Mel Gorman
  0 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2019-10-20  9:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Fri, Oct 18, 2019 at 06:40:24PM -0700, Andrew Morton wrote:
> On Fri, 18 Oct 2019 15:09:59 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > > > Cc: stable@vger.kernel.org # v4.15+
> > > 
> > > Hmm, are you sure about 4.15? Doesn't this go all the way down to
> > > deferred initialization? I do not see any recent changes on when
> > > setup_per_cpu_pageset is called.
> > > 
> > 
> > No, I'm not 100% sure. It looks like this was always an issue from the
> > code but did not happen on at least one 4.12-based distribution kernel for
> > reasons that are non-obvious. Either way, the tag should have been "v4.1+"
> 
> I could mark
> 
> mm-pcp-share-common-code-between-memory-hotplug-and-percpu-sysctl-handler.patch
> mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes.patch
> 
> as Cc: <stable@vger.kernel.org>	[4.1+]
> 

That would be fine.

> But for backporting purposes it's a bit cumbersome that [2/3] is the
> important patch.  I think I'll switch the ordering so that
> mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes.patch
> is the first patch and the other two can be queued for 5.5-rc1, OK?
> 

It might be easier to simply collapse patch 1 and 2 together. They were
only split to make the review easier and to avoid two relatively big
changes in one patch.

> Also, is a Reported-by:Matt appropriate here?
> 

I don't object but I'm not actually sure who reported this first. I think
it was Thomas who talked to Boris about an EPYC performance issue, who
talked to Matt thinking it might be a scheduler issue who identified it
was my problem :P

-- 
Mel Gorman
SUSE Labs

