[PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit v2

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit v2
@ 2019-10-21  9:48 Mel Gorman
  2019-10-21  9:48 ` [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Mel Gorman @ 2019-10-21  9:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List, Mel Gorman

This is an updated series that addresses some review feedback and an
LKP warning.  I did not preserve Michal Hocko's ack for the fix as it
has changed.  This series replaces the following patches in mmotm

mm-pcp-share-common-code-between-memory-hotplug-and-percpu-sysctl-handler.patch
mm-meminit-recalculate-pcpu-batch-and-high-limits-after-init-completes.patch
mm-pcpu-make-zone-pcp-updates-and-reset-internal-to-the-mm.patch

Changelog since V1
o Fix a "might sleep" warning
o Reorder for easier backporting

A private report stated that system CPU usage was excessive on an AMD
EPYC 2 machine while building kernels with much longer build times than
expected. The issue is partially explained by high zone lock contention
due to the per-cpu page allocator batch and high limits being calculated
incorrectly. This series addresses a large chunk of the problem. Patch
1 is the real fix and the other two are cosmetic issues noticed while
implementing the fix.

 include/linux/mm.h |  3 ---
 mm/internal.h      |  3 +++
 mm/page_alloc.c    | 31 ++++++++++++++++++++-----------
 3 files changed, 23 insertions(+), 14 deletions(-)

-- 
2.16.4

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-21  9:48 [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit v2 Mel Gorman
@ 2019-10-21  9:48 ` Mel Gorman
  2019-10-21 10:27   ` Michal Hocko
                     ` (4 more replies)
  2019-10-21  9:48 ` [PATCH 2/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler Mel Gorman
  2019-10-21  9:48 ` [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm Mel Gorman
  2 siblings, 5 replies; 13+ messages in thread
From: Mel Gorman @ 2019-10-21  9:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List, Mel Gorman

Deferred memory initialisation updates zone->managed_pages during
the initialisation phase but before that finishes, the per-cpu page
allocator (pcpu) calculates the number of pages allocated/freed in
batches as well as the maximum number of pages allowed on a per-cpu list.
As zone->managed_pages is not up to date yet, the pcpu initialisation
calculates inappropriately low batch and high values.

This increases zone lock contention quite severely in some cases with the
degree of severity depending on how many CPUs share a local zone and the
size of the zone. A private report indicated that kernel build times were
excessive with extremely high system CPU usage. A perf profile indicated
that a large chunk of time was lost on zone->lock contention.

This patch recalculates the pcpu batch and high values after deferred
initialisation completes for every populated zone in the system. It
was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
workload -- allmodconfig and all available CPUs.

mmtests configuration: config-workload-kernbench-max
Configuration was modified to build on a fresh XFS partition.

kernbench
                                5.4.0-rc3              5.4.0-rc3
                                  vanilla           resetpcpu-v2
Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)

                   5.4.0-rc3    5.4.0-rc3
                     vanilla resetpcpu-v2
Duration User       39766.24     49221.79
Duration System     44298.10     13361.67
Duration Elapsed      519.11       388.87

The patch reduces system CPU usage by 69.86% and total build time by
26.65%. The variance of system CPU usage is also much reduced.

Before, this was the breakdown of batch and high values over all zones was.

    256               batch: 1
    256               batch: 63
    512               batch: 7
    256               high:  0
    256               high:  378
    512               high:  42

512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the patch

    256               batch: 1
    768               batch: 63
    256               high:  0
    768               high:  378

Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0b2e0306720..f972076d0f6b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
 	/* Block until all are initialised */
 	wait_for_completion(&pgdat_init_all_done_comp);

+	/*
+	 * The number of managed pages has changed due to the initialisation
+	 * so the pcpu batch and high limits needs to be updated or the limits
+	 * will be artificially small.
+	 */
+	for_each_populated_zone(zone)
+		zone_pcp_update(zone);
+
 	/*
 	 * We initialized the rest of the deferred pages.  Permanently disable
 	 * on-demand struct page initialization.
-- 
2.16.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-21  9:48 ` [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
@ 2019-10-21 10:27   ` Michal Hocko
  2019-10-21 11:42   ` Vlastimil Babka
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2019-10-21 10:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List

On Mon 21-10-19 10:48:06, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                 5.4.0-rc3              5.4.0-rc3
>                                   vanilla           resetpcpu-v2
> Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
> Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
> Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
> Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
> 
>                    5.4.0-rc3    5.4.0-rc3
>                      vanilla resetpcpu-v2
> Duration User       39766.24     49221.79
> Duration System     44298.10     13361.67
> Duration Elapsed      519.11       388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
> Before, this was the breakdown of batch and high values over all zones was.
> 
>     256               batch: 1
>     256               batch: 63
>     512               batch: 7
>     256               high:  0
>     256               high:  378
>     512               high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the patch
> 
>     256               batch: 1
>     768               batch: 63
>     256               high:  0
>     768               high:  378
> 
> Cc: stable@vger.kernel.org # v4.1+
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
>  	/* Block until all are initialised */
>  	wait_for_completion(&pgdat_init_all_done_comp);
>  
> +	/*
> +	 * The number of managed pages has changed due to the initialisation
> +	 * so the pcpu batch and high limits needs to be updated or the limits
> +	 * will be artificially small.
> +	 */
> +	for_each_populated_zone(zone)
> +		zone_pcp_update(zone);
> +
>  	/*
>  	 * We initialized the rest of the deferred pages.  Permanently disable
>  	 * on-demand struct page initialization.
> -- 
> 2.16.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-21  9:48 ` [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
  2019-10-21 10:27   ` Michal Hocko
@ 2019-10-21 11:42   ` Vlastimil Babka
  2019-10-21 14:01   ` Qian Cai
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Vlastimil Babka @ 2019-10-21 11:42 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Michal Hocko, Thomas Gleixner, Matt Fleming, Borislav Petkov,
	Linux-MM, Linux Kernel Mailing List

On 10/21/19 11:48 AM, Mel Gorman wrote:
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                 5.4.0-rc3              5.4.0-rc3
>                                   vanilla           resetpcpu-v2
> Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
> Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
> Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
> Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
> 
>                    5.4.0-rc3    5.4.0-rc3
>                      vanilla resetpcpu-v2
> Duration User       39766.24     49221.79
> Duration System     44298.10     13361.67
> Duration Elapsed      519.11       388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
> Before, this was the breakdown of batch and high values over all zones was.
> 
>     256               batch: 1
>     256               batch: 63
>     512               batch: 7
>     256               high:  0
>     256               high:  378
>     512               high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the patch
> 
>     256               batch: 1
>     768               batch: 63
>     256               high:  0
>     768               high:  378
> 
> Cc: stable@vger.kernel.org # v4.1+
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
>  	/* Block until all are initialised */
>  	wait_for_completion(&pgdat_init_all_done_comp);
>  
> +	/*
> +	 * The number of managed pages has changed due to the initialisation
> +	 * so the pcpu batch and high limits needs to be updated or the limits
> +	 * will be artificially small.
> +	 */
> +	for_each_populated_zone(zone)
> +		zone_pcp_update(zone);
> +
>  	/*
>  	 * We initialized the rest of the deferred pages.  Permanently disable
>  	 * on-demand struct page initialization.
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-21  9:48 ` [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
  2019-10-21 10:27   ` Michal Hocko
  2019-10-21 11:42   ` Vlastimil Babka
@ 2019-10-21 14:01   ` Qian Cai
  2019-10-21 14:12     ` Michal Hocko
  2019-10-21 14:25     ` Mel Gorman
  2019-10-21 19:39   ` [PATCH] mm, meminit: Recalculate pcpu batch and high limits after init completes -fix Mel Gorman
       [not found]   ` <20191026131036.A7A5421655@mail.kernel.org>
  4 siblings, 2 replies; 13+ messages in thread
From: Qian Cai @ 2019-10-21 14:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Thomas Gleixner,
	Matt Fleming, Borislav Petkov, Linux-MM,
	Linux Kernel Mailing List



> On Oct 21, 2019, at 5:48 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> Deferred memory initialisation updates zone->managed_pages during
> the initialisation phase but before that finishes, the per-cpu page
> allocator (pcpu) calculates the number of pages allocated/freed in
> batches as well as the maximum number of pages allowed on a per-cpu list.
> As zone->managed_pages is not up to date yet, the pcpu initialisation
> calculates inappropriately low batch and high values.
> 
> This increases zone lock contention quite severely in some cases with the
> degree of severity depending on how many CPUs share a local zone and the
> size of the zone. A private report indicated that kernel build times were
> excessive with extremely high system CPU usage. A perf profile indicated
> that a large chunk of time was lost on zone->lock contention.
> 
> This patch recalculates the pcpu batch and high values after deferred
> initialisation completes for every populated zone in the system. It
> was tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
> workload -- allmodconfig and all available CPUs.
> 
> mmtests configuration: config-workload-kernbench-max
> Configuration was modified to build on a fresh XFS partition.
> 
> kernbench
>                                5.4.0-rc3              5.4.0-rc3
>                                  vanilla           resetpcpu-v2
> Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
> Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
> Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
> Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
> Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
> Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
> 
>                   5.4.0-rc3    5.4.0-rc3
>                     vanilla resetpcpu-v2
> Duration User       39766.24     49221.79
> Duration System     44298.10     13361.67
> Duration Elapsed      519.11       388.87
> 
> The patch reduces system CPU usage by 69.86% and total build time by
> 26.65%. The variance of system CPU usage is also much reduced.
> 
> Before, this was the breakdown of batch and high values over all zones was.
> 
>    256               batch: 1
>    256               batch: 63
>    512               batch: 7
>    256               high:  0
>    256               high:  378
>    512               high:  42
> 
> 512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After the patch
> 
>    256               batch: 1
>    768               batch: 63
>    256               high:  0
>    768               high:  378
> 
> Cc: stable@vger.kernel.org # v4.1+
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> mm/page_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..f972076d0f6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
> 	/* Block until all are initialised */
> 	wait_for_completion(&pgdat_init_all_done_comp);
> 
> +	/*
> +	 * The number of managed pages has changed due to the initialisation
> +	 * so the pcpu batch and high limits needs to be updated or the limits
> +	 * will be artificially small.
> +	 */
> +	for_each_populated_zone(zone)
> +		zone_pcp_update(zone);
> +
> 	/*
> 	 * We initialized the rest of the deferred pages.  Permanently disable
> 	 * on-demand struct page initialization.
> -- 
> 2.16.4
> 
> 

Warnings from linux-next,

[   14.265911][  T659] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
[   14.265992][  T659] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 659, name: pgdatinit8
[   14.266044][  T659] 1 lock held by pgdatinit8/659:
[   14.266075][  T659]  #0: c000201ffca87b40 (&(&pgdat->node_size_lock)->rlock){....}, at: deferred_init_memmap+0xc4/0x26c
[   14.266160][  T659] irq event stamp: 26
[   14.266194][  T659] hardirqs last  enabled at (25): [<c000000000950584>] _raw_spin_unlock_irq+0x44/0x80
[   14.266246][  T659] hardirqs last disabled at (26): [<c0000000009502ec>] _raw_spin_lock_irqsave+0x3c/0xa0
[   14.266299][  T659] softirqs last  enabled at (0): [<c0000000000ff8d0>] copy_process+0x720/0x19b0
[   14.266339][  T659] softirqs last disabled at (0): [<0000000000000000>] 0x0
[   14.266400][  T659] CPU: 64 PID: 659 Comm: pgdatinit8 Not tainted 5.4.0-rc4-next-20191021 #1
[   14.266462][  T659] Call Trace:
[   14.266494][  T659] [c00000003d8efae0] [c000000000921cf4] dump_stack+0xe8/0x164 (unreliable)
[   14.266538][  T659] [c00000003d8efb30] [c000000000157c54] ___might_sleep+0x334/0x370
[   14.266577][  T659] [c00000003d8efbb0] [c00000000094a784] __mutex_lock+0x84/0xb20
[   14.266627][  T659] [c00000003d8efcc0] [c000000000954038] zone_pcp_update+0x34/0x64
[   14.266677][  T659] [c00000003d8efcf0] [c000000000b9e6bc] deferred_init_memmap+0x1b8/0x26c
[   14.266740][  T659] [c00000003d8efdb0] [c000000000149528] kthread+0x1a8/0x1b0
[   14.266792][  T659] [c00000003d8efe20] [c00000000000b748] ret_from_kernel_thread+0x5c/0x74
[   14.268288][  T659] node 8 initialised, 1879186 pages in 12200ms
[   14.268527][  T659] pgdatinit8 (659) used greatest stack depth: 27984 bytes left
[   15.589983][  T658] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
[   15.590041][  T658] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 658, name: pgdatinit0
[   15.590078][  T658] 1 lock held by pgdatinit0/658:
[   15.590108][  T658]  #0: c000001fff5c7b40 (&(&pgdat->node_size_lock)->rlock){....}, at: deferred_init_memmap+0xc4/0x26c
[   15.590192][  T658] irq event stamp: 18
[   15.590224][  T658] hardirqs last  enabled at (17): [<c000000000950654>] _raw_spin_unlock_irqrestore+0x94/0xd0
[   15.590283][  T658] hardirqs last disabled at (18): [<c0000000009502ec>] _raw_spin_lock_irqsave+0x3c/0xa0
[   15.590332][  T658] softirqs last  enabled at (0): [<c0000000000ff8d0>] copy_process+0x720/0x19b0
[   15.590379][  T658] softirqs last disabled at (0): [<0000000000000000>] 0x0
[   15.590414][  T658] CPU: 8 PID: 658 Comm: pgdatinit0 Tainted: G        W         5.4.0-rc4-next-20191021 #1
[   15.590460][  T658] Call Trace:
[   15.590491][  T658] [c00000003d8cfae0] [c000000000921cf4] dump_stack+0xe8/0x164 (unreliable)
[   15.590541][  T658] [c00000003d8cfb30] [c000000000157c54] ___might_sleep+0x334/0x370
[   15.590588][  T658] [c00000003d8cfbb0] [c00000000094a784] __mutex_lock+0x84/0xb20
[   15.590643][  T658] [c00000003d8cfcc0] [c000000000954038] zone_pcp_update+0x34/0x64
[   15.590689][  T658] [c00000003d8cfcf0] [c000000000b9e6bc] deferred_init_memmap+0x1b8/0x26c
[   15.590739][  T658] [c00000003d8cfdb0] [c000000000149528] kthread+0x1a8/0x1b0
[   15.590790][  T658] [c00000003d8cfe20] [c00000000000b748] ret_from_kernel_thread+0x5c/0x74


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-21 14:01   ` Qian Cai
@ 2019-10-21 14:12     ` Michal Hocko
  2019-10-21 14:25     ` Mel Gorman
  1 sibling, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2019-10-21 14:12 UTC (permalink / raw)
  To: Qian Cai
  Cc: Mel Gorman, Andrew Morton, Vlastimil Babka, Thomas Gleixner,
	Matt Fleming, Borislav Petkov, Linux-MM,
	Linux Kernel Mailing List

On Mon 21-10-19 10:01:24, Qian Cai wrote:
> 
> 
> > On Oct 21, 2019, at 5:48 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
i[...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c0b2e0306720..f972076d0f6b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1947,6 +1947,14 @@ void __init page_alloc_init_late(void)
> > 	/* Block until all are initialised */
> > 	wait_for_completion(&pgdat_init_all_done_comp);
> > 
> > +	/*
> > +	 * The number of managed pages has changed due to the initialisation
> > +	 * so the pcpu batch and high limits needs to be updated or the limits
> > +	 * will be artificially small.
> > +	 */
> > +	for_each_populated_zone(zone)
> > +		zone_pcp_update(zone);
> > +
> > 	/*
> > 	 * We initialized the rest of the deferred pages.  Permanently disable
> > 	 * on-demand struct page initialization.
> > -- 
> > 2.16.4
> > 
> > 
> 
> Warnings from linux-next,
> 
> [   14.265911][  T659] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
> [   14.265992][  T659] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 659, name: pgdatinit8
> [   14.266044][  T659] 1 lock held by pgdatinit8/659:
> [   14.266075][  T659]  #0: c000201ffca87b40 (&(&pgdat->node_size_lock)->rlock){....}, at: deferred_init_memmap+0xc4/0x26c

This is really surprising to say the least. I do not see any spinlock
held here. Besides that we do sleep in wait_for_completion already.
Is it possible that the patch has been misplaced? zone_pcp_update is
called from page_alloc_init_late which is a different context than
deferred_init_memmap which runs in a separate kthread.

> [   14.266160][  T659] irq event stamp: 26
> [   14.266194][  T659] hardirqs last  enabled at (25): [<c000000000950584>] _raw_spin_unlock_irq+0x44/0x80
> [   14.266246][  T659] hardirqs last disabled at (26): [<c0000000009502ec>] _raw_spin_lock_irqsave+0x3c/0xa0
> [   14.266299][  T659] softirqs last  enabled at (0): [<c0000000000ff8d0>] copy_process+0x720/0x19b0
> [   14.266339][  T659] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [   14.266400][  T659] CPU: 64 PID: 659 Comm: pgdatinit8 Not tainted 5.4.0-rc4-next-20191021 #1
> [   14.266462][  T659] Call Trace:
> [   14.266494][  T659] [c00000003d8efae0] [c000000000921cf4] dump_stack+0xe8/0x164 (unreliable)
> [   14.266538][  T659] [c00000003d8efb30] [c000000000157c54] ___might_sleep+0x334/0x370
> [   14.266577][  T659] [c00000003d8efbb0] [c00000000094a784] __mutex_lock+0x84/0xb20
> [   14.266627][  T659] [c00000003d8efcc0] [c000000000954038] zone_pcp_update+0x34/0x64
> [   14.266677][  T659] [c00000003d8efcf0] [c000000000b9e6bc] deferred_init_memmap+0x1b8/0x26c
> [   14.266740][  T659] [c00000003d8efdb0] [c000000000149528] kthread+0x1a8/0x1b0
> [   14.266792][  T659] [c00000003d8efe20] [c00000000000b748] ret_from_kernel_thread+0x5c/0x74
> [   14.268288][  T659] node 8 initialised, 1879186 pages in 12200ms
> [   14.268527][  T659] pgdatinit8 (659) used greatest stack depth: 27984 bytes left
> [   15.589983][  T658] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
> [   15.590041][  T658] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 658, name: pgdatinit0
> [   15.590078][  T658] 1 lock held by pgdatinit0/658:
> [   15.590108][  T658]  #0: c000001fff5c7b40 (&(&pgdat->node_size_lock)->rlock){....}, at: deferred_init_memmap+0xc4/0x26c
> [   15.590192][  T658] irq event stamp: 18
> [   15.590224][  T658] hardirqs last  enabled at (17): [<c000000000950654>] _raw_spin_unlock_irqrestore+0x94/0xd0
> [   15.590283][  T658] hardirqs last disabled at (18): [<c0000000009502ec>] _raw_spin_lock_irqsave+0x3c/0xa0
> [   15.590332][  T658] softirqs last  enabled at (0): [<c0000000000ff8d0>] copy_process+0x720/0x19b0
> [   15.590379][  T658] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [   15.590414][  T658] CPU: 8 PID: 658 Comm: pgdatinit0 Tainted: G        W         5.4.0-rc4-next-20191021 #1
> [   15.590460][  T658] Call Trace:
> [   15.590491][  T658] [c00000003d8cfae0] [c000000000921cf4] dump_stack+0xe8/0x164 (unreliable)
> [   15.590541][  T658] [c00000003d8cfb30] [c000000000157c54] ___might_sleep+0x334/0x370
> [   15.590588][  T658] [c00000003d8cfbb0] [c00000000094a784] __mutex_lock+0x84/0xb20
> [   15.590643][  T658] [c00000003d8cfcc0] [c000000000954038] zone_pcp_update+0x34/0x64
> [   15.590689][  T658] [c00000003d8cfcf0] [c000000000b9e6bc] deferred_init_memmap+0x1b8/0x26c
> [   15.590739][  T658] [c00000003d8cfdb0] [c000000000149528] kthread+0x1a8/0x1b0
> [   15.590790][  T658] [c00000003d8cfe20] [c00000000000b748] ret_from_kernel_thread+0x5c/0x74

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
  2019-10-21 14:01   ` Qian Cai
  2019-10-21 14:12     ` Michal Hocko
@ 2019-10-21 14:25     ` Mel Gorman
  1 sibling, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2019-10-21 14:25 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Thomas Gleixner,
	Matt Fleming, Borislav Petkov, Linux-MM,
	Linux Kernel Mailing List

On Mon, Oct 21, 2019 at 10:01:24AM -0400, Qian Cai wrote:
> Warnings from linux-next,
> 
> [   14.265911][  T659] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
> [   14.265992][  T659] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 659, name: pgdatinit8
> [   14.266044][  T659] 1 lock held by pgdatinit8/659:

Fixed in v2 posted this morning. It should hit linux-next eventually.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mm, meminit: Recalculate pcpu batch and high limits after init completes -fix
  2019-10-21  9:48 ` [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
                     ` (2 preceding siblings ...)
  2019-10-21 14:01   ` Qian Cai
@ 2019-10-21 19:39   ` Mel Gorman
       [not found]   ` <20191026131036.A7A5421655@mail.kernel.org>
  4 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2019-10-21 19:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List, lkp

LKP reported the following build problem from two hunks that did not
survive the reshuffling of the series reordering.

 ld: mm/page_alloc.o: in function `page_alloc_init_late':
 mm/page_alloc.c:1956: undefined reference to `zone_pcp_update'

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4179376bb336..e9926bf77463 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8524,7 +8524,6 @@ void free_contig_range(unsigned long pfn, unsigned int nr_pages)
 	WARN(count != 0, "%d pages are still in use!\n", count);
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
 /*
  * The zone indicated has a new number of managed_pages; batch sizes and percpu
  * page high values need to be recalulated.
@@ -8535,7 +8534,6 @@ void __meminit zone_pcp_update(struct zone *zone)
 	__zone_pcp_update(zone);
 	mutex_unlock(&pcp_batch_high_lock);
 }
-#endif
 
 void zone_pcp_reset(struct zone *zone)
 {

^ permalink raw reply related	[flat|nested] 13+ messages in thread

[parent not found: <20191026131036.A7A5421655@mail.kernel.org>]

* Re: [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes
       [not found]   ` <20191026131036.A7A5421655@mail.kernel.org>
@ 2019-10-27 20:43     ` Mel Gorman
  0 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2019-10-27 20:43 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Andrew Morton, Michal Hocko, stable

On Sat, Oct 26, 2019 at 01:10:35PM +0000, Sasha Levin wrote:
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: 4.1+
> 
> The bot has tested the following trees: v5.3.7, v4.19.80, v4.14.150, v4.9.197, v4.4.197.
> 
> v5.3.7: Build OK!
> v4.19.80: Build failed! Errors:
> 
> v4.14.150: Failed to apply! Possible dependencies:
>     3c2c648842843 ("mm/page_alloc.c: fix typos in comments")
>     66e8b438bd5c7 ("mm/memblock.c: make the index explicit argument of for_each_memblock_type")
>     c9e97a1997fbf ("mm: initialize pages on demand during boot")
> 
> v4.9.197: Failed to apply! Possible dependencies:
>     3c2c648842843 ("mm/page_alloc.c: fix typos in comments")
>     66e8b438bd5c7 ("mm/memblock.c: make the index explicit argument of for_each_memblock_type")
>     c9e97a1997fbf ("mm: initialize pages on demand during boot")
> 
> v4.4.197: Failed to apply! Possible dependencies:
>     0a687aace3b8e ("mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled")
>     0caeef63e6d2f ("libnvdimm: Add a poison list and export badblocks")
>     0e749e54244ee ("dax: increase granularity of dax_clear_blocks() operations")
>     34c0fd540e79f ("mm, dax, pmem: introduce pfn_t")
>     3c2c648842843 ("mm/page_alloc.c: fix typos in comments")
>     3da88fb3bacfa ("mm, oom: move GFP_NOFS check to out_of_memory")
>     4b94ffdc4163b ("x86, mm: introduce vmem_altmap to augment vmemmap_populate()")
>     5020e285856cb ("mm, oom: give __GFP_NOFAIL allocations access to memory reserves")
>     52db400fcd502 ("pmem, dax: clean up clear_pmem()")
>     66e8b438bd5c7 ("mm/memblock.c: make the index explicit argument of for_each_memblock_type")
>     7cf91a98e607c ("mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous")
>     87ba05dff3510 ("libnvdimm: don't fail init for full badblocks list")
>     8c9c1701c7c23 ("mm/memblock: introduce for_each_memblock_type()")
>     9476df7d80dfc ("mm: introduce find_dev_pagemap()")
>     ad9a8bde2cb19 ("libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h")
>     b2e0d1625e193 ("dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()")
>     b95f5f4391fad ("libnvdimm: convert to statically allocated badblocks")
>     ba6c19fd113a3 ("include/linux/memblock.h: Clean up code for several trivial details")
>     c9e97a1997fbf ("mm: initialize pages on demand during boot")
> 
> 
> NOTE: The patch will not be queued to stable trees until it is upstream.
> 
> How should we proceed with this patch?
> 

What were the 4.19.80 build errors?

For the older kernels, it would have to be confirmed those kernels are
definietly affected. The test machines I tried fails to even boot on those
kernels so I need to find a NUMA machine that is old enough to boot those
kernels and confirmed affected by the bug before determining what the
backport needs to look like.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler
  2019-10-21  9:48 [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit v2 Mel Gorman
  2019-10-21  9:48 ` [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
@ 2019-10-21  9:48 ` Mel Gorman
  2019-10-21 11:44   ` Vlastimil Babka
  2019-10-21  9:48 ` [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm Mel Gorman
  2 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2019-10-21  9:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List, Mel Gorman

Both the percpu_pagelist_fraction sysctl handler and memory hotplug
have a common requirement of updating the pcpu page allocation batch
and high values. Split the relevant helper to share common code.

No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f972076d0f6b..4179376bb336 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7991,6 +7991,15 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+static void __zone_pcp_update(struct zone *zone)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu)
+		pageset_set_high_and_batch(zone,
+				per_cpu_ptr(zone->pageset, cpu));
+}
+
 /*
  * percpu_pagelist_fraction - changes the pcp->high for each zone on each
  * cpu.  It is the fraction of total pages in each zone that a hot per cpu
@@ -8022,13 +8031,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 	if (percpu_pagelist_fraction == old_percpu_pagelist_fraction)
 		goto out;
 
-	for_each_populated_zone(zone) {
-		unsigned int cpu;
-
-		for_each_possible_cpu(cpu)
-			pageset_set_high_and_batch(zone,
-					per_cpu_ptr(zone->pageset, cpu));
-	}
+	for_each_populated_zone(zone)
+		__zone_pcp_update(zone);
 out:
 	mutex_unlock(&pcp_batch_high_lock);
 	return ret;
@@ -8527,11 +8531,8 @@ void free_contig_range(unsigned long pfn, unsigned int nr_pages)
  */
 void __meminit zone_pcp_update(struct zone *zone)
 {
-	unsigned cpu;
 	mutex_lock(&pcp_batch_high_lock);
-	for_each_possible_cpu(cpu)
-		pageset_set_high_and_batch(zone,
-				per_cpu_ptr(zone->pageset, cpu));
+	__zone_pcp_update(zone);
 	mutex_unlock(&pcp_batch_high_lock);
 }
 #endif
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler
  2019-10-21  9:48 ` [PATCH 2/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler Mel Gorman
@ 2019-10-21 11:44   ` Vlastimil Babka
  0 siblings, 0 replies; 13+ messages in thread
From: Vlastimil Babka @ 2019-10-21 11:44 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Michal Hocko, Thomas Gleixner, Matt Fleming, Borislav Petkov,
	Linux-MM, Linux Kernel Mailing List

On 10/21/19 11:48 AM, Mel Gorman wrote:
> Both the percpu_pagelist_fraction sysctl handler and memory hotplug
> have a common requirement of updating the pcpu page allocation batch
> and high values. Split the relevant helper to share common code.
> 
> No functional change.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm
  2019-10-21  9:48 [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit v2 Mel Gorman
  2019-10-21  9:48 ` [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
  2019-10-21  9:48 ` [PATCH 2/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler Mel Gorman
@ 2019-10-21  9:48 ` Mel Gorman
  2019-10-21 11:42   ` Vlastimil Babka
  2 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2019-10-21  9:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Thomas Gleixner, Matt Fleming,
	Borislav Petkov, Linux-MM, Linux Kernel Mailing List, Mel Gorman

Memory hotplug needs to be able to reset and reinit the pcpu allocator
batch and high limits but this action is internal to the VM. Move
the declaration to internal.h

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h | 3 ---
 mm/internal.h      | 3 +++
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc292273e6ba..22d6104f2341 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2219,9 +2219,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
-extern void zone_pcp_update(struct zone *zone);
-extern void zone_pcp_reset(struct zone *zone);
-
 /* page_alloc.c */
 extern int min_free_kbytes;
 extern int watermark_boost_factor;
diff --git a/mm/internal.h b/mm/internal.h
index 0d5f720c75ab..0a3d41c7b3c5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -165,6 +165,9 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
 					gfp_t gfp_flags);
 extern int user_min_free_kbytes;
 
+extern void zone_pcp_update(struct zone *zone);
+extern void zone_pcp_reset(struct zone *zone);
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm
  2019-10-21  9:48 ` [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm Mel Gorman
@ 2019-10-21 11:42   ` Vlastimil Babka
  0 siblings, 0 replies; 13+ messages in thread
From: Vlastimil Babka @ 2019-10-21 11:42 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Michal Hocko, Thomas Gleixner, Matt Fleming, Borislav Petkov,
	Linux-MM, Linux Kernel Mailing List

On 10/21/19 11:48 AM, Mel Gorman wrote:
> Memory hotplug needs to be able to reset and reinit the pcpu allocator
> batch and high limits but this action is internal to the VM. Move
> the declaration to internal.h
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-10-27 20:51 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-21  9:48 [PATCH 0/3] Recalculate per-cpu page allocator batch and high limits after deferred meminit v2 Mel Gorman
2019-10-21  9:48 ` [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
2019-10-21 10:27   ` Michal Hocko
2019-10-21 11:42   ` Vlastimil Babka
2019-10-21 14:01   ` Qian Cai
2019-10-21 14:12     ` Michal Hocko
2019-10-21 14:25     ` Mel Gorman
2019-10-21 19:39   ` [PATCH] mm, meminit: Recalculate pcpu batch and high limits after init completes -fix Mel Gorman
     [not found]   ` <20191026131036.A7A5421655@mail.kernel.org>
2019-10-27 20:43     ` [PATCH 1/3] mm, meminit: Recalculate pcpu batch and high limits after init completes Mel Gorman
2019-10-21  9:48 ` [PATCH 2/3] mm, pcp: Share common code between memory hotplug and percpu sysctl handler Mel Gorman
2019-10-21 11:44   ` Vlastimil Babka
2019-10-21  9:48 ` [PATCH 3/3] mm, pcpu: Make zone pcp updates and reset internal to the mm Mel Gorman
2019-10-21 11:42   ` Vlastimil Babka

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.