linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/3] initialize deferred pages with interrupts enabled
@ 2020-04-03 14:09 Pavel Tatashin
  2020-04-03 14:09 ` [PATCH v4 1/3] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Pavel Tatashin @ 2020-04-03 14:09 UTC
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

Keep interrupts enabled during deferred page initialization in order to
make the code more modular and allow jiffies to update.

The original approach and discussion can be found here:
https://lore.kernel.org/linux-mm/20200311123848.118638-1-shile.zhang@linux.alibaba.com

Changelog
v4:
- Added Reviewed-by from Daniel Jordan, and fixed a mistake: sched_clock()
  had been used instead of cond_resched().

v3:
- Split the cond_resched() change into a separate patch as suggested by
  David Hildenbrand

v2:
- Addressed comments from Daniel Jordan. Replaced touch_nmi_watchdog() with
  cond_resched(). Added Reviewed-by and Acked-by tags.

v1:
https://lore.kernel.org/linux-mm/20200401193238.22544-1-pasha.tatashin@soleen.com

Daniel Jordan (1):
  mm: call touch_nmi_watchdog() on max order boundaries in deferred init

Pavel Tatashin (2):
  mm: initialize deferred pages with interrupts enabled
  mm: call cond_resched() from deferred_init_memmap()

 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 27 +++++++++++----------------
 2 files changed, 13 insertions(+), 16 deletions(-)

-- 
2.17.1



* [PATCH v4 1/3] mm: call touch_nmi_watchdog() on max order boundaries in deferred init
  2020-04-03 14:09 [PATCH v4 0/3] initialize deferred pages with interrupts enabled Pavel Tatashin
@ 2020-04-03 14:09 ` Pavel Tatashin
  2020-04-03 14:09 ` [PATCH v4 2/3] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-03 14:09 ` [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap() Pavel Tatashin
  2 siblings, 0 replies; 9+ messages in thread
From: Pavel Tatashin @ 2020-04-03 14:09 UTC
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

From: Daniel Jordan <daniel.m.jordan@oracle.com>

deferred_init_memmap() disables interrupts the entire time, so it calls
touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
will run with interrupts enabled, at which point cond_resched() should
be used instead.

deferred_grow_zone() makes the same watchdog calls through code shared
with deferred init but will continue to run with interrupts disabled, so
it can't call cond_resched().

Pull the watchdog calls up to these two places to allow the first to be
changed later, independently of the second.  The frequency reduces from
twice per pageblock (init and free) to once per max order block.
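
The frequency change described above can be worked through with a small
calculation. This is only a sketch under assumed values: 512 pages per
pageblock and 1024 pages per max order block (typical for x86_64 with
4 KiB pages at the time), and a range aligned to both sizes. The helper
names are hypothetical and do not exist in the kernel.

```c
#include <assert.h>

/* Assumed values for illustration only (x86_64, 4 KiB pages). */
#define PAGEBLOCK_NR_PAGES  512UL   /* assumption: pageblock_order == 9 */
#define MAX_ORDER_NR_PAGES  1024UL  /* assumption: one max order block */

/* Round-up division; the divisor must be non-zero. */
static unsigned long div_round_up(unsigned long n, unsigned long d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/*
 * Before: one watchdog call per pageblock in the init pass plus one
 * per pageblock in the free pass.
 */
static unsigned long watchdog_calls_before(unsigned long nr_pages)
{
	return 2 * div_round_up(nr_pages, PAGEBLOCK_NR_PAGES);
}

/* After: one watchdog call per max order block, made by the caller. */
static unsigned long watchdog_calls_after(unsigned long nr_pages)
{
	return div_round_up(nr_pages, MAX_ORDER_NR_PAGES);
}
```

Under these assumptions, a range of 1 << 20 pages (4 GiB) goes from 4096
watchdog calls to 1024, a fourfold reduction.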

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: stable@vger.kernel.org # 4.17+

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5f76da8cd4e..d95bfd328107 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1633,7 +1633,6 @@ static void __init deferred_free_pages(unsigned long pfn,
 		} else if (!(pfn & nr_pgmask)) {
 			deferred_free_range(pfn - nr_free, nr_free);
 			nr_free = 1;
-			touch_nmi_watchdog();
 		} else {
 			nr_free++;
 		}
@@ -1663,7 +1662,6 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
 			continue;
 		} else if (!page || !(pfn & nr_pgmask)) {
 			page = pfn_to_page(pfn);
-			touch_nmi_watchdog();
 		} else {
 			page++;
 		}
@@ -1803,8 +1801,10 @@ static int __init deferred_init_memmap(void *data)
 	 * that we can avoid introducing any issues with the buddy
 	 * allocator.
 	 */
-	while (spfn < epfn)
+	while (spfn < epfn) {
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		touch_nmi_watchdog();
+	}
 zone_empty:
 	pgdat_resize_unlock(pgdat, &flags);
 
@@ -1888,6 +1888,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 		first_deferred_pfn = spfn;
 
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		touch_nmi_watchdog();
 
 		/* We should only stop along section boundaries */
 		if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
-- 
2.17.1



* [PATCH v4 2/3] mm: initialize deferred pages with interrupts enabled
  2020-04-03 14:09 [PATCH v4 0/3] initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-03 14:09 ` [PATCH v4 1/3] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
@ 2020-04-03 14:09 ` Pavel Tatashin
  2020-04-03 15:18   ` David Hildenbrand
  2020-04-03 14:09 ` [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap() Pavel Tatashin
  2 siblings, 1 reply; 9+ messages in thread
From: Pavel Tatashin @ 2020-04-03 14:09 UTC
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

Initializing struct pages is a long task, and keeping interrupts disabled
for the duration of this operation introduces a number of problems.

1. jiffies are not updated for a long period of time, and thus incorrect
   time is reported. See proposed solution and discussion here:
   lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
2. It prevents further improvement of deferred page initialization
   through intra-node multi-threading.

Interrupts are currently kept disabled to solve a rather theoretical
problem that was never observed in the real world (see 3a2d7fa8a3d5).

Let's keep interrupts enabled. If we ever encounter a scenario where an
interrupt thread wants to allocate a large amount of memory this early in
boot, we can deal with that by growing the zone (see deferred_grow_zone())
by the needed amount before starting the deferred_init_memmap() threads.

Before:
[    1.232459] node 0 initialised, 12058412 pages in 1ms

After:
[    1.632580] node 0 initialised, 12051227 pages in 436ms
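
The Before/After lines above are consistent with problem 1: when the
elapsed time is derived from a tick counter that is only advanced by
timer interrupts, work done with interrupts disabled appears to take
almost no time. The following is a minimal userspace sketch of that
effect, not kernel code; struct sim_clock and both functions are
hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a tick counter driven by timer interrupts. */
struct sim_clock {
	unsigned long ticks;   /* plays the role of jiffies */
	bool irqs_enabled;     /* whether timer ticks are delivered */
};

/* One timer period passes; the counter moves only if ticks are delivered. */
static void sim_timer_period(struct sim_clock *clk)
{
	assert(clk);
	if (clk->irqs_enabled)
		clk->ticks++;
}

/*
 * Run work that really lasts real_periods timer periods and return the
 * elapsed time as measured from the tick counter.
 */
static unsigned long sim_measured_ticks(bool irqs_enabled,
					unsigned long real_periods)
{
	struct sim_clock clk = { .ticks = 0, .irqs_enabled = irqs_enabled };
	unsigned long start = clk.ticks;
	unsigned long i;

	for (i = 0; i < real_periods; i++)
		sim_timer_period(&clk);

	return clk.ticks - start;
}
```

In this model, work lasting 436 periods is measured as 0 ticks when ticks
are not delivered and as 436 ticks when they are.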

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: stable@vger.kernel.org # 4.17+

Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 20 +++++++-------------
 2 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e84d448988b6..ac6a8245f063 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -723,6 +723,8 @@ typedef struct pglist_data {
 	/*
 	 * Must be held any time you expect node_start_pfn,
 	 * node_present_pages, node_spanned_pages or nr_zones to stay constant.
+	 * Also synchronizes pgdat->first_deferred_pfn during deferred page
+	 * init.
 	 *
 	 * pgdat_resize_lock() and pgdat_resize_unlock() are provided to
 	 * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d95bfd328107..5ffa8d7e5545 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1784,6 +1784,13 @@ static int __init deferred_init_memmap(void *data)
 	BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
 	pgdat->first_deferred_pfn = ULONG_MAX;
 
+	/*
+	 * Once we unlock here, the zone cannot be grown anymore, thus if an
+	 * interrupt thread must allocate this early in boot, zone must be
+	 * pre-grown prior to start of deferred page initialization.
+	 */
+	pgdat_resize_unlock(pgdat, &flags);
+
 	/* Only the highest zone is deferred so find it */
 	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 		zone = pgdat->node_zones + zid;
@@ -1806,8 +1813,6 @@ static int __init deferred_init_memmap(void *data)
 		touch_nmi_watchdog();
 	}
 zone_empty:
-	pgdat_resize_unlock(pgdat, &flags);
-
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
@@ -1849,17 +1854,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 
 	pgdat_resize_lock(pgdat, &flags);
 
-	/*
-	 * If deferred pages have been initialized while we were waiting for
-	 * the lock, return true, as the zone was grown.  The caller will retry
-	 * this zone.  We won't return to this function since the caller also
-	 * has this static branch.
-	 */
-	if (!static_branch_unlikely(&deferred_pages)) {
-		pgdat_resize_unlock(pgdat, &flags);
-		return true;
-	}
-
 	/*
 	 * If someone grew this zone while we were waiting for spinlock, return
 	 * true, as there might be enough pages already.
-- 
2.17.1



* [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap()
  2020-04-03 14:09 [PATCH v4 0/3] initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-03 14:09 ` [PATCH v4 1/3] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
  2020-04-03 14:09 ` [PATCH v4 2/3] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
@ 2020-04-03 14:09 ` Pavel Tatashin
  2020-04-03 14:12   ` Michal Hocko
                     ` (3 more replies)
  2 siblings, 4 replies; 9+ messages in thread
From: Pavel Tatashin @ 2020-04-03 14:09 UTC
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

Now that deferred pages are initialized with interrupts enabled we can
replace touch_nmi_watchdog() with cond_resched(), as it was before
3a2d7fa8a3d5.
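
As a rough userspace analogy of the loop shape this patch produces, the
sketch below processes a range one chunk at a time and offers to give up
the CPU after each chunk. sched_yield() stands in for cond_resched(), and
init_one_chunk(), init_range_preemptible(), CHUNK_NR_PAGES and nr_yields
are hypothetical names; none of this is kernel code.

```c
#include <assert.h>
#include <sched.h>

#define CHUNK_NR_PAGES 1024UL /* assumption: stand-in for one max order block */

/* Number of times the loop offered to yield; kept so it can be checked. */
static unsigned long nr_yields;

/*
 * Hypothetical stand-in for deferred_init_maxorder(): handles at most
 * one chunk, advances *spfn, and returns the number of pages handled.
 */
static unsigned long init_one_chunk(unsigned long *spfn, unsigned long epfn)
{
	unsigned long n = epfn - *spfn;

	if (n > CHUNK_NR_PAGES)
		n = CHUNK_NR_PAGES;
	*spfn += n;
	return n;
}

/* Process [spfn, epfn) and yield once per chunk. */
static unsigned long init_range_preemptible(unsigned long spfn,
					    unsigned long epfn)
{
	unsigned long nr_pages = 0;

	assert(spfn <= epfn);
	while (spfn < epfn) {
		nr_pages += init_one_chunk(&spfn, epfn);
		sched_yield(); /* userspace analogue of cond_resched() */
		nr_yields++;
	}
	return nr_pages;
}
```

Calling init_range_preemptible(0, 4096) handles four chunks and yields
after each one, so other runnable work gets a chance between chunks.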

For now, we cannot do the same in deferred_grow_zone() as it still
initializes pages with interrupts disabled.

This change fixes the RCU problem described in:
linux-mm/20200401104156.11564-2-david@redhat.com

[   60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   60.475000] rcu:  1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
[   60.475000] rcu:  (detected by 0, t=60002 jiffies, g=-1199, q=1)
[   60.475000] Sending NMI from CPU 0 to CPUs 1:
[    1.760091] NMI backtrace for cpu 1
[    1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
[    1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
[    1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
[    1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
[    1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
[    1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
[    1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
[    1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
[    1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
[    1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
[    1.760091] FS:  0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
[    1.760091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
[    1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    1.760091] Call Trace:
[    1.760091]  deferred_init_pages+0x8f/0xbf
[    1.760091]  deferred_init_memmap+0x184/0x29d
[    1.760091]  ? deferred_free_pages.isra.97+0xba/0xba
[    1.760091]  kthread+0x112/0x130
[    1.760091]  ? kthread_flush_work_fn+0x10/0x10
[    1.760091]  ret_from_fork+0x35/0x40
[   89.123011] node 0 initialised, 1055935372 pages in 88650ms

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: stable@vger.kernel.org # 4.17+

Reported-by: Yiqian Wei <yiwei@redhat.com>
Tested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ffa8d7e5545..deacfe575872 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1810,7 +1810,7 @@ static int __init deferred_init_memmap(void *data)
 	 */
 	while (spfn < epfn) {
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
-		touch_nmi_watchdog();
+		cond_resched();
 	}
 zone_empty:
 	/* Sanity check that the next zone really is unpopulated */
-- 
2.17.1



* Re: [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap()
  2020-04-03 14:09 ` [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap() Pavel Tatashin
@ 2020-04-03 14:12   ` Michal Hocko
  2020-04-03 15:16   ` David Hildenbrand
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2020-04-03 14:12 UTC
  To: Pavel Tatashin
  Cc: linux-kernel, akpm, linux-mm, dan.j.williams, shile.zhang,
	daniel.m.jordan, ktkhai, david, jmorris, sashal, vbabka

On Fri 03-04-20 10:09:52, Pavel Tatashin wrote:
> Now that deferred pages are initialized with interrupts enabled we can
> replace touch_nmi_watchdog() with cond_resched(), as it was before
> 3a2d7fa8a3d5.
> 
> For now, we cannot do the same in deferred_grow_zone() as it still
> initializes pages with interrupts disabled.
> 
> This change fixes the RCU problem described in:
> linux-mm/20200401104156.11564-2-david@redhat.com
> 
> [   60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [   60.475000] rcu:  1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
> [   60.475000] rcu:  (detected by 0, t=60002 jiffies, g=-1199, q=1)
> [   60.475000] Sending NMI from CPU 0 to CPUs 1:
> [    1.760091] NMI backtrace for cpu 1
> [    1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
> [    1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
> [    1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
> [    1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
> [    1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
> [    1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
> [    1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
> [    1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
> [    1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
> [    1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
> [    1.760091] FS:  0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
> [    1.760091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
> [    1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    1.760091] Call Trace:
> [    1.760091]  deferred_init_pages+0x8f/0xbf
> [    1.760091]  deferred_init_memmap+0x184/0x29d
> [    1.760091]  ? deferred_free_pages.isra.97+0xba/0xba
> [    1.760091]  kthread+0x112/0x130
> [    1.760091]  ? kthread_flush_work_fn+0x10/0x10
> [    1.760091]  ret_from_fork+0x35/0x40
> [   89.123011] node 0 initialised, 1055935372 pages in 88650ms
> 
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Cc: stable@vger.kernel.org # 4.17+
> 
> Reported-by: Yiqian Wei <yiwei@redhat.com>
> Tested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5ffa8d7e5545..deacfe575872 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1810,7 +1810,7 @@ static int __init deferred_init_memmap(void *data)
>  	 */
>  	while (spfn < epfn) {
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> -		touch_nmi_watchdog();
> +		cond_resched();
>  	}
>  zone_empty:
>  	/* Sanity check that the next zone really is unpopulated */
> -- 
> 2.17.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap()
  2020-04-03 14:09 ` [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap() Pavel Tatashin
  2020-04-03 14:12   ` Michal Hocko
@ 2020-04-03 15:16   ` David Hildenbrand
  2020-04-03 15:17   ` David Hildenbrand
  2020-04-06  7:02   ` Pankaj Gupta
  3 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2020-04-03 15:16 UTC
  To: Pavel Tatashin, linux-kernel, akpm, mhocko, linux-mm,
	dan.j.williams, shile.zhang, daniel.m.jordan, ktkhai, jmorris,
	sashal, vbabka

On 03.04.20 16:09, Pavel Tatashin wrote:
> Now that deferred pages are initialized with interrupts enabled we can
> replace touch_nmi_watchdog() with cond_resched(), as it was before
> 3a2d7fa8a3d5.
> 
> For now, we cannot do the same in deferred_grow_zone() as it still
> initializes pages with interrupts disabled.
> 
> This change fixes the RCU problem described in:
> linux-mm/20200401104156.11564-2-david@redhat.com
> 
> [   60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [   60.475000] rcu:  1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
> [   60.475000] rcu:  (detected by 0, t=60002 jiffies, g=-1199, q=1)
> [   60.475000] Sending NMI from CPU 0 to CPUs 1:
> [    1.760091] NMI backtrace for cpu 1
> [    1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
> [    1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
> [    1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
> [    1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
> [    1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
> [    1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
> [    1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
> [    1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
> [    1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
> [    1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
> [    1.760091] FS:  0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
> [    1.760091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
> [    1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    1.760091] Call Trace:
> [    1.760091]  deferred_init_pages+0x8f/0xbf
> [    1.760091]  deferred_init_memmap+0x184/0x29d
> [    1.760091]  ? deferred_free_pages.isra.97+0xba/0xba
> [    1.760091]  kthread+0x112/0x130
> [    1.760091]  ? kthread_flush_work_fn+0x10/0x10
> [    1.760091]  ret_from_fork+0x35/0x40
> [   89.123011] node 0 initialised, 1055935372 pages in 88650ms
> 
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Cc: stable@vger.kernel.org # 4.17+
> 
> Reported-by: Yiqian Wei <yiwei@redhat.com>
> Tested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5ffa8d7e5545..deacfe575872 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1810,7 +1810,7 @@ static int __init deferred_init_memmap(void *data)
>  	 */
>  	while (spfn < epfn) {
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> -		touch_nmi_watchdog();
> +		cond_resched();
>  	}
>  zone_empty:
>  	/* Sanity check that the next zone really is unpopulated */
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap()
  2020-04-03 14:09 ` [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap() Pavel Tatashin
  2020-04-03 14:12   ` Michal Hocko
  2020-04-03 15:16   ` David Hildenbrand
@ 2020-04-03 15:17   ` David Hildenbrand
  2020-04-06  7:02   ` Pankaj Gupta
  3 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2020-04-03 15:17 UTC
  To: Pavel Tatashin, linux-kernel, akpm, mhocko, linux-mm,
	dan.j.williams, shile.zhang, daniel.m.jordan, ktkhai, jmorris,
	sashal, vbabka

On 03.04.20 16:09, Pavel Tatashin wrote:
> Now that deferred pages are initialized with interrupts enabled we can
> replace touch_nmi_watchdog() with cond_resched(), as it was before
> 3a2d7fa8a3d5.
> 
> For now, we cannot do the same in deferred_grow_zone() as it still
> initializes pages with interrupts disabled.
> 
> This change fixes the RCU problem described in:
> linux-mm/20200401104156.11564-2-david@redhat.com

BTW,

https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com

Would be better.


-- 
Thanks,

David / dhildenb



* Re: [PATCH v4 2/3] mm: initialize deferred pages with interrupts enabled
  2020-04-03 14:09 ` [PATCH v4 2/3] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
@ 2020-04-03 15:18   ` David Hildenbrand
  0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2020-04-03 15:18 UTC
  To: Pavel Tatashin, linux-kernel, akpm, mhocko, linux-mm,
	dan.j.williams, shile.zhang, daniel.m.jordan, ktkhai, jmorris,
	sashal, vbabka

On 03.04.20 16:09, Pavel Tatashin wrote:
> Initializing struct pages is a long task, and keeping interrupts disabled
> for the duration of this operation introduces a number of problems.
> 
> 1. jiffies are not updated for a long period of time, and thus incorrect
>    time is reported. See proposed solution and discussion here:
>    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
> 2. It prevents further improvement of deferred page initialization
>    through intra-node multi-threading.
> 
> Interrupts are currently kept disabled to solve a rather theoretical
> problem that was never observed in the real world (see 3a2d7fa8a3d5).
> 
> Let's keep interrupts enabled. If we ever encounter a scenario where an
> interrupt thread wants to allocate a large amount of memory this early in
> boot, we can deal with that by growing the zone (see deferred_grow_zone())
> by the needed amount before starting the deferred_init_memmap() threads.
> 
> Before:
> [    1.232459] node 0 initialised, 12058412 pages in 1ms
> 
> After:
> [    1.632580] node 0 initialised, 12051227 pages in 436ms
> 
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Cc: stable@vger.kernel.org # 4.17+
> 
> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/page_alloc.c        | 20 +++++++-------------
>  2 files changed, 9 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e84d448988b6..ac6a8245f063 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -723,6 +723,8 @@ typedef struct pglist_data {
>  	/*
>  	 * Must be held any time you expect node_start_pfn,
>  	 * node_present_pages, node_spanned_pages or nr_zones to stay constant.
> +	 * Also synchronizes pgdat->first_deferred_pfn during deferred page
> +	 * init.
>  	 *
>  	 * pgdat_resize_lock() and pgdat_resize_unlock() are provided to
>  	 * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d95bfd328107..5ffa8d7e5545 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1784,6 +1784,13 @@ static int __init deferred_init_memmap(void *data)
>  	BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
>  	pgdat->first_deferred_pfn = ULONG_MAX;
>  
> +	/*
> +	 * Once we unlock here, the zone cannot be grown anymore, thus if an
> +	 * interrupt thread must allocate this early in boot, zone must be
> +	 * pre-grown prior to start of deferred page initialization.
> +	 */
> +	pgdat_resize_unlock(pgdat, &flags);
> +
>  	/* Only the highest zone is deferred so find it */
>  	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>  		zone = pgdat->node_zones + zid;
> @@ -1806,8 +1813,6 @@ static int __init deferred_init_memmap(void *data)
>  		touch_nmi_watchdog();
>  	}
>  zone_empty:
> -	pgdat_resize_unlock(pgdat, &flags);
> -
>  	/* Sanity check that the next zone really is unpopulated */
>  	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>  
> @@ -1849,17 +1854,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
>  
>  	pgdat_resize_lock(pgdat, &flags);
>  
> -	/*
> -	 * If deferred pages have been initialized while we were waiting for
> -	 * the lock, return true, as the zone was grown.  The caller will retry
> -	 * this zone.  We won't return to this function since the caller also
> -	 * has this static branch.
> -	 */
> -	if (!static_branch_unlikely(&deferred_pages)) {
> -		pgdat_resize_unlock(pgdat, &flags);
> -		return true;
> -	}
> -
>  	/*
>  	 * If someone grew this zone while we were waiting for spinlock, return
>  	 * true, as there might be enough pages already.
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap()
  2020-04-03 14:09 ` [PATCH v4 3/3] mm: call cond_resched() from deferred_init_memmap() Pavel Tatashin
                     ` (2 preceding siblings ...)
  2020-04-03 15:17   ` David Hildenbrand
@ 2020-04-06  7:02   ` Pankaj Gupta
  3 siblings, 0 replies; 9+ messages in thread
From: Pankaj Gupta @ 2020-04-06  7:02 UTC
  To: Pavel Tatashin
  Cc: linux-kernel, Andrew Morton, Michal Hocko, linux-mm,
	Dan Williams, Shile Zhang, Daniel Jordan, Kirill Tkhai,
	David Hildenbrand, jmorris, sashal, Vlastimil Babka

> Now that deferred pages are initialized with interrupts enabled we can
> replace touch_nmi_watchdog() with cond_resched(), as it was before
> 3a2d7fa8a3d5.
>
> For now, we cannot do the same in deferred_grow_zone() as it still
> initializes pages with interrupts disabled.
>
> This change fixes the RCU problem described in:
> linux-mm/20200401104156.11564-2-david@redhat.com
>
> [   60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [   60.475000] rcu:  1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
> [   60.475000] rcu:  (detected by 0, t=60002 jiffies, g=-1199, q=1)
> [   60.475000] Sending NMI from CPU 0 to CPUs 1:
> [    1.760091] NMI backtrace for cpu 1
> [    1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
> [    1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
> [    1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
> [    1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
> [    1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
> [    1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
> [    1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
> [    1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
> [    1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
> [    1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
> [    1.760091] FS:  0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
> [    1.760091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
> [    1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    1.760091] Call Trace:
> [    1.760091]  deferred_init_pages+0x8f/0xbf
> [    1.760091]  deferred_init_memmap+0x184/0x29d
> [    1.760091]  ? deferred_free_pages.isra.97+0xba/0xba
> [    1.760091]  kthread+0x112/0x130
> [    1.760091]  ? kthread_flush_work_fn+0x10/0x10
> [    1.760091]  ret_from_fork+0x35/0x40
> [   89.123011] node 0 initialised, 1055935372 pages in 88650ms
>
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Cc: stable@vger.kernel.org # 4.17+
>
> Reported-by: Yiqian Wei <yiwei@redhat.com>
> Tested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5ffa8d7e5545..deacfe575872 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1810,7 +1810,7 @@ static int __init deferred_init_memmap(void *data)
>          */
>         while (spfn < epfn) {
>                 nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> -               touch_nmi_watchdog();
> +               cond_resched();
>         }
>  zone_empty:
>         /* Sanity check that the next zone really is unpopulated */
> --
> 2.17.1

Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>

>
>
