linux-mm.kvack.org archive mirror
* [PATCH v3 0/3] initialize deferred pages with interrupts enabled
@ 2020-04-03 13:35 Pavel Tatashin
  2020-04-03 13:35 ` [PATCH v3 1/3] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Pavel Tatashin @ 2020-04-03 13:35 UTC (permalink / raw)
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

Keep interrupts enabled during deferred page initialization in order to
make the code more modular and to allow jiffies to update.

The original approach and discussion can be found here:
https://lore.kernel.org/linux-mm/20200311123848.118638-1-shile.zhang@linux.alibaba.com

Changelog
v3:
- Split the cond_resched() change into a separate patch, as suggested by
  David Hildenbrand

v2:
- Addressed comments from Daniel Jordan. Replaced touch_nmi_watchdog() with
  cond_resched(). Added Reviewed-by's and Acked-by's.

v1:
https://lore.kernel.org/linux-mm/20200401193238.22544-1-pasha.tatashin@soleen.com

Daniel Jordan (1):
  mm: call touch_nmi_watchdog() on max order boundaries in deferred init

Pavel Tatashin (2):
  mm: initialize deferred pages with interrupts enabled
  mm: call cond_resched() from deferred_init_memmap()

 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 27 +++++++++++----------------
 2 files changed, 13 insertions(+), 16 deletions(-)

-- 
2.17.1



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v3 1/3] mm: call touch_nmi_watchdog() on max order boundaries in deferred init
  2020-04-03 13:35 [PATCH v3 0/3] initialize deferred pages with interrupts enabled Pavel Tatashin
@ 2020-04-03 13:35 ` Pavel Tatashin
  2020-04-03 13:35 ` [PATCH v3 2/3] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-03 13:35 ` [PATCH v3 3/3] mm: call cond_resched() from deferred_init_memmap() Pavel Tatashin
  2 siblings, 0 replies; 6+ messages in thread
From: Pavel Tatashin @ 2020-04-03 13:35 UTC (permalink / raw)
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

From: Daniel Jordan <daniel.m.jordan@oracle.com>

deferred_init_memmap() disables interrupts the entire time, so it calls
touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
will run with interrupts enabled, at which point cond_resched() should
be used instead.

deferred_grow_zone() makes the same watchdog calls through code shared
with deferred init but will continue to run with interrupts disabled, so
it can't call cond_resched().

Pull the watchdog calls up to these two places to allow the first to be
changed later, independently of the second.  The frequency reduces from
twice per pageblock (init and free) to once per max order block.

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: stable@vger.kernel.org # 4.17+

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>

---
 mm/page_alloc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..e8ff6a176164 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1639,7 +1639,6 @@ static void __init deferred_free_pages(unsigned long pfn,
 		} else if (!(pfn & nr_pgmask)) {
 			deferred_free_range(pfn - nr_free, nr_free);
 			nr_free = 1;
-			touch_nmi_watchdog();
 		} else {
 			nr_free++;
 		}
@@ -1669,7 +1668,6 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
 			continue;
 		} else if (!page || !(pfn & nr_pgmask)) {
 			page = pfn_to_page(pfn);
-			touch_nmi_watchdog();
 		} else {
 			page++;
 		}
@@ -1809,8 +1807,10 @@ static int __init deferred_init_memmap(void *data)
 	 * that we can avoid introducing any issues with the buddy
 	 * allocator.
 	 */
-	while (spfn < epfn)
+	while (spfn < epfn) {
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		touch_nmi_watchdog();
+	}
 zone_empty:
 	pgdat_resize_unlock(pgdat, &flags);
 
@@ -1894,6 +1894,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 		first_deferred_pfn = spfn;
 
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		touch_nmi_watchdog();
 
 		/* We should only stop along section boundaries */
 		if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
-- 
2.17.1




* [PATCH v3 2/3] mm: initialize deferred pages with interrupts enabled
  2020-04-03 13:35 [PATCH v3 0/3] initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-03 13:35 ` [PATCH v3 1/3] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
@ 2020-04-03 13:35 ` Pavel Tatashin
  2020-04-03 13:35 ` [PATCH v3 3/3] mm: call cond_resched() from deferred_init_memmap() Pavel Tatashin
  2 siblings, 0 replies; 6+ messages in thread
From: Pavel Tatashin @ 2020-04-03 13:35 UTC (permalink / raw)
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

Initializing struct pages is a long task and keeping interrupts disabled
for the duration of this operation introduces a number of problems.

1. jiffies are not updated for a long period of time, and thus incorrect
   time is reported. See the proposed solution and discussion here:
   lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
2. It prevents further improving deferred page initialization by allowing
   intra-node multi-threading.

We are keeping interrupts disabled to solve a rather theoretical problem
that was never observed in the real world (see 3a2d7fa8a3d5).

Let's keep interrupts enabled. If we ever encounter a scenario where an
interrupt thread wants to allocate a large amount of memory this early in
boot, we can deal with it by growing the zone (see deferred_grow_zone())
by the needed amount before starting the deferred_init_memmap() threads.

Before:
[    1.232459] node 0 initialised, 12058412 pages in 1ms

After:
[    1.632580] node 0 initialised, 12051227 pages in 436ms

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: stable@vger.kernel.org # 4.17+

Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 20 +++++++-------------
 2 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..c5bdf55da034 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -721,6 +721,8 @@ typedef struct pglist_data {
 	/*
 	 * Must be held any time you expect node_start_pfn,
 	 * node_present_pages, node_spanned_pages or nr_zones to stay constant.
+	 * Also synchronizes pgdat->first_deferred_pfn during deferred page
+	 * init.
 	 *
 	 * pgdat_resize_lock() and pgdat_resize_unlock() are provided to
 	 * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e8ff6a176164..4a60f2427eb0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1790,6 +1790,13 @@ static int __init deferred_init_memmap(void *data)
 	BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
 	pgdat->first_deferred_pfn = ULONG_MAX;
 
+	/*
+	 * Once we unlock here, the zone cannot be grown anymore, thus if an
+	 * interrupt thread must allocate this early in boot, zone must be
+	 * pre-grown prior to start of deferred page initialization.
+	 */
+	pgdat_resize_unlock(pgdat, &flags);
+
 	/* Only the highest zone is deferred so find it */
 	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 		zone = pgdat->node_zones + zid;
@@ -1812,8 +1819,6 @@ static int __init deferred_init_memmap(void *data)
 		touch_nmi_watchdog();
 	}
 zone_empty:
-	pgdat_resize_unlock(pgdat, &flags);
-
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
@@ -1855,17 +1860,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 
 	pgdat_resize_lock(pgdat, &flags);
 
-	/*
-	 * If deferred pages have been initialized while we were waiting for
-	 * the lock, return true, as the zone was grown.  The caller will retry
-	 * this zone.  We won't return to this function since the caller also
-	 * has this static branch.
-	 */
-	if (!static_branch_unlikely(&deferred_pages)) {
-		pgdat_resize_unlock(pgdat, &flags);
-		return true;
-	}
-
 	/*
 	 * If someone grew this zone while we were waiting for spinlock, return
 	 * true, as there might be enough pages already.
-- 
2.17.1




* [PATCH v3 3/3] mm: call cond_resched() from deferred_init_memmap()
  2020-04-03 13:35 [PATCH v3 0/3] initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-03 13:35 ` [PATCH v3 1/3] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
  2020-04-03 13:35 ` [PATCH v3 2/3] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
@ 2020-04-03 13:35 ` Pavel Tatashin
  2020-04-03 13:44   ` Daniel Jordan
  2 siblings, 1 reply; 6+ messages in thread
From: Pavel Tatashin @ 2020-04-03 13:35 UTC (permalink / raw)
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

Now that deferred pages are initialized with interrupts enabled we can
replace touch_nmi_watchdog() with cond_resched(), as it was before
3a2d7fa8a3d5.

For now, we cannot do the same in deferred_grow_zone() as it still
initializes pages with interrupts disabled.

This change fixes the RCU stall problem described in:
linux-mm/20200401104156.11564-2-david@redhat.com

[   60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   60.475000] rcu:  1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
[   60.475000] rcu:  (detected by 0, t=60002 jiffies, g=-1199, q=1)
[   60.475000] Sending NMI from CPU 0 to CPUs 1:
[    1.760091] NMI backtrace for cpu 1
[    1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
[    1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
[    1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
[    1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
[    1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
[    1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
[    1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
[    1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
[    1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
[    1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
[    1.760091] FS:  0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
[    1.760091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
[    1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    1.760091] Call Trace:
[    1.760091]  deferred_init_pages+0x8f/0xbf
[    1.760091]  deferred_init_memmap+0x184/0x29d
[    1.760091]  ? deferred_free_pages.isra.97+0xba/0xba
[    1.760091]  kthread+0x112/0x130
[    1.760091]  ? kthread_flush_work_fn+0x10/0x10
[    1.760091]  ret_from_fork+0x35/0x40
[   89.123011] node 0 initialised, 1055935372 pages in 88650ms

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: stable@vger.kernel.org # 4.17+

Reported-by: Yiqian Wei <yiwei@redhat.com>
Tested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a60f2427eb0..445f74358997 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1816,7 +1816,7 @@ static int __init deferred_init_memmap(void *data)
 	 */
 	while (spfn < epfn) {
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
-		touch_nmi_watchdog();
+		sched_clock();
 	}
 zone_empty:
 	/* Sanity check that the next zone really is unpopulated */
-- 
2.17.1




* Re: [PATCH v3 3/3] mm: call cond_resched() from deferred_init_memmap()
  2020-04-03 13:35 ` [PATCH v3 3/3] mm: call cond_resched() from deferred_init_memmap() Pavel Tatashin
@ 2020-04-03 13:44   ` Daniel Jordan
  2020-04-03 14:03     ` Pavel Tatashin
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel Jordan @ 2020-04-03 13:44 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, ktkhai, david, jmorris, sashal,
	vbabka

On Fri, Apr 03, 2020 at 09:35:49AM -0400, Pavel Tatashin wrote:
> Now that deferred pages are initialized with interrupts enabled we can
> replace touch_nmi_watchdog() with cond_resched(), as it was before
> 3a2d7fa8a3d5.
...
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4a60f2427eb0..445f74358997 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1816,7 +1816,7 @@ static int __init deferred_init_memmap(void *data)
>  	 */
>  	while (spfn < epfn) {
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> -		touch_nmi_watchdog();
> +		sched_clock();

I think you meant cond_resched()?

With that,
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>



* Re: [PATCH v3 3/3] mm: call cond_resched() from deferred_init_memmap()
  2020-04-03 13:44   ` Daniel Jordan
@ 2020-04-03 14:03     ` Pavel Tatashin
  0 siblings, 0 replies; 6+ messages in thread
From: Pavel Tatashin @ 2020-04-03 14:03 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: LKML, Andrew Morton, Michal Hocko, linux-mm, Dan Williams,
	Shile Zhang, Kirill Tkhai, David Hildenbrand, James Morris,
	Sasha Levin, Vlastimil Babka

> I think you meant cond_resched()?
>
> With that,
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>

Thank you! Of course, I will re-submit quickly!

Pasha


