linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2] initialize deferred pages with interrupts enabled
@ 2020-04-01 22:57 Pavel Tatashin
  2020-04-01 22:57 ` [PATCH v2 1/2] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
  2020-04-01 22:57 ` [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
  0 siblings, 2 replies; 15+ messages in thread
From: Pavel Tatashin @ 2020-04-01 22:57 UTC (permalink / raw)
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

Keep interrupts enabled during deferred page initialization in order to
make the code more modular and to allow jiffies to update.

The original approach and discussion can be found here:
https://lore.kernel.org/linux-mm/20200311123848.118638-1-shile.zhang@linux.alibaba.com

Changelog

v2:
- Addressed comments from Daniel Jordan: replaced touch_nmi_watchdog() with
  cond_resched(). Added Reviewed-by and Acked-by tags.

v1:
https://lore.kernel.org/linux-mm/20200401193238.22544-1-pasha.tatashin@soleen.com

Daniel Jordan (1):
  mm: call touch_nmi_watchdog() on max order boundaries in deferred init

Pavel Tatashin (1):
  mm: initialize deferred pages with interrupts enabled

 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 27 +++++++++++----------------
 2 files changed, 13 insertions(+), 16 deletions(-)

-- 
2.17.1


* [PATCH v2 1/2] mm: call touch_nmi_watchdog() on max order boundaries in deferred init
  2020-04-01 22:57 [PATCH v2 0/2] initialize deferred pages with interrupts enabled Pavel Tatashin
@ 2020-04-01 22:57 ` Pavel Tatashin
  2020-04-02  7:23   ` David Hildenbrand
                     ` (2 more replies)
  2020-04-01 22:57 ` [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
  1 sibling, 3 replies; 15+ messages in thread
From: Pavel Tatashin @ 2020-04-01 22:57 UTC (permalink / raw)
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

From: Daniel Jordan <daniel.m.jordan@oracle.com>

deferred_init_memmap() disables interrupts the entire time, so it calls
touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
will run with interrupts enabled, at which point cond_resched() should
be used instead.

deferred_grow_zone() makes the same watchdog calls through code shared
with deferred init but will continue to run with interrupts disabled, so
it can't call cond_resched().

Pull the watchdog calls up to these two places to allow the first to be
changed later, independently of the second.  The frequency reduces from
twice per pageblock (init and free) to once per max order block.

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: stable@vger.kernel.org # 4.17+

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 mm/page_alloc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..e8ff6a176164 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1639,7 +1639,6 @@ static void __init deferred_free_pages(unsigned long pfn,
 		} else if (!(pfn & nr_pgmask)) {
 			deferred_free_range(pfn - nr_free, nr_free);
 			nr_free = 1;
-			touch_nmi_watchdog();
 		} else {
 			nr_free++;
 		}
@@ -1669,7 +1668,6 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
 			continue;
 		} else if (!page || !(pfn & nr_pgmask)) {
 			page = pfn_to_page(pfn);
-			touch_nmi_watchdog();
 		} else {
 			page++;
 		}
@@ -1809,8 +1807,10 @@ static int __init deferred_init_memmap(void *data)
 	 * that we can avoid introducing any issues with the buddy
 	 * allocator.
 	 */
-	while (spfn < epfn)
+	while (spfn < epfn) {
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		touch_nmi_watchdog();
+	}
 zone_empty:
 	pgdat_resize_unlock(pgdat, &flags);
 
@@ -1894,6 +1894,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 		first_deferred_pfn = spfn;
 
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		touch_nmi_watchdog();
 
 		/* We should only stop along section boundaries */
 		if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
-- 
2.17.1


* [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled
  2020-04-01 22:57 [PATCH v2 0/2] initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-01 22:57 ` [PATCH v2 1/2] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
@ 2020-04-01 22:57 ` Pavel Tatashin
  2020-04-02  7:14   ` David Hildenbrand
                     ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Pavel Tatashin @ 2020-04-01 22:57 UTC (permalink / raw)
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal, vbabka

Initializing struct pages is a long task, and keeping interrupts disabled
for the duration of this operation introduces a number of problems.

1. jiffies are not updated for a long period of time, and thus an
   incorrect time is reported. See the proposed solution and discussion
   here:
   lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
2. It prevents further improvements to deferred page initialization,
   such as intra-node multi-threading.

We are keeping interrupts disabled to solve a rather theoretical problem
that was never observed in the real world (see 3a2d7fa8a3d5).

Let's keep interrupts enabled. In case we ever encounter a scenario where
an interrupt thread wants to allocate a large amount of memory this early
in boot, we can deal with that by growing the zone (see
deferred_grow_zone()) by the needed amount before starting the
deferred_init_memmap() threads.

Before:
[    1.232459] node 0 initialised, 12058412 pages in 1ms

After:
[    1.632580] node 0 initialised, 12051227 pages in 436ms

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Cc: stable@vger.kernel.org # 4.17+

Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  2 ++
 mm/page_alloc.c        | 22 ++++++++--------------
 2 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..c5bdf55da034 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -721,6 +721,8 @@ typedef struct pglist_data {
 	/*
 	 * Must be held any time you expect node_start_pfn,
 	 * node_present_pages, node_spanned_pages or nr_zones to stay constant.
+	 * Also synchronizes pgdat->first_deferred_pfn during deferred page
+	 * init.
 	 *
 	 * pgdat_resize_lock() and pgdat_resize_unlock() are provided to
 	 * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e8ff6a176164..68669d3a5a66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1790,6 +1790,13 @@ static int __init deferred_init_memmap(void *data)
 	BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
 	pgdat->first_deferred_pfn = ULONG_MAX;
 
+	/*
+	 * Once we unlock here, the zone cannot be grown anymore, thus if an
+	 * interrupt thread must allocate this early in boot, zone must be
+	 * pre-grown prior to start of deferred page initialization.
+	 */
+	pgdat_resize_unlock(pgdat, &flags);
+
 	/* Only the highest zone is deferred so find it */
 	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 		zone = pgdat->node_zones + zid;
@@ -1809,11 +1816,9 @@ static int __init deferred_init_memmap(void *data)
 	 */
 	while (spfn < epfn) {
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
-		touch_nmi_watchdog();
+		cond_resched();
 	}
 zone_empty:
-	pgdat_resize_unlock(pgdat, &flags);
-
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
@@ -1855,17 +1860,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 
 	pgdat_resize_lock(pgdat, &flags);
 
-	/*
-	 * If deferred pages have been initialized while we were waiting for
-	 * the lock, return true, as the zone was grown.  The caller will retry
-	 * this zone.  We won't return to this function since the caller also
-	 * has this static branch.
-	 */
-	if (!static_branch_unlikely(&deferred_pages)) {
-		pgdat_resize_unlock(pgdat, &flags);
-		return true;
-	}
-
 	/*
 	 * If someone grew this zone while we were waiting for spinlock, return
 	 * true, as there might be enough pages already.
-- 
2.17.1


* Re: [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled
  2020-04-01 22:57 ` [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
@ 2020-04-02  7:14   ` David Hildenbrand
  2020-04-02 15:49     ` Pavel Tatashin
  2020-04-02  7:38   ` David Hildenbrand
  2020-04-02 12:09   ` Vlastimil Babka
  2 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2020-04-02  7:14 UTC (permalink / raw)
  To: Pavel Tatashin, linux-kernel, akpm, mhocko, linux-mm,
	dan.j.williams, shile.zhang, daniel.m.jordan, ktkhai, jmorris,
	sashal, vbabka

On 02.04.20 00:57, Pavel Tatashin wrote:
> Initializing struct pages is a long task, and keeping interrupts disabled
> for the duration of this operation introduces a number of problems.
>
> 1. jiffies are not updated for a long period of time, and thus an
>    incorrect time is reported. See the proposed solution and discussion
>    here:
>    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
> 2. It prevents further improvements to deferred page initialization,
>    such as intra-node multi-threading.
>
> We are keeping interrupts disabled to solve a rather theoretical problem
> that was never observed in the real world (see 3a2d7fa8a3d5).
>
> Let's keep interrupts enabled. In case we ever encounter a scenario where
> an interrupt thread wants to allocate a large amount of memory this early
> in boot, we can deal with that by growing the zone (see
> deferred_grow_zone()) by the needed amount before starting the
> deferred_init_memmap() threads.
> 
> Before:
> [    1.232459] node 0 initialised, 12058412 pages in 1ms
> 
> After:
> [    1.632580] node 0 initialised, 12051227 pages in 436ms
> 
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Cc: stable@vger.kernel.org # 4.17+
> 

Can you please add my details about the use of cond_resched() fixing
detection of RCU stalls?

https://lore.kernel.org/linux-mm/20200401104156.11564-2-david@redhat.com/

In the meantime, I'll give these two patches a churn. Thanks


-- 
Thanks,

David / dhildenb


* Re: [PATCH v2 1/2] mm: call touch_nmi_watchdog() on max order boundaries in deferred init
  2020-04-01 22:57 ` [PATCH v2 1/2] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
@ 2020-04-02  7:23   ` David Hildenbrand
  2020-04-02  7:46   ` Michal Hocko
  2020-04-02 11:36   ` Vlastimil Babka
  2 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2020-04-02  7:23 UTC (permalink / raw)
  To: Pavel Tatashin, linux-kernel, akpm, mhocko, linux-mm,
	dan.j.williams, shile.zhang, daniel.m.jordan, ktkhai, jmorris,
	sashal, vbabka

On 02.04.20 00:57, Pavel Tatashin wrote:
> From: Daniel Jordan <daniel.m.jordan@oracle.com>
> 
> deferred_init_memmap() disables interrupts the entire time, so it calls
> touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
> will run with interrupts enabled, at which point cond_resched() should
> be used instead.
> 
> deferred_grow_zone() makes the same watchdog calls through code shared
> with deferred init but will continue to run with interrupts disabled, so
> it can't call cond_resched().
> 
> Pull the watchdog calls up to these two places to allow the first to be
> changed later, independently of the second.  The frequency reduces from
> twice per pageblock (init and free) to once per max order block.
> 
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Cc: stable@vger.kernel.org # 4.17+
> 
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> ---
>  mm/page_alloc.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..e8ff6a176164 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1639,7 +1639,6 @@ static void __init deferred_free_pages(unsigned long pfn,
>  		} else if (!(pfn & nr_pgmask)) {
>  			deferred_free_range(pfn - nr_free, nr_free);
>  			nr_free = 1;
> -			touch_nmi_watchdog();
>  		} else {
>  			nr_free++;
>  		}
> @@ -1669,7 +1668,6 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
>  			continue;
>  		} else if (!page || !(pfn & nr_pgmask)) {
>  			page = pfn_to_page(pfn);
> -			touch_nmi_watchdog();
>  		} else {
>  			page++;
>  		}
> @@ -1809,8 +1807,10 @@ static int __init deferred_init_memmap(void *data)
>  	 * that we can avoid introducing any issues with the buddy
>  	 * allocator.
>  	 */
> -	while (spfn < epfn)
> +	while (spfn < epfn) {
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +		touch_nmi_watchdog();
> +	}
>  zone_empty:
>  	pgdat_resize_unlock(pgdat, &flags);
>  
> @@ -1894,6 +1894,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
>  		first_deferred_pfn = spfn;
>  
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +		touch_nmi_watchdog();
>  
>  		/* We should only stop along section boundaries */
>  		if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


* Re: [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled
  2020-04-01 22:57 ` [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-02  7:14   ` David Hildenbrand
@ 2020-04-02  7:38   ` David Hildenbrand
  2020-04-02  7:47     ` Michal Hocko
  2020-04-02 12:09   ` Vlastimil Babka
  2 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2020-04-02  7:38 UTC (permalink / raw)
  To: Pavel Tatashin, linux-kernel, akpm, mhocko, linux-mm,
	dan.j.williams, shile.zhang, daniel.m.jordan, ktkhai, jmorris,
	sashal, vbabka

> +	/*
> +	 * Once we unlock here, the zone cannot be grown anymore, thus if an
> +	 * interrupt thread must allocate this early in boot, zone must be
> +	 * pre-grown prior to start of deferred page initialization.
> +	 */
> +	pgdat_resize_unlock(pgdat, &flags);
> +
>  	/* Only the highest zone is deferred so find it */
>  	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>  		zone = pgdat->node_zones + zid;
> @@ -1809,11 +1816,9 @@ static int __init deferred_init_memmap(void *data)
>  	 */
>  	while (spfn < epfn) {
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> -		touch_nmi_watchdog();
> +		cond_resched();

I do wonder if this change is strictly required in this patch (IOW, whether
we could keep calling touch_nmi_watchdog() also without holding a spinlock).

Anyhow, it's the right thing to do.

>  	}
>  zone_empty:
> -	pgdat_resize_unlock(pgdat, &flags);
> -
>  	/* Sanity check that the next zone really is unpopulated */
>  	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>  
> @@ -1855,17 +1860,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
>  
>  	pgdat_resize_lock(pgdat, &flags);
>  
> -	/*
> -	 * If deferred pages have been initialized while we were waiting for
> -	 * the lock, return true, as the zone was grown.  The caller will retry
> -	 * this zone.  We won't return to this function since the caller also
> -	 * has this static branch.
> -	 */
> -	if (!static_branch_unlikely(&deferred_pages)) {
> -		pgdat_resize_unlock(pgdat, &flags);
> -		return true;
> -	}
> -
>  	/*
>  	 * If someone grew this zone while we were waiting for spinlock, return
>  	 * true, as there might be enough pages already.
> 


I think we should also look into cleaning up deferred_grow_zone() next;
we still have that touch_nmi_watchdog() in there, and we should rework
the locking. (I think Michal requested that as well)

For now, this seems to survive my basic testing (RCU stalls gone)

-- 
Thanks,

David / dhildenb


* Re: [PATCH v2 1/2] mm: call touch_nmi_watchdog() on max order boundaries in deferred init
  2020-04-01 22:57 ` [PATCH v2 1/2] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
  2020-04-02  7:23   ` David Hildenbrand
@ 2020-04-02  7:46   ` Michal Hocko
  2020-04-02 11:36   ` Vlastimil Babka
  2 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2020-04-02  7:46 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, akpm, linux-mm, dan.j.williams, shile.zhang,
	daniel.m.jordan, ktkhai, david, jmorris, sashal, vbabka

I have only now noticed that you have reposted.

On Wed 01-04-20 18:57:22, Pavel Tatashin wrote:
> From: Daniel Jordan <daniel.m.jordan@oracle.com>
> 
> deferred_init_memmap() disables interrupts the entire time, so it calls
> touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
> will run with interrupts enabled, at which point cond_resched() should
> be used instead.
> 
> deferred_grow_zone() makes the same watchdog calls through code shared
> with deferred init but will continue to run with interrupts disabled, so
> it can't call cond_resched().
> 
> Pull the watchdog calls up to these two places to allow the first to be
> changed later, independently of the second.  The frequency reduces from
> twice per pageblock (init and free) to once per max order block.
> 
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Cc: stable@vger.kernel.org # 4.17+

This patch is not fixing anything, right? It cleans up the code to make
further changes easier, which is good.

> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..e8ff6a176164 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1639,7 +1639,6 @@ static void __init deferred_free_pages(unsigned long pfn,
>  		} else if (!(pfn & nr_pgmask)) {
>  			deferred_free_range(pfn - nr_free, nr_free);
>  			nr_free = 1;
> -			touch_nmi_watchdog();
>  		} else {
>  			nr_free++;
>  		}
> @@ -1669,7 +1668,6 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
>  			continue;
>  		} else if (!page || !(pfn & nr_pgmask)) {
>  			page = pfn_to_page(pfn);
> -			touch_nmi_watchdog();
>  		} else {
>  			page++;
>  		}
> @@ -1809,8 +1807,10 @@ static int __init deferred_init_memmap(void *data)
>  	 * that we can avoid introducing any issues with the buddy
>  	 * allocator.
>  	 */
> -	while (spfn < epfn)
> +	while (spfn < epfn) {
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +		touch_nmi_watchdog();
> +	}
>  zone_empty:
>  	pgdat_resize_unlock(pgdat, &flags);
>  
> @@ -1894,6 +1894,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
>  		first_deferred_pfn = spfn;
>  
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +		touch_nmi_watchdog();
>  
>  		/* We should only stop along section boundaries */
>  		if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
> -- 
> 2.17.1
> 

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled
  2020-04-02  7:38   ` David Hildenbrand
@ 2020-04-02  7:47     ` Michal Hocko
  2020-04-02 15:13       ` Pavel Tatashin
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2020-04-02  7:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pavel Tatashin, linux-kernel, akpm, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, ktkhai, jmorris, sashal, vbabka

On Thu 02-04-20 09:38:54, David Hildenbrand wrote:
> > +	/*
> > +	 * Once we unlock here, the zone cannot be grown anymore, thus if an
> > +	 * interrupt thread must allocate this early in boot, zone must be
> > +	 * pre-grown prior to start of deferred page initialization.
> > +	 */
> > +	pgdat_resize_unlock(pgdat, &flags);
> > +
> >  	/* Only the highest zone is deferred so find it */
> >  	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> >  		zone = pgdat->node_zones + zid;
> > @@ -1809,11 +1816,9 @@ static int __init deferred_init_memmap(void *data)
> >  	 */
> >  	while (spfn < epfn) {
> >  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > -		touch_nmi_watchdog();
> > +		cond_resched();
> 
> I do wonder if this change is strictly required in this patch (IOW, if
> we could keep calling touch_nmi_watchdog() also without holding a spinlock)

Exactly. I would go with your patch on top.
 
> Anyhow, it's the right thing to do.
> 
> >  	}
> >  zone_empty:
> > -	pgdat_resize_unlock(pgdat, &flags);
> > -
> >  	/* Sanity check that the next zone really is unpopulated */
> >  	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
> >  
> > @@ -1855,17 +1860,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> >  
> >  	pgdat_resize_lock(pgdat, &flags);
> >  
> > -	/*
> > -	 * If deferred pages have been initialized while we were waiting for
> > -	 * the lock, return true, as the zone was grown.  The caller will retry
> > -	 * this zone.  We won't return to this function since the caller also
> > -	 * has this static branch.
> > -	 */
> > -	if (!static_branch_unlikely(&deferred_pages)) {
> > -		pgdat_resize_unlock(pgdat, &flags);
> > -		return true;
> > -	}
> > -
> >  	/*
> >  	 * If someone grew this zone while we were waiting for spinlock, return
> >  	 * true, as there might be enough pages already.
> > 
> 
> 
> I think we should also look into cleaning up deferred_grow_zone() next;
> we still have that touch_nmi_watchdog() in there, and we should rework
> the locking. (I think Michal requested that as well)
> 
> For now, this seems to survive my basic testing (RCU stalls gone)
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2 1/2] mm: call touch_nmi_watchdog() on max order boundaries in deferred init
  2020-04-01 22:57 ` [PATCH v2 1/2] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
  2020-04-02  7:23   ` David Hildenbrand
  2020-04-02  7:46   ` Michal Hocko
@ 2020-04-02 11:36   ` Vlastimil Babka
  2 siblings, 0 replies; 15+ messages in thread
From: Vlastimil Babka @ 2020-04-02 11:36 UTC (permalink / raw)
  To: Pavel Tatashin, linux-kernel, akpm, mhocko, linux-mm,
	dan.j.williams, shile.zhang, daniel.m.jordan, ktkhai, david,
	jmorris, sashal

On 4/2/20 12:57 AM, Pavel Tatashin wrote:
> From: Daniel Jordan <daniel.m.jordan@oracle.com>
> 
> deferred_init_memmap() disables interrupts the entire time, so it calls
> touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
> will run with interrupts enabled, at which point cond_resched() should
> be used instead.
> 
> deferred_grow_zone() makes the same watchdog calls through code shared
> with deferred init but will continue to run with interrupts disabled, so
> it can't call cond_resched().
> 
> Pull the watchdog calls up to these two places to allow the first to be
> changed later, independently of the second.  The frequency reduces from
> twice per pageblock (init and free) to once per max order block.
> 
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Cc: stable@vger.kernel.org # 4.17+
> 
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..e8ff6a176164 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1639,7 +1639,6 @@ static void __init deferred_free_pages(unsigned long pfn,
>  		} else if (!(pfn & nr_pgmask)) {
>  			deferred_free_range(pfn - nr_free, nr_free);
>  			nr_free = 1;
> -			touch_nmi_watchdog();
>  		} else {
>  			nr_free++;
>  		}
> @@ -1669,7 +1668,6 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
>  			continue;
>  		} else if (!page || !(pfn & nr_pgmask)) {
>  			page = pfn_to_page(pfn);
> -			touch_nmi_watchdog();
>  		} else {
>  			page++;
>  		}
> @@ -1809,8 +1807,10 @@ static int __init deferred_init_memmap(void *data)
>  	 * that we can avoid introducing any issues with the buddy
>  	 * allocator.
>  	 */
> -	while (spfn < epfn)
> +	while (spfn < epfn) {
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +		touch_nmi_watchdog();
> +	}
>  zone_empty:
>  	pgdat_resize_unlock(pgdat, &flags);
>  
> @@ -1894,6 +1894,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
>  		first_deferred_pfn = spfn;
>  
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +		touch_nmi_watchdog();
>  
>  		/* We should only stop along section boundaries */
>  		if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
> 


* Re: [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled
  2020-04-01 22:57 ` [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-02  7:14   ` David Hildenbrand
  2020-04-02  7:38   ` David Hildenbrand
@ 2020-04-02 12:09   ` Vlastimil Babka
  2020-04-02 15:05     ` Pavel Tatashin
  2 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2020-04-02 12:09 UTC (permalink / raw)
  To: Pavel Tatashin, linux-kernel, akpm, mhocko, linux-mm,
	dan.j.williams, shile.zhang, daniel.m.jordan, ktkhai, david,
	jmorris, sashal

On 4/2/20 12:57 AM, Pavel Tatashin wrote:
> Initializing struct pages is a long task, and keeping interrupts disabled
> for the duration of this operation introduces a number of problems.
>
> 1. jiffies are not updated for a long period of time, and thus an
>    incorrect time is reported. See the proposed solution and discussion
>    here:
>    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
> 2. It prevents further improvements to deferred page initialization,
>    such as intra-node multi-threading.
>
> We are keeping interrupts disabled to solve a rather theoretical problem
> that was never observed in the real world (see 3a2d7fa8a3d5).
>
> Let's keep interrupts enabled. In case we ever encounter a scenario where
> an interrupt thread wants to allocate a large amount of memory this early
> in boot, we can deal with that by growing the zone (see
> deferred_grow_zone()) by the needed amount before starting the
> deferred_init_memmap() threads.
> 
> Before:
> [    1.232459] node 0 initialised, 12058412 pages in 1ms
> 
> After:
> [    1.632580] node 0 initialised, 12051227 pages in 436ms
> 
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Cc: stable@vger.kernel.org # 4.17+

TBH I don't remember my concern anymore. Reading my mail now [1], it seems
I was thinking the problem could happen not just in interrupt context, but
with other kthreads as well.
Anyway, I agree with the approach of waiting for actual issues to be
reported and then eventually pre-growing more.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

[1] https://lore.kernel.org/linux-mm/33e3a3ff-0318-1a07-3c57-6be638046c87@suse.cz/

> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/page_alloc.c        | 22 ++++++++--------------
>  2 files changed, 10 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 462f6873905a..c5bdf55da034 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -721,6 +721,8 @@ typedef struct pglist_data {
>  	/*
>  	 * Must be held any time you expect node_start_pfn,
>  	 * node_present_pages, node_spanned_pages or nr_zones to stay constant.
> +	 * Also synchronizes pgdat->first_deferred_pfn during deferred page
> +	 * init.
>  	 *
>  	 * pgdat_resize_lock() and pgdat_resize_unlock() are provided to
>  	 * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e8ff6a176164..68669d3a5a66 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1790,6 +1790,13 @@ static int __init deferred_init_memmap(void *data)
>  	BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
>  	pgdat->first_deferred_pfn = ULONG_MAX;
>  
> +	/*
> +	 * Once we unlock here, the zone cannot be grown anymore, thus if an
> +	 * interrupt thread must allocate this early in boot, zone must be
> +	 * pre-grown prior to start of deferred page initialization.
> +	 */
> +	pgdat_resize_unlock(pgdat, &flags);
> +
>  	/* Only the highest zone is deferred so find it */
>  	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>  		zone = pgdat->node_zones + zid;
> @@ -1809,11 +1816,9 @@ static int __init deferred_init_memmap(void *data)
>  	 */
>  	while (spfn < epfn) {
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> -		touch_nmi_watchdog();
> +		cond_resched();
>  	}
>  zone_empty:
> -	pgdat_resize_unlock(pgdat, &flags);
> -
>  	/* Sanity check that the next zone really is unpopulated */
>  	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>  
> @@ -1855,17 +1860,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
>  
>  	pgdat_resize_lock(pgdat, &flags);
>  
> -	/*
> -	 * If deferred pages have been initialized while we were waiting for
> -	 * the lock, return true, as the zone was grown.  The caller will retry
> -	 * this zone.  We won't return to this function since the caller also
> -	 * has this static branch.
> -	 */
> -	if (!static_branch_unlikely(&deferred_pages)) {
> -		pgdat_resize_unlock(pgdat, &flags);
> -		return true;
> -	}
> -
>  	/*
>  	 * If someone grew this zone while we were waiting for spinlock, return
>  	 * true, as there might be enough pages already.
> 


* Re: [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled
  2020-04-02 12:09   ` Vlastimil Babka
@ 2020-04-02 15:05     ` Pavel Tatashin
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Tatashin @ 2020-04-02 15:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: LKML, Andrew Morton, Michal Hocko, linux-mm, Dan Williams,
	Shile Zhang, Daniel Jordan, Kirill Tkhai, David Hildenbrand,
	James Morris, Sasha Levin

> TBH I don't remember my concern anymore. Reading my mail now [1], it seems
> I was thinking the problem could happen not just in interrupt context, but
> with other kthreads as well.
> Anyway, I agree with the approach of waiting for actual issues to be
> reported and then eventually pre-growing more.
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thank you.

Pasha

* Re: [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled
  2020-04-02  7:47     ` Michal Hocko
@ 2020-04-02 15:13       ` Pavel Tatashin
  2020-04-02 17:16         ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Tatashin @ 2020-04-02 15:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, LKML, Andrew Morton, linux-mm, Dan Williams,
	Shile Zhang, Daniel Jordan, Kirill Tkhai, James Morris,
	Sasha Levin, Vlastimil Babka

> > I do wonder if this change is strictly required in this patch (IOW, if
> > we could keep calling touch_nmi_watchdog() also without holding a spinlock)
>
> Exactly. I would go with your patch on top.
>
> > Anyhow, it's the right thing to do.

Michal,

The reason I changed it here is that the original patch, which this
patch fixes, changed cond_resched() to touch_nmi_watchdog():
$ git show 3a2d7fa8a3d5 | grep -E '(nmi|sched)'
- cond_resched();
+ touch_nmi_watchdog();
- cond_resched();
+ touch_nmi_watchdog();

So, should I move it to a separate patch or is it OK to keep it here?

Thank you,
Pasha


* Re: [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled
  2020-04-02  7:14   ` David Hildenbrand
@ 2020-04-02 15:49     ` Pavel Tatashin
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Tatashin @ 2020-04-02 15:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: LKML, Andrew Morton, Michal Hocko, linux-mm, Dan Williams,
	Shile Zhang, Daniel Jordan, Kirill Tkhai, James Morris,
	Sasha Levin, Vlastimil Babka

> Can you please add my details about the use of cond_resched() fixing
> detection of RCU stalls?
>
> https://lore.kernel.org/linux-mm/20200401104156.11564-2-david@redhat.com/
>
> In the meantime, I'll give these two patches a churn. Thanks

Hi David,

I will add your RCU details in the next revision.

Thank you,
Pasha


* Re: [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled
  2020-04-02 15:13       ` Pavel Tatashin
@ 2020-04-02 17:16         ` Michal Hocko
  2020-04-02 18:25           ` Pavel Tatashin
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2020-04-02 17:16 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: David Hildenbrand, LKML, Andrew Morton, linux-mm, Dan Williams,
	Shile Zhang, Daniel Jordan, Kirill Tkhai, James Morris,
	Sasha Levin, Vlastimil Babka

On Thu 02-04-20 11:13:49, Pavel Tatashin wrote:
> > > I do wonder if this change is strictly required in this patch (IOW, if
> > > we could keep calling touch_nmi_watchdog() also without holding a spinlock)
> >
> > Exactly. I would go with your patch on top.
> >
> > > Anyhow, it's the right thing to do.
> 
> Michal,
> 
> The reason I changed it here is that the original patch, which this
> patch fixes, changed cond_resched() to touch_nmi_watchdog():
> $ git show 3a2d7fa8a3d5 | grep -E '(nmi|sched)'
> - cond_resched();
> + touch_nmi_watchdog();
> - cond_resched();
> + touch_nmi_watchdog();
> 
> So, should I move it to a separate patch or is it OK to keep it here?

Having it in a separate patch would be better IMO. If for nothing else, the
RCU stall would be easier to see.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled
  2020-04-02 17:16         ` Michal Hocko
@ 2020-04-02 18:25           ` Pavel Tatashin
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Tatashin @ 2020-04-02 18:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, LKML, Andrew Morton, linux-mm, Dan Williams,
	Shile Zhang, Daniel Jordan, Kirill Tkhai, James Morris,
	Sasha Levin, Vlastimil Babka

> > So, should I move it to a separate patch or is it OK to keep it here?
>
> Having it in a separate patch would be better IMO. If for nothing else, the
> RCU stall would be easier to see.

Sure, I will submit it as a separate patch.

Thank you,
Pasha


end of thread (newest message: 2020-04-02 18:26 UTC)

Thread overview: 15+ messages
2020-04-01 22:57 [PATCH v2 0/2] initialize deferred pages with interrupts enabled Pavel Tatashin
2020-04-01 22:57 ` [PATCH v2 1/2] mm: call touch_nmi_watchdog() on max order boundaries in deferred init Pavel Tatashin
2020-04-02  7:23   ` David Hildenbrand
2020-04-02  7:46   ` Michal Hocko
2020-04-02 11:36   ` Vlastimil Babka
2020-04-01 22:57 ` [PATCH v2 2/2] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
2020-04-02  7:14   ` David Hildenbrand
2020-04-02 15:49     ` Pavel Tatashin
2020-04-02  7:38   ` David Hildenbrand
2020-04-02  7:47     ` Michal Hocko
2020-04-02 15:13       ` Pavel Tatashin
2020-04-02 17:16         ` Michal Hocko
2020-04-02 18:25           ` Pavel Tatashin
2020-04-02 12:09   ` Vlastimil Babka
2020-04-02 15:05     ` Pavel Tatashin
