linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: initialize deferred pages with interrupts enabled
@ 2020-04-01 19:32 Pavel Tatashin
  2020-04-01 19:57 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Pavel Tatashin @ 2020-04-01 19:32 UTC (permalink / raw)
  To: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, pasha.tatashin, ktkhai, david,
	jmorris, sashal

Initializing struct pages is a long task and keeping interrupts disabled
for the duration of this operation introduces a number of problems.

1. jiffies are not updated for a long period of time, and thus incorrect time
   is reported. See proposed solution and discussion here:
   lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
2. It prevents further improving deferred page initialization by allowing
   inter-node multi-threading.

We are keeping interrupts disabled to solve a rather theoretical problem
that was never observed in the real world (see 3a2d7fa8a3d5).

Let's keep interrupts enabled. In case we ever encounter a scenario where
an interrupt thread wants to allocate a large amount of memory this early
in boot, we can deal with that by growing the zone (see
deferred_grow_zone()) by the needed amount before starting the
deferred_init_memmap() threads.

Before:
[    1.232459] node 0 initialised, 12058412 pages in 1ms

After:
[    1.632580] node 0 initialised, 12051227 pages in 436ms

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 mm/page_alloc.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..4498a13b372d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1792,6 +1792,13 @@ static int __init deferred_init_memmap(void *data)
 	BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat));
 	pgdat->first_deferred_pfn = ULONG_MAX;
 
+	/*
+	 * Once we unlock here, the zone cannot be grown anymore, thus if an
+	 * interrupt thread must allocate this early in boot, zone must be
+	 * pre-grown prior to start of deferred page initialization.
+	 */
+	pgdat_resize_unlock(pgdat, &flags);
+
 	/* Only the highest zone is deferred so find it */
 	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 		zone = pgdat->node_zones + zid;
@@ -1812,8 +1819,6 @@ static int __init deferred_init_memmap(void *data)
 	while (spfn < epfn)
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
 zone_empty:
-	pgdat_resize_unlock(pgdat, &flags);
-
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
@@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 		return false;
 
 	pgdat_resize_lock(pgdat, &flags);
-
-	/*
-	 * If deferred pages have been initialized while we were waiting for
-	 * the lock, return true, as the zone was grown.  The caller will retry
-	 * this zone.  We won't return to this function since the caller also
-	 * has this static branch.
-	 */
-	if (!static_branch_unlikely(&deferred_pages)) {
-		pgdat_resize_unlock(pgdat, &flags);
-		return true;
-	}
-
 	/*
 	 * If someone grew this zone while we were waiting for spinlock, return
 	 * true, as there might be enough pages already.
-- 
2.17.1
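
The locking change in the patch above can be sketched in user space. This
is a minimal single-threaded model, not kernel code: struct node_model and
every model_* function are hypothetical names, and the boolean
resize_locked only stands in for pgdat_resize_lock() with interrupts
disabled. It assumes, as the patch comment says, that the zone can no
longer be grown once first_deferred_pfn has been set to ULONG_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a node; not the kernel's pg_data_t. */
struct node_model {
	bool resize_locked;                  /* stands in for the resize lock + IRQs off */
	unsigned long first_deferred_pfn;    /* ULONG_MAX once deferred init claims the node */
	unsigned long end_pfn;
	unsigned long pages_inited_locked;   /* pages initialised while "locked" */
	unsigned long pages_inited_unlocked; /* pages initialised while "unlocked" */
};

static void model_lock(struct node_model *n)
{
	assert(!n->resize_locked);
	n->resize_locked = true;
}

static void model_unlock(struct node_model *n)
{
	assert(n->resize_locked);
	n->resize_locked = false;
}

/* Stand-in for initialising one struct page; records the lock state. */
static void model_init_page(struct node_model *n)
{
	if (n->resize_locked)
		n->pages_inited_locked++;
	else
		n->pages_inited_unlocked++;
}

/* Models deferred_grow_zone(): grow by up to nr pages, entirely under the lock. */
static unsigned long model_grow_zone(struct node_model *n, unsigned long nr)
{
	unsigned long grown = 0;

	model_lock(n);
	/* Once deferred init has claimed the node there is nothing left to grow. */
	if (n->first_deferred_pfn != ULONG_MAX) {
		while (grown < nr && n->first_deferred_pfn < n->end_pfn) {
			model_init_page(n);
			n->first_deferred_pfn++;
			grown++;
		}
	}
	model_unlock(n);
	return grown;
}

/* Models deferred_init_memmap() with the patch applied. */
static unsigned long model_init_memmap(struct node_model *n)
{
	unsigned long pfn, nr = 0;

	model_lock(n);
	pfn = n->first_deferred_pfn;
	n->first_deferred_pfn = ULONG_MAX; /* claim the remainder of the node */
	model_unlock(n);                   /* the patch moves the unlock up to here */

	for (; pfn < n->end_pfn; pfn++) {  /* the long loop now runs unlocked */
		model_init_page(n);
		nr++;
	}
	return nr;
}
```

In this model, only the short claim of first_deferred_pfn happens under the
lock; the bulk of the initialisation is counted in pages_inited_unlocked.
The return values are a simplification and do not mirror what
deferred_grow_zone() returns in the kernel.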



* Re: [PATCH] mm: initialize deferred pages with interrupts enabled
  2020-04-01 19:32 [PATCH] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
@ 2020-04-01 19:57 ` Michal Hocko
  2020-04-01 20:27   ` Pavel Tatashin
  2020-04-01 19:58 ` Michal Hocko
  2020-04-01 20:00 ` Daniel Jordan
  2 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2020-04-01 19:57 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, akpm, linux-mm, dan.j.williams, shile.zhang,
	daniel.m.jordan, ktkhai, david, jmorris, sashal

On Wed 01-04-20 15:32:38, Pavel Tatashin wrote:
> Initializing struct pages is a long task and keeping interrupts disabled
> for the duration of this operation introduces a number of problems.
> 
> 1. jiffies are not updated for a long period of time, and thus incorrect time
>    is reported. See proposed solution and discussion here:
>    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com

http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com

> 2. It prevents further improving deferred page initialization by allowing
>    inter-node multi-threading.
> 
> We are keeping interrupts disabled to solve a rather theoretical problem
> that was never observed in the real world (see 3a2d7fa8a3d5).
> 
> Let's keep interrupts enabled. In case we ever encounter a scenario where
> an interrupt thread wants to allocate a large amount of memory this early
> in boot, we can deal with that by growing the zone (see
> deferred_grow_zone()) by the needed amount before starting the
> deferred_init_memmap() threads.
>
> Before:
> [    1.232459] node 0 initialised, 12058412 pages in 1ms
> 
> After:
> [    1.632580] node 0 initialised, 12051227 pages in 436ms
> 

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>

I would much rather see pgdat_resize_lock completely out of both the
allocator and deferred init path altogether but this can be done in a
separate patch. This one looks slightly safer for stable backports.

To be completely honest I would love to see the resize lock go away
completely. That might need a deeper thought but I believe it is
something that has never been done properly.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: initialize deferred pages with interrupts enabled
  2020-04-01 19:32 [PATCH] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-01 19:57 ` Michal Hocko
@ 2020-04-01 19:58 ` Michal Hocko
  2020-04-01 20:00 ` Daniel Jordan
  2 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2020-04-01 19:58 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, akpm, linux-mm, dan.j.williams, shile.zhang,
	daniel.m.jordan, ktkhai, david, jmorris, sashal, Vlastimil Babka

btw. Cc Vlastimil

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: initialize deferred pages with interrupts enabled
  2020-04-01 19:32 [PATCH] mm: initialize deferred pages with interrupts enabled Pavel Tatashin
  2020-04-01 19:57 ` Michal Hocko
  2020-04-01 19:58 ` Michal Hocko
@ 2020-04-01 20:00 ` Daniel Jordan
  2020-04-01 20:08   ` Daniel Jordan
  2 siblings, 1 reply; 9+ messages in thread
From: Daniel Jordan @ 2020-04-01 20:00 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, ktkhai, david, jmorris, sashal

On Wed, Apr 01, 2020 at 03:32:38PM -0400, Pavel Tatashin wrote:
> Initializing struct pages is a long task and keeping interrupts disabled
> for the duration of this operation introduces a number of problems.
> 
> 1. jiffies are not updated for a long period of time, and thus incorrect time
>    is reported. See proposed solution and discussion here:
>    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
> 2. It prevents further improving deferred page initialization by allowing

                                                                   not allowing
>    inter-node multi-threading.

     intra-node

...
> After:
> [    1.632580] node 0 initialised, 12051227 pages in 436ms

Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>

> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>

Freezing jiffies for a while during boot sounds like stable to me, so

Cc: <stable@vger.kernel.org>    [4.17.x+]


Can you please add a comment to mmzone.h above node_size_lock, something like

         * Must be held any time you expect node_start_pfn,
         * node_present_pages, node_spanned_pages or nr_zones to stay constant.
+        * Also synchronizes pgdat->first_deferred_pfn during deferred page
+        * init.
         ...
        spinlock_t node_size_lock;

> @@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
>  		return false;
>  
>  	pgdat_resize_lock(pgdat, &flags);
> -
> -	/*
> -	 * If deferred pages have been initialized while we were waiting for
> -	 * the lock, return true, as the zone was grown.  The caller will retry
> -	 * this zone.  We won't return to this function since the caller also
> -	 * has this static branch.
> -	 */
> -	if (!static_branch_unlikely(&deferred_pages)) {
> -		pgdat_resize_unlock(pgdat, &flags);
> -		return true;
> -	}
> -

Huh, looks like this wasn't needed even before this change.


The rest looks fine.

Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
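
The jiffies freeze that motivates the stable tag above can be illustrated
with a small user-space sketch. This is a deliberately simplified model,
not kernel code: model_reported_ms() is a hypothetical function, it assumes
hz divides 1000 evenly, and it assumes every timer tick that arrives while
interrupts are disabled is simply lost. Real systems may behave differently
depending on which CPU updates jiffies.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: one timer tick every (1000 / hz) ms, counted only
 * while "interrupts" are enabled. Returns the duration that would be
 * computed from the tick counter for work that really took true_ms.
 */
static unsigned long model_reported_ms(unsigned long true_ms, unsigned long hz,
				       bool irqs_enabled)
{
	unsigned long ms_per_tick = 1000 / hz;
	unsigned long jiffies = 0;
	unsigned long t;

	for (t = ms_per_tick; t <= true_ms; t += ms_per_tick) {
		if (irqs_enabled)
			jiffies++;	/* the tick handler gets to run */
		/* otherwise the tick is missed and the counter stays put */
	}

	return jiffies * ms_per_tick;
}
```

With interrupts enabled the reported duration tracks the real one to within
one tick; with interrupts disabled for the whole interval the model reports
no elapsed time at all, which is the kind of under-reporting the patch
description refers to.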


* Re: [PATCH] mm: initialize deferred pages with interrupts enabled
  2020-04-01 20:00 ` Daniel Jordan
@ 2020-04-01 20:08   ` Daniel Jordan
  2020-04-01 20:31     ` Pavel Tatashin
  2020-04-02  7:36     ` Michal Hocko
  0 siblings, 2 replies; 9+ messages in thread
From: Daniel Jordan @ 2020-04-01 20:08 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, akpm, mhocko, linux-mm, dan.j.williams,
	shile.zhang, daniel.m.jordan, ktkhai, david, jmorris, sashal

On Wed, Apr 01, 2020 at 04:00:27PM -0400, Daniel Jordan wrote:
> On Wed, Apr 01, 2020 at 03:32:38PM -0400, Pavel Tatashin wrote:
> > Initializing struct pages is a long task and keeping interrupts disabled
> > for the duration of this operation introduces a number of problems.
> > 
> > 1. jiffies are not updated for a long period of time, and thus incorrect time
> >    is reported. See proposed solution and discussion here:
> >    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
> > 2. It prevents further improving deferred page initialization by allowing
> 
>                                                                    not allowing
> >    inter-node multi-threading.
> 
>      intra-node
> 
> ...
> > After:
> > [    1.632580] node 0 initialised, 12051227 pages in 436ms
> 
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
> 
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> 
> Freezing jiffies for a while during boot sounds like stable to me, so
> 
> Cc: <stable@vger.kernel.org>    [4.17.x+]
> 
> 
> Can you please add a comment to mmzone.h above node_size_lock, something like
> 
>          * Must be held any time you expect node_start_pfn,
>          * node_present_pages, node_spanned_pages or nr_zones to stay constant.
> +        * Also synchronizes pgdat->first_deferred_pfn during deferred page
> +        * init.
>          ...
>         spinlock_t node_size_lock;
> 
> > @@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> >  		return false;
> >  
> >  	pgdat_resize_lock(pgdat, &flags);
> > -
> > -	/*
> > -	 * If deferred pages have been initialized while we were waiting for
> > -	 * the lock, return true, as the zone was grown.  The caller will retry
> > -	 * this zone.  We won't return to this function since the caller also
> > -	 * has this static branch.
> > -	 */
> > -	if (!static_branch_unlikely(&deferred_pages)) {
> > -		pgdat_resize_unlock(pgdat, &flags);
> > -		return true;
> > -	}
> > -
> 
> Huh, looks like this wasn't needed even before this change.
> 
> 
> The rest looks fine.
> 
> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>

...except that I forgot about the touch_nmi_watchdog() calls.  I think you'd
need something kind of like this before your patch.

---8<---

From: Daniel Jordan <daniel.m.jordan@oracle.com>
Date: Fri, 27 Mar 2020 17:29:05 -0400
Subject: [PATCH] mm: call touch_nmi_watchdog() on max order boundaries in
 deferred init

deferred_init_memmap() disables interrupts the entire time, so it calls
touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
will run with interrupts enabled, at which point cond_resched() should
be used instead.

deferred_grow_zone() makes the same watchdog calls through code shared
with deferred init but will continue to run with interrupts disabled, so
it can't call cond_resched().

Pull the watchdog calls up to these two places to allow the first to be
changed later, independently of the second.  The frequency reduces from
twice per pageblock (init and free) to once per max order block.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/page_alloc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 212734c4f8b0..4cf18c534233 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1639,7 +1639,6 @@ static void __init deferred_free_pages(unsigned long pfn,
 		} else if (!(pfn & nr_pgmask)) {
 			deferred_free_range(pfn - nr_free, nr_free);
 			nr_free = 1;
-			touch_nmi_watchdog();
 		} else {
 			nr_free++;
 		}
@@ -1669,7 +1668,6 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
 			continue;
 		} else if (!page || !(pfn & nr_pgmask)) {
 			page = pfn_to_page(pfn);
-			touch_nmi_watchdog();
 		} else {
 			page++;
 		}
@@ -1813,8 +1811,10 @@ static int __init deferred_init_memmap(void *data)
 	 * that we can avoid introducing any issues with the buddy
 	 * allocator.
 	 */
-	while (spfn < epfn)
+	while (spfn < epfn) {
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		touch_nmi_watchdog();
+	}
 zone_empty:
 	pgdat_resize_unlock(pgdat, &flags);
 
@@ -1908,6 +1908,7 @@ deferred_grow_zone_locked(pg_data_t *pgdat, struct zone *zone,
 		first_deferred_pfn = spfn;
 
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		touch_nmi_watchdog();
 
 		/* We should only stop along section boundaries */
 		if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
-- 
2.25.0
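
The change in call frequency described in the commit message above (twice
per pageblock down to once per max order block) can be checked with a
little arithmetic. The sketch below is a user-space model, not kernel code:
touches_before() and touches_after() are hypothetical helpers, the block
sizes are parameters that must be powers of two, and the model assumes
every pfn in the range is valid.

```c
#include <assert.h>

/*
 * Model of the old behaviour: one watchdog touch at every
 * pageblock-aligned pfn in the init pass and another in the free pass.
 */
static unsigned long touches_before(unsigned long spfn, unsigned long epfn,
				    unsigned long pageblock_pages)
{
	unsigned long pfn, n = 0;

	for (pfn = spfn; pfn < epfn; pfn++)
		if (!(pfn & (pageblock_pages - 1)))
			n += 2;	/* init pass + free pass */
	return n;
}

/*
 * Model of the new behaviour: one watchdog touch per loop iteration,
 * where each iteration advances to the next max order boundary.
 */
static unsigned long touches_after(unsigned long spfn, unsigned long epfn,
				   unsigned long max_order_pages)
{
	unsigned long n = 0;

	while (spfn < epfn) {
		/* round up to the next max order boundary, capped at epfn */
		unsigned long next = (spfn | (max_order_pages - 1)) + 1;

		spfn = next < epfn ? next : epfn;
		n++;
	}
	return n;
}
```

For example, with 512-page pageblocks and 1024-page max order blocks (sizes
chosen here only for illustration), a 4096-page range goes from 16 touches
to 4.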



* Re: [PATCH] mm: initialize deferred pages with interrupts enabled
  2020-04-01 19:57 ` Michal Hocko
@ 2020-04-01 20:27   ` Pavel Tatashin
  2020-04-02  7:34     ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Pavel Tatashin @ 2020-04-01 20:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, Andrew Morton, linux-mm, Dan Williams, Shile Zhang,
	Daniel Jordan, Kirill Tkhai, David Hildenbrand, James Morris,
	Sasha Levin

On Wed, Apr 1, 2020 at 3:57 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 01-04-20 15:32:38, Pavel Tatashin wrote:
> > Initializing struct pages is a long task and keeping interrupts disabled
> > for the duration of this operation introduces a number of problems.
> >
> > 1. jiffies are not updated for a long period of time, and thus incorrect time
> >    is reported. See proposed solution and discussion here:
> >    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
>
> http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com
>
> > 2. It prevents further improving deferred page initialization by allowing
> >    inter-node multi-threading.
> >
> > We are keeping interrupts disabled to solve a rather theoretical problem
> > that was never observed in the real world (see 3a2d7fa8a3d5).
> >
> > Let's keep interrupts enabled. In case we ever encounter a scenario where
> > an interrupt thread wants to allocate a large amount of memory this early
> > in boot, we can deal with that by growing the zone (see
> > deferred_grow_zone()) by the needed amount before starting the
> > deferred_init_memmap() threads.
> >
> > Before:
> > [    1.232459] node 0 initialised, 12058412 pages in 1ms
> >
> > After:
> > [    1.632580] node 0 initialised, 12051227 pages in 436ms
> >
>
> Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
>
> I would much rather see pgdat_resize_lock completely out of both the
> allocator and deferred init path altogether but this can be done in a
> separate patch. This one looks slightly safer for stable backports.

This is what I wanted to do, but after studying deferred_grow_zone(),
I do not see a simple way to solve this. It is one thing to fail an
allocation, and it is another thing to have corruption because of a
race.

> To be completely honest I would love to see the resize lock go away
> completely. That might need a deeper thought but I believe it is
> something that has never been done properly.
>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you,
Pasha




* Re: [PATCH] mm: initialize deferred pages with interrupts enabled
  2020-04-01 20:08   ` Daniel Jordan
@ 2020-04-01 20:31     ` Pavel Tatashin
  2020-04-02  7:36     ` Michal Hocko
  1 sibling, 0 replies; 9+ messages in thread
From: Pavel Tatashin @ 2020-04-01 20:31 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: LKML, Andrew Morton, Michal Hocko, linux-mm, Dan Williams,
	Shile Zhang, Kirill Tkhai, David Hildenbrand, James Morris,
	Sasha Levin

On Wed, Apr 1, 2020 at 4:10 PM Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
>
> On Wed, Apr 01, 2020 at 04:00:27PM -0400, Daniel Jordan wrote:
> > On Wed, Apr 01, 2020 at 03:32:38PM -0400, Pavel Tatashin wrote:
> > > Initializing struct pages is a long task and keeping interrupts disabled
> > > for the duration of this operation introduces a number of problems.
> > >
> > > 1. jiffies are not updated for a long period of time, and thus incorrect time
> > >    is reported. See proposed solution and discussion here:
> > >    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
> > > 2. It prevents further improving deferred page initialization by allowing
> >
> >                                                                    not allowing
> > >    inter-node multi-threading.
> >
> >      intra-node
> >
> > ...
> > > After:
> > > [    1.632580] node 0 initialised, 12051227 pages in 436ms
> >
> > Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> > Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
> >
> > > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> >
> > Freezing jiffies for a while during boot sounds like stable to me, so
> >
> > Cc: <stable@vger.kernel.org>    [4.17.x+]
> >
> >
> > Can you please add a comment to mmzone.h above node_size_lock, something like
> >
> >          * Must be held any time you expect node_start_pfn,
> >          * node_present_pages, node_spanned_pages or nr_zones to stay constant.
> > +        * Also synchronizes pgdat->first_deferred_pfn during deferred page
> > +        * init.
> >          ...
> >         spinlock_t node_size_lock;
> >
> > > @@ -1854,18 +1859,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> > >             return false;
> > >
> > >     pgdat_resize_lock(pgdat, &flags);
> > > -
> > > -   /*
> > > -    * If deferred pages have been initialized while we were waiting for
> > > -    * the lock, return true, as the zone was grown.  The caller will retry
> > > -    * this zone.  We won't return to this function since the caller also
> > > -    * has this static branch.
> > > -    */
> > > -   if (!static_branch_unlikely(&deferred_pages)) {
> > > -           pgdat_resize_unlock(pgdat, &flags);
> > > -           return true;
> > > -   }
> > > -
> >
> > Huh, looks like this wasn't needed even before this change.
> >
> >
> > The rest looks fine.
> >
> > Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>
> ...except that I forgot about the touch_nmi_watchdog() calls.  I think you'd
> need something kind of like this before your patch.

Thank you for the review. You are right; I will add your patch and modify
mine to change touch_nmi_watchdog() to cond_resched().

Pasha
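
The plan above, with the two loops relaxing in different ways once the
watchdog calls have been pulled up into them, can be sketched as follows.
This is a user-space model under stated assumptions, not the follow-up
patch: all model_* names and struct relax_stats are hypothetical, block
sizes must be powers of two, and the counters merely mark where
cond_resched() or touch_nmi_watchdog() would be called.

```c
#include <assert.h>

/* Hypothetical counters marking where each kernel call would sit. */
struct relax_stats {
	unsigned long resched_calls;	/* stand-in for cond_resched() */
	unsigned long watchdog_calls;	/* stand-in for touch_nmi_watchdog() */
};

/* Stand-in for deferred_init_maxorder(): consume up to one max order block. */
static unsigned long model_init_maxorder(unsigned long *spfn,
					 unsigned long epfn,
					 unsigned long block_pages)
{
	unsigned long next = (*spfn | (block_pages - 1)) + 1;
	unsigned long nr;

	if (next > epfn)
		next = epfn;
	nr = next - *spfn;
	*spfn = next;
	return nr;
}

/* Deferred init thread: interrupts enabled, so it may give up the CPU. */
static unsigned long model_deferred_init(unsigned long spfn,
					 unsigned long epfn,
					 unsigned long block_pages,
					 struct relax_stats *st)
{
	unsigned long nr = 0;

	while (spfn < epfn) {
		nr += model_init_maxorder(&spfn, epfn, block_pages);
		st->resched_calls++;
	}
	return nr;
}

/* Zone growth: still runs with interrupts disabled, so it only pets the watchdog. */
static unsigned long model_grow(unsigned long spfn, unsigned long epfn,
				unsigned long block_pages, unsigned long want,
				struct relax_stats *st)
{
	unsigned long nr = 0;

	while (spfn < epfn && nr < want) {
		nr += model_init_maxorder(&spfn, epfn, block_pages);
		st->watchdog_calls++;
	}
	return nr;
}
```

Because each loop owns its own relax call, one can switch to rescheduling
without touching the other, which is the independence the preparatory patch
is meant to provide.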


* Re: [PATCH] mm: initialize deferred pages with interrupts enabled
  2020-04-01 20:27   ` Pavel Tatashin
@ 2020-04-02  7:34     ` Michal Hocko
  0 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2020-04-02  7:34 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: LKML, Andrew Morton, linux-mm, Dan Williams, Shile Zhang,
	Daniel Jordan, Kirill Tkhai, David Hildenbrand, James Morris,
	Sasha Levin

On Wed 01-04-20 16:27:33, Pavel Tatashin wrote:
> On Wed, Apr 1, 2020 at 3:57 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 01-04-20 15:32:38, Pavel Tatashin wrote:
> > > Initializing struct pages is a long task and keeping interrupts disabled
> > > for the duration of this operation introduces a number of problems.
> > >
> > > 1. jiffies are not updated for a long period of time, and thus incorrect time
> > >    is reported. See proposed solution and discussion here:
> > >    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
> >
> > http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com
> >
> > > 2. It prevents further improving deferred page initialization by allowing
> > >    inter-node multi-threading.
> > >
> > > We are keeping interrupts disabled to solve a rather theoretical problem
> > > that was never observed in the real world (see 3a2d7fa8a3d5).
> > >
> > > Let's keep interrupts enabled. In case we ever encounter a scenario where
> > > an interrupt thread wants to allocate a large amount of memory this early
> > > in boot, we can deal with that by growing the zone (see
> > > deferred_grow_zone()) by the needed amount before starting the
> > > deferred_init_memmap() threads.
> > >
> > > Before:
> > > [    1.232459] node 0 initialised, 12058412 pages in 1ms
> > >
> > > After:
> > > [    1.632580] node 0 initialised, 12051227 pages in 436ms
> > >
> >
> > Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
> > > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> >
> > I would much rather see pgdat_resize_lock completely out of both the
> > allocator and deferred init path altogether but this can be done in a
> > separate patch. This one looks slightly safer for stable backports.
> 
> This is what I wanted to do, but after studying deferred_grow_zone(),
> I do not see a simple way to solve this. It is one thing to fail an
> allocation, and it is another thing to have corruption because of a
> race.

Let's discuss deferred_grow_zone after this all settles down. I still
have to study it because I wasn't aware that this is actually a page
allocator path relying on the resize lock. My recollection was that the
resize lock is only about memory hotplug. Your patches flew by and I
didn't have time to review them back then. So I have to admit that my
view of the resize lock was too simplistic.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: initialize deferred pages with interrupts enabled
  2020-04-01 20:08   ` Daniel Jordan
  2020-04-01 20:31     ` Pavel Tatashin
@ 2020-04-02  7:36     ` Michal Hocko
  1 sibling, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2020-04-02  7:36 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Pavel Tatashin, linux-kernel, akpm, linux-mm, dan.j.williams,
	shile.zhang, ktkhai, david, jmorris, sashal

On Wed 01-04-20 16:08:55, Daniel Jordan wrote:
[...]
> From: Daniel Jordan <daniel.m.jordan@oracle.com>
> Date: Fri, 27 Mar 2020 17:29:05 -0400
> Subject: [PATCH] mm: call touch_nmi_watchdog() on max order boundaries in
>  deferred init
> 
> deferred_init_memmap() disables interrupts the entire time, so it calls
> touch_nmi_watchdog() periodically to avoid soft lockup splats.  Soon it
> will run with interrupts enabled, at which point cond_resched() should
> be used instead.
> 
> deferred_grow_zone() makes the same watchdog calls through code shared
> with deferred init but will continue to run with interrupts disabled, so
> it can't call cond_resched().
> 
> Pull the watchdog calls up to these two places to allow the first to be
> changed later, independently of the second.  The frequency reduces from
> twice per pageblock (init and free) to once per max order block.

This makes sense but I am not really sure this is necessary for the
stable backport.

> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

-- 
Michal Hocko
SUSE Labs
