linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH STABLE 4.9] mm: Avoid calling build_all_zonelists_init under hotplug context
@ 2020-08-18 11:00 Oscar Salvador
  2020-08-18 12:24 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Oscar Salvador @ 2020-08-18 11:00 UTC (permalink / raw)
  To: stable
  Cc: linux-kernel, linux-mm, mhocko, vbabka, david, Oscar Salvador,
	Vlastimil Babka

Recently a customer of ours experienced a crash when booting the
system while enabling memory-hotplug.

The problem is that Normal zones on different nodes don't get their private
zone->pageset allocated, and keep sharing the initial boot_pageset.
The sharing between zones is normally safe as explained by the comment for
boot_pageset - it's a percpu structure, and manipulations are done with
disabled interrupts, and boot_pageset is set up in a way that any page placed
on its pcplist is immediately flushed to shared zone's freelist, because
pcp->high == 1.
However, the hotplug operation updates pcp->high to a higher value as it
expects to be operating on a private pageset.

The problem is in build_all_zonelists(), which is called when the first range
of pages is onlined for the Normal zone of node X or Y:

	if (system_state == SYSTEM_BOOTING) {
		build_all_zonelists_init();
	} else {
	#ifdef CONFIG_MEMORY_HOTPLUG
		if (zone)
			setup_zone_pageset(zone);
	#endif
		/* we have to stop all cpus to guarantee there is no user
		of zonelist */
		stop_machine(__build_all_zonelists, pgdat, NULL);
		/* cpuset refresh routine should be here */
	}

When called during hotplug, it should execute the setup_zone_pageset(zone)
which allocates the private pageset.
However, with memhp_default_state=online, this happens early while
system_state == SYSTEM_BOOTING is still true, hence this step is skipped.
(and build_all_zonelists_init() is probably unsafe anyway at this point).

Another hotplug operation on the same zone then leads to zone_pcp_update(zone)
called from online_pages(), which updates the pcp->high for the shared
boot_pageset to a value higher than 1.
At that point, pages freed from Node X and Y Normal zones can end up on the same
pcplist and from there they can be freed to the wrong zone's freelist,
leading to the corruption and crashes.

Please, note that upstream has fixed that differently (and unintentionally) by
adding another boot state (SYSTEM_SCHEDULING), which is set before smp_init().
That should happen before memory hotplug events even with memhp_default_state=online.
Backporting that would be too intrusive.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Debugged-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  3 ++-
 init/main.c            |  2 +-
 mm/memory_hotplug.c    | 10 +++++-----
 mm/page_alloc.c        |  7 ++++---
 4 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e3d7754f25f0..5c7645e156a5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -756,7 +756,8 @@ static inline bool is_dev_zone(const struct zone *zone)
 #include <linux/memory_hotplug.h>
 
 extern struct mutex zonelists_mutex;
-void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
+void build_all_zonelists(pg_data_t *pgdat, struct zone *zone,
+			 bool hotplug_context);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
 bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			 int classzone_idx, unsigned int alloc_flags,
diff --git a/init/main.c b/init/main.c
index d47860dbe896..7ad08957dd18 100644
--- a/init/main.c
+++ b/init/main.c
@@ -512,7 +512,7 @@ asmlinkage __visible void __init start_kernel(void)
 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
 	boot_cpu_hotplug_init();
 
-	build_all_zonelists(NULL, NULL);
+	build_all_zonelists(NULL, NULL, false);
 	page_alloc_init();
 
 	pr_notice("Kernel command line: %s\n", boot_command_line);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 449999657c0b..a4ffe5996317 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1125,7 +1125,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	mutex_lock(&zonelists_mutex);
 	if (!populated_zone(zone)) {
 		need_zonelists_rebuild = 1;
-		build_all_zonelists(NULL, zone);
+		build_all_zonelists(NULL, zone, true);
 	}
 
 	ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
@@ -1146,7 +1146,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	if (onlined_pages) {
 		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
-			build_all_zonelists(NULL, NULL);
+			build_all_zonelists(NULL, NULL, true);
 		else
 			zone_pcp_update(zone);
 	}
@@ -1220,7 +1220,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 	 * to access not-initialized zonelist, build here.
 	 */
 	mutex_lock(&zonelists_mutex);
-	build_all_zonelists(pgdat, NULL);
+	build_all_zonelists(pgdat, NULL, true);
 	mutex_unlock(&zonelists_mutex);
 
 	/*
@@ -1276,7 +1276,7 @@ int try_online_node(int nid)
 
 	if (pgdat->node_zonelists->_zonerefs->zone == NULL) {
 		mutex_lock(&zonelists_mutex);
-		build_all_zonelists(NULL, NULL);
+		build_all_zonelists(NULL, NULL, true);
 		mutex_unlock(&zonelists_mutex);
 	}
 
@@ -2016,7 +2016,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	if (!populated_zone(zone)) {
 		zone_pcp_reset(zone);
 		mutex_lock(&zonelists_mutex);
-		build_all_zonelists(NULL, NULL);
+		build_all_zonelists(NULL, NULL, true);
 		mutex_unlock(&zonelists_mutex);
 	} else
 		zone_pcp_update(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index de00e0fec484..f394dd87fa03 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4608,7 +4608,7 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
 			user_zonelist_order = oldval;
 		} else if (oldval != user_zonelist_order) {
 			mutex_lock(&zonelists_mutex);
-			build_all_zonelists(NULL, NULL);
+			build_all_zonelists(NULL, NULL, false);
 			mutex_unlock(&zonelists_mutex);
 		}
 	}
@@ -4988,11 +4988,12 @@ build_all_zonelists_init(void)
  * (2) call of __init annotated helper build_all_zonelists_init
  * [protected by SYSTEM_BOOTING].
  */
-void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
+void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone,
+			       bool hotplug_context)
 {
 	set_zonelist_order();
 
-	if (system_state == SYSTEM_BOOTING) {
+	if (system_state == SYSTEM_BOOTING && !hotplug_context) {
 		build_all_zonelists_init();
 	} else {
 #ifdef CONFIG_MEMORY_HOTPLUG
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH STABLE 4.9] mm: Avoid calling build_all_zonelists_init under hotplug context
  2020-08-18 11:00 [PATCH STABLE 4.9] mm: Avoid calling build_all_zonelists_init under hotplug context Oscar Salvador
@ 2020-08-18 12:24 ` Michal Hocko
  2020-08-20  8:14   ` Greg KH
  2020-08-18 12:26 ` David Hildenbrand
  2020-08-20  8:39 ` Patch "mm: Avoid calling build_all_zonelists_init under hotplug context" has been added to the 4.9-stable tree gregkh
  2 siblings, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2020-08-18 12:24 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: stable, linux-kernel, linux-mm, vbabka, david, Vlastimil Babka

On Tue 18-08-20 13:00:46, Oscar Salvador wrote:
> Recently a customer of ours experienced a crash when booting the
> system while enabling memory-hotplug.
> 
> The problem is that Normal zones on different nodes don't get their private
> zone->pageset allocated, and keep sharing the initial boot_pageset.
> The sharing between zones is normally safe as explained by the comment for
> boot_pageset - it's a percpu structure, and manipulations are done with
> disabled interrupts, and boot_pageset is set up in a way that any page placed
> on its pcplist is immediately flushed to shared zone's freelist, because
> pcp->high == 1.
> However, the hotplug operation updates pcp->high to a higher value as it
> expects to be operating on a private pageset.
> 
> The problem is in build_all_zonelists(), which is called when the first range
> of pages is onlined for the Normal zone of node X or Y:
> 
> 	if (system_state == SYSTEM_BOOTING) {
> 		build_all_zonelists_init();
> 	} else {
> 	#ifdef CONFIG_MEMORY_HOTPLUG
> 		if (zone)
> 			setup_zone_pageset(zone);
> 	#endif
> 		/* we have to stop all cpus to guarantee there is no user
> 		of zonelist */
> 		stop_machine(__build_all_zonelists, pgdat, NULL);
> 		/* cpuset refresh routine should be here */
> 	}
> 
> When called during hotplug, it should execute the setup_zone_pageset(zone)
> which allocates the private pageset.
> However, with memhp_default_state=online, this happens early while
> system_state == SYSTEM_BOOTING is still true, hence this step is skipped.
> (and build_all_zonelists_init() is probably unsafe anyway at this point).
> 
> Another hotplug operation on the same zone then leads to zone_pcp_update(zone)
> called from online_pages(), which updates the pcp->high for the shared
> boot_pageset to a value higher than 1.
> At that point, pages freed from Node X and Y Normal zones can end up on the same
> pcplist and from there they can be freed to the wrong zone's freelist,
> leading to the corruption and crashes.
> 
> Please, note that upstream has fixed that differently (and unintentionally) by
> adding another boot state (SYSTEM_SCHEDULING), which is set before smp_init().
> That should happen before memory hotplug events even with memhp_default_state=online.
> Backporting that would be too intrusive.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Debugged-by: Vlastimil Babka <vbabka@suse.cz>

Yes, I believe this is the easiest and the least scary way to fix the
issue for stable kernel users. Feel free to add
Acked-by: Michal Hocko <mhocko@suse.com> # for stable trees

for that purpose.

Thanks a lot!

> ---
>  include/linux/mmzone.h |  3 ++-
>  init/main.c            |  2 +-
>  mm/memory_hotplug.c    | 10 +++++-----
>  mm/page_alloc.c        |  7 ++++---
>  4 files changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e3d7754f25f0..5c7645e156a5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -756,7 +756,8 @@ static inline bool is_dev_zone(const struct zone *zone)
>  #include <linux/memory_hotplug.h>
>  
>  extern struct mutex zonelists_mutex;
> -void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
> +void build_all_zonelists(pg_data_t *pgdat, struct zone *zone,
> +			 bool hotplug_context);
>  void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
>  bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  			 int classzone_idx, unsigned int alloc_flags,
> diff --git a/init/main.c b/init/main.c
> index d47860dbe896..7ad08957dd18 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -512,7 +512,7 @@ asmlinkage __visible void __init start_kernel(void)
>  	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
>  	boot_cpu_hotplug_init();
>  
> -	build_all_zonelists(NULL, NULL);
> +	build_all_zonelists(NULL, NULL, false);
>  	page_alloc_init();
>  
>  	pr_notice("Kernel command line: %s\n", boot_command_line);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 449999657c0b..a4ffe5996317 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1125,7 +1125,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	mutex_lock(&zonelists_mutex);
>  	if (!populated_zone(zone)) {
>  		need_zonelists_rebuild = 1;
> -		build_all_zonelists(NULL, zone);
> +		build_all_zonelists(NULL, zone, true);
>  	}
>  
>  	ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
> @@ -1146,7 +1146,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  	if (onlined_pages) {
>  		node_states_set_node(nid, &arg);
>  		if (need_zonelists_rebuild)
> -			build_all_zonelists(NULL, NULL);
> +			build_all_zonelists(NULL, NULL, true);
>  		else
>  			zone_pcp_update(zone);
>  	}
> @@ -1220,7 +1220,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
>  	 * to access not-initialized zonelist, build here.
>  	 */
>  	mutex_lock(&zonelists_mutex);
> -	build_all_zonelists(pgdat, NULL);
> +	build_all_zonelists(pgdat, NULL, true);
>  	mutex_unlock(&zonelists_mutex);
>  
>  	/*
> @@ -1276,7 +1276,7 @@ int try_online_node(int nid)
>  
>  	if (pgdat->node_zonelists->_zonerefs->zone == NULL) {
>  		mutex_lock(&zonelists_mutex);
> -		build_all_zonelists(NULL, NULL);
> +		build_all_zonelists(NULL, NULL, true);
>  		mutex_unlock(&zonelists_mutex);
>  	}
>  
> @@ -2016,7 +2016,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	if (!populated_zone(zone)) {
>  		zone_pcp_reset(zone);
>  		mutex_lock(&zonelists_mutex);
> -		build_all_zonelists(NULL, NULL);
> +		build_all_zonelists(NULL, NULL, true);
>  		mutex_unlock(&zonelists_mutex);
>  	} else
>  		zone_pcp_update(zone);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index de00e0fec484..f394dd87fa03 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4608,7 +4608,7 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>  			user_zonelist_order = oldval;
>  		} else if (oldval != user_zonelist_order) {
>  			mutex_lock(&zonelists_mutex);
> -			build_all_zonelists(NULL, NULL);
> +			build_all_zonelists(NULL, NULL, false);
>  			mutex_unlock(&zonelists_mutex);
>  		}
>  	}
> @@ -4988,11 +4988,12 @@ build_all_zonelists_init(void)
>   * (2) call of __init annotated helper build_all_zonelists_init
>   * [protected by SYSTEM_BOOTING].
>   */
> -void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
> +void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone,
> +			       bool hotplug_context)
>  {
>  	set_zonelist_order();
>  
> -	if (system_state == SYSTEM_BOOTING) {
> +	if (system_state == SYSTEM_BOOTING && !hotplug_context) {
>  		build_all_zonelists_init();
>  	} else {
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -- 
> 2.26.2
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH STABLE 4.9] mm: Avoid calling build_all_zonelists_init under hotplug context
  2020-08-18 11:00 [PATCH STABLE 4.9] mm: Avoid calling build_all_zonelists_init under hotplug context Oscar Salvador
  2020-08-18 12:24 ` Michal Hocko
@ 2020-08-18 12:26 ` David Hildenbrand
  2020-08-20  8:39 ` Patch "mm: Avoid calling build_all_zonelists_init under hotplug context" has been added to the 4.9-stable tree gregkh
  2 siblings, 0 replies; 5+ messages in thread
From: David Hildenbrand @ 2020-08-18 12:26 UTC (permalink / raw)
  To: Oscar Salvador, stable
  Cc: linux-kernel, linux-mm, mhocko, vbabka, Vlastimil Babka

On 18.08.20 13:00, Oscar Salvador wrote:
> Recently a customer of ours experienced a crash when booting the
> system while enabling memory-hotplug.
> 
> The problem is that Normal zones on different nodes don't get their private
> zone->pageset allocated, and keep sharing the initial boot_pageset.
> The sharing between zones is normally safe as explained by the comment for
> boot_pageset - it's a percpu structure, and manipulations are done with
> disabled interrupts, and boot_pageset is set up in a way that any page placed
> on its pcplist is immediately flushed to shared zone's freelist, because
> pcp->high == 1.
> However, the hotplug operation updates pcp->high to a higher value as it
> expects to be operating on a private pageset.
> 
> The problem is in build_all_zonelists(), which is called when the first range
> of pages is onlined for the Normal zone of node X or Y:
> 
> 	if (system_state == SYSTEM_BOOTING) {
> 		build_all_zonelists_init();
> 	} else {
> 	#ifdef CONFIG_MEMORY_HOTPLUG
> 		if (zone)
> 			setup_zone_pageset(zone);
> 	#endif
> 		/* we have to stop all cpus to guarantee there is no user
> 		of zonelist */
> 		stop_machine(__build_all_zonelists, pgdat, NULL);
> 		/* cpuset refresh routine should be here */
> 	}
> 
> When called during hotplug, it should execute the setup_zone_pageset(zone)
> which allocates the private pageset.
> However, with memhp_default_state=online, this happens early while
> system_state == SYSTEM_BOOTING is still true, hence this step is skipped.
> (and build_all_zonelists_init() is probably unsafe anyway at this point).
> 
> Another hotplug operation on the same zone then leads to zone_pcp_update(zone)
> called from online_pages(), which updates the pcp->high for the shared
> boot_pageset to a value higher than 1.
> At that point, pages freed from Node X and Y Normal zones can end up on the same
> pcplist and from there they can be freed to the wrong zone's freelist,
> leading to the corruption and crashes.
> 
> Please, note that upstream has fixed that differently (and unintentionally) by
> adding another boot state (SYSTEM_SCHEDULING), which is set before smp_init().
> That should happen before memory hotplug events even with memhp_default_state=online.
> Backporting that would be too intrusive.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Debugged-by: Vlastimil Babka <vbabka@suse.cz>

So, we have ACPI running and already adding DIMMs while booting? Crazy.

Looks sane to me. Thanks!


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH STABLE 4.9] mm: Avoid calling build_all_zonelists_init under hotplug context
  2020-08-18 12:24 ` Michal Hocko
@ 2020-08-20  8:14   ` Greg KH
  0 siblings, 0 replies; 5+ messages in thread
From: Greg KH @ 2020-08-20  8:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oscar Salvador, stable, linux-kernel, linux-mm, vbabka, david,
	Vlastimil Babka

On Tue, Aug 18, 2020 at 02:24:46PM +0200, Michal Hocko wrote:
> On Tue 18-08-20 13:00:46, Oscar Salvador wrote:
> > Recently a customer of ours experienced a crash when booting the
> > system while enabling memory-hotplug.
> > 
> > The problem is that Normal zones on different nodes don't get their private
> > zone->pageset allocated, and keep sharing the initial boot_pageset.
> > The sharing between zones is normally safe as explained by the comment for
> > boot_pageset - it's a percpu structure, and manipulations are done with
> > disabled interrupts, and boot_pageset is set up in a way that any page placed
> > on its pcplist is immediately flushed to shared zone's freelist, because
> > pcp->high == 1.
> > However, the hotplug operation updates pcp->high to a higher value as it
> > expects to be operating on a private pageset.
> > 
> > The problem is in build_all_zonelists(), which is called when the first range
> > of pages is onlined for the Normal zone of node X or Y:
> > 
> > 	if (system_state == SYSTEM_BOOTING) {
> > 		build_all_zonelists_init();
> > 	} else {
> > 	#ifdef CONFIG_MEMORY_HOTPLUG
> > 		if (zone)
> > 			setup_zone_pageset(zone);
> > 	#endif
> > 		/* we have to stop all cpus to guarantee there is no user
> > 		of zonelist */
> > 		stop_machine(__build_all_zonelists, pgdat, NULL);
> > 		/* cpuset refresh routine should be here */
> > 	}
> > 
> > When called during hotplug, it should execute the setup_zone_pageset(zone)
> > which allocates the private pageset.
> > However, with memhp_default_state=online, this happens early while
> > system_state == SYSTEM_BOOTING is still true, hence this step is skipped.
> > (and build_all_zonelists_init() is probably unsafe anyway at this point).
> > 
> > Another hotplug operation on the same zone then leads to zone_pcp_update(zone)
> > called from online_pages(), which updates the pcp->high for the shared
> > boot_pageset to a value higher than 1.
> > At that point, pages freed from Node X and Y Normal zones can end up on the same
> > pcplist and from there they can be freed to the wrong zone's freelist,
> > leading to the corruption and crashes.
> > 
> > Please, note that upstream has fixed that differently (and unintentionally) by
> > adding another boot state (SYSTEM_SCHEDULING), which is set before smp_init().
> > That should happen before memory hotplug events even with memhp_default_state=online.
> > Backporting that would be too intrusive.
> > 
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> > Debugged-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Yes, I believe this is the easiest and the least scary way to fix the
> issue for stable kernel users. Feel free to add
> Acked-by: Michal Hocko <mhocko@suse.com> # for stable trees
> 
> for that purpose.

Now queued up, thanks!

greg k-h


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Patch "mm: Avoid calling build_all_zonelists_init under hotplug context" has been added to the 4.9-stable tree
  2020-08-18 11:00 [PATCH STABLE 4.9] mm: Avoid calling build_all_zonelists_init under hotplug context Oscar Salvador
  2020-08-18 12:24 ` Michal Hocko
  2020-08-18 12:26 ` David Hildenbrand
@ 2020-08-20  8:39 ` gregkh
  2 siblings, 0 replies; 5+ messages in thread
From: gregkh @ 2020-08-20  8:39 UTC (permalink / raw)
  To: david, gregkh, linux-mm, mhocko, osalvador, vbabka, vbabka; +Cc: stable-commits


This is a note to let you know that I've just added the patch titled

    mm: Avoid calling build_all_zonelists_init under hotplug context

to the 4.9-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     mm-avoid-calling-build_all_zonelists_init-under-hotplug-context.patch
and it can be found in the queue-4.9 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@vger.kernel.org> know about it.


From osalvador@suse.de  Thu Aug 20 10:13:12 2020
From: Oscar Salvador <osalvador@suse.de>
Date: Tue, 18 Aug 2020 13:00:46 +0200
Subject: mm: Avoid calling build_all_zonelists_init under hotplug context
To: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.com, vbabka@suse.com, david@redhat.com, Oscar Salvador <osalvador@suse.de>, Vlastimil Babka <vbabka@suse.cz>
Message-ID: <20200818110046.6664-1-osalvador@suse.de>

From: Oscar Salvador <osalvador@suse.de>

Recently a customer of ours experienced a crash when booting the
system while enabling memory-hotplug.

The problem is that Normal zones on different nodes don't get their private
zone->pageset allocated, and keep sharing the initial boot_pageset.
The sharing between zones is normally safe as explained by the comment for
boot_pageset - it's a percpu structure, and manipulations are done with
disabled interrupts, and boot_pageset is set up in a way that any page placed
on its pcplist is immediately flushed to shared zone's freelist, because
pcp->high == 1.
However, the hotplug operation updates pcp->high to a higher value as it
expects to be operating on a private pageset.

The problem is in build_all_zonelists(), which is called when the first range
of pages is onlined for the Normal zone of node X or Y:

	if (system_state == SYSTEM_BOOTING) {
		build_all_zonelists_init();
	} else {
	#ifdef CONFIG_MEMORY_HOTPLUG
		if (zone)
			setup_zone_pageset(zone);
	#endif
		/* we have to stop all cpus to guarantee there is no user
		of zonelist */
		stop_machine(__build_all_zonelists, pgdat, NULL);
		/* cpuset refresh routine should be here */
	}

When called during hotplug, it should execute the setup_zone_pageset(zone)
which allocates the private pageset.
However, with memhp_default_state=online, this happens early while
system_state == SYSTEM_BOOTING is still true, hence this step is skipped.
(and build_all_zonelists_init() is probably unsafe anyway at this point).

Another hotplug operation on the same zone then leads to zone_pcp_update(zone)
called from online_pages(), which updates the pcp->high for the shared
boot_pageset to a value higher than 1.
At that point, pages freed from Node X and Y Normal zones can end up on the same
pcplist and from there they can be freed to the wrong zone's freelist,
leading to the corruption and crashes.

Please, note that upstream has fixed that differently (and unintentionally) by
adding another boot state (SYSTEM_SCHEDULING), which is set before smp_init().
That should happen before memory hotplug events even with memhp_default_state=online.
Backporting that would be too intrusive.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Debugged-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com> # for stable trees
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 include/linux/mmzone.h |    3 ++-
 init/main.c            |    2 +-
 mm/memory_hotplug.c    |   10 +++++-----
 mm/page_alloc.c        |    7 ++++---
 4 files changed, 12 insertions(+), 10 deletions(-)

--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -756,7 +756,8 @@ static inline bool is_dev_zone(const str
 #include <linux/memory_hotplug.h>
 
 extern struct mutex zonelists_mutex;
-void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
+void build_all_zonelists(pg_data_t *pgdat, struct zone *zone,
+			 bool hotplug_context);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
 bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			 int classzone_idx, unsigned int alloc_flags,
--- a/init/main.c
+++ b/init/main.c
@@ -512,7 +512,7 @@ asmlinkage __visible void __init start_k
 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
 	boot_cpu_hotplug_init();
 
-	build_all_zonelists(NULL, NULL);
+	build_all_zonelists(NULL, NULL, false);
 	page_alloc_init();
 
 	pr_notice("Kernel command line: %s\n", boot_command_line);
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1125,7 +1125,7 @@ int __ref online_pages(unsigned long pfn
 	mutex_lock(&zonelists_mutex);
 	if (!populated_zone(zone)) {
 		need_zonelists_rebuild = 1;
-		build_all_zonelists(NULL, zone);
+		build_all_zonelists(NULL, zone, true);
 	}
 
 	ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
@@ -1146,7 +1146,7 @@ int __ref online_pages(unsigned long pfn
 	if (onlined_pages) {
 		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
-			build_all_zonelists(NULL, NULL);
+			build_all_zonelists(NULL, NULL, true);
 		else
 			zone_pcp_update(zone);
 	}
@@ -1220,7 +1220,7 @@ static pg_data_t __ref *hotadd_new_pgdat
 	 * to access not-initialized zonelist, build here.
 	 */
 	mutex_lock(&zonelists_mutex);
-	build_all_zonelists(pgdat, NULL);
+	build_all_zonelists(pgdat, NULL, true);
 	mutex_unlock(&zonelists_mutex);
 
 	/*
@@ -1276,7 +1276,7 @@ int try_online_node(int nid)
 
 	if (pgdat->node_zonelists->_zonerefs->zone == NULL) {
 		mutex_lock(&zonelists_mutex);
-		build_all_zonelists(NULL, NULL);
+		build_all_zonelists(NULL, NULL, true);
 		mutex_unlock(&zonelists_mutex);
 	}
 
@@ -2016,7 +2016,7 @@ repeat:
 	if (!populated_zone(zone)) {
 		zone_pcp_reset(zone);
 		mutex_lock(&zonelists_mutex);
-		build_all_zonelists(NULL, NULL);
+		build_all_zonelists(NULL, NULL, true);
 		mutex_unlock(&zonelists_mutex);
 	} else
 		zone_pcp_update(zone);
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4608,7 +4608,7 @@ int numa_zonelist_order_handler(struct c
 			user_zonelist_order = oldval;
 		} else if (oldval != user_zonelist_order) {
 			mutex_lock(&zonelists_mutex);
-			build_all_zonelists(NULL, NULL);
+			build_all_zonelists(NULL, NULL, false);
 			mutex_unlock(&zonelists_mutex);
 		}
 	}
@@ -4988,11 +4988,12 @@ build_all_zonelists_init(void)
  * (2) call of __init annotated helper build_all_zonelists_init
  * [protected by SYSTEM_BOOTING].
  */
-void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
+void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone,
+			       bool hotplug_context)
 {
 	set_zonelist_order();
 
-	if (system_state == SYSTEM_BOOTING) {
+	if (system_state == SYSTEM_BOOTING && !hotplug_context) {
 		build_all_zonelists_init();
 	} else {
 #ifdef CONFIG_MEMORY_HOTPLUG


Patches currently in stable-queue which might be from osalvador@suse.de are

queue-4.9/mm-avoid-calling-build_all_zonelists_init-under-hotplug-context.patch


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-08-20  8:39 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-18 11:00 [PATCH STABLE 4.9] mm: Avoid calling build_all_zonelists_init under hotplug context Oscar Salvador
2020-08-18 12:24 ` Michal Hocko
2020-08-20  8:14   ` Greg KH
2020-08-18 12:26 ` David Hildenbrand
2020-08-20  8:39 ` Patch "mm: Avoid calling build_all_zonelists_init under hotplug context" has been added to the 4.9-stable tree gregkh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).