linux-arm-kernel.lists.infradead.org archive mirror
* Race condition in build_all_zonelists() when offlining movable zone
@ 2022-08-17  3:42 Patrick Daly
  2022-08-17  6:38 ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Patrick Daly @ 2022-08-17  3:42 UTC (permalink / raw)
  To: linux-mm, linux-arm-kernel, mhocko

System: arm64 with 5.15 based kernel. CONFIG_NUMA=n.

NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
[0] - ZONE_MOVABLE
[1] - ZONE_NORMAL
[2] - NULL

For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the offset of
ZONE_NORMAL in ac->preferred_zoneref. If a concurrent memory_offline operation
removes the last page from ZONE_MOVABLE, build_all_zonelists() &
build_zonerefs_node() will update node_zonelists as shown below. Only
populated zones are added.

NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
[0] - ZONE_NORMAL
[1] - NULL
[2] - NULL  
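
The filtering happens in build_zonerefs_node(); roughly like the sketch
below (simplified and paraphrased, not the verbatim 5.15 source). Once
ZONE_MOVABLE no longer passes the populated check it is skipped and every
entry after it shifts up:

	/* simplified sketch of build_zonerefs_node(), not verbatim */
	static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
	{
		enum zone_type zone_type = MAX_NR_ZONES;
		int nr_zones = 0;

		do {
			struct zone *zone = pgdat->node_zones + --zone_type;

			/* an offlined, now-empty ZONE_MOVABLE is dropped here */
			if (populated_zone(zone))
				zoneref_set_zone(zone, &zonerefs[nr_zones++]);
		} while (zone_type);

		return nr_zones;
	}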

The thread in alloc_pages_slowpath() will call get_page_from_freelist()
repeatedly to allocate from the zones in zonelist beginning from
preferred_zoneref. Since this is now NULL, it will never succeed, and OOM
killer will kill all killable processes.
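
In other words, the zonelist walk in get_page_from_freelist() now starts
from a zoneref that holds the NULL terminator, so it never tries a single
zone (simplified sketch, not the exact 5.15 code):

	z = ac->preferred_zoneref;	/* computed against the old zonelist */
	for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
					ac->nodemask) {
		/*
		 * Never entered: z->zone is NULL after the rebuild, which
		 * ends the iteration immediately, so every retry from the
		 * slowpath fails and the OOM killer is eventually invoked.
		 */
	}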

I noticed a comment on a recent change bb7645c33869
("mm, page_alloc: fix build_zonerefs_node()") which appeared to be relevant,
but later replies indicated concerns with performance implications.
https://lore.kernel.org/linux-mm/Yk7NqTlw7lmFzpKb@dhcp22.suse.cz/



* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-17  3:42 Race condition in build_all_zonelists() when offlining movable zone Patrick Daly
@ 2022-08-17  6:38 ` Michal Hocko
  2022-08-17  6:59   ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2022-08-17  6:38 UTC (permalink / raw)
  To: Patrick Daly
  Cc: linux-mm, linux-arm-kernel, Mel Gorman, David Hildenbrand, Juergen Gross

[Cc Mel, David and Juergen]

On Tue 16-08-22 20:42:50, Patrick Daly wrote:
> System: arm64 with 5.15 based kernel. CONFIG_NUMA=n.
> 
> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
> [0] - ZONE_MOVABLE
> [1] - ZONE_NORMAL
> [2] - NULL
> 
> For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the offset of
> ZONE_NORMAL in ac->preferred_zoneref. If a concurrent memory_offline operation
> removes the last page from ZONE_MOVABLE, build_all_zonelists() &
> build_zonerefs_node() will update node_zonelists as shown below. Only
> populated zones are added.
>
> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
> [0] - ZONE_NORMAL
> [1] - NULL
> [2] - NULL  
> 
> The thread in alloc_pages_slowpath() will call get_page_from_freelist()
> repeatedly to allocate from the zones in zonelist beginning from
> preferred_zoneref. Since this is now NULL, it will never succeed, and OOM
> killer will kill all killable processes.
>
> I noticed a comment on a recent change bb7645c33869 ("mm, page_alloc:
> fix build_zonerefs_node()") which appeared to be relevant, but later
> replies indicated concerns with performance implications.
> https://lore.kernel.org/linux-mm/Yk7NqTlw7lmFzpKb@dhcp22.suse.cz/

I guess you mean e553f62f10d9 here. After re-reading the discussion I
seem to remember. We've decided to go with a simple fix (the said
commit) but I do not think we have realized this side effect of the
zonelists index invalidating.

In order to address that, we should either have to call first_zones_zonelist
inside get_page_from_freelist if the zoneref doesn't correspond to a
real zone in the zonelist or we should revisit my older approach
referenced above.

Thanks for the report!
-- 
Michal Hocko
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-17  6:38 ` Michal Hocko
@ 2022-08-17  6:59   ` Michal Hocko
  2022-08-17 10:40     ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2022-08-17  6:59 UTC (permalink / raw)
  To: Patrick Daly
  Cc: linux-mm, linux-arm-kernel, Mel Gorman, David Hildenbrand, Juergen Gross

On Wed 17-08-22 08:38:13, Michal Hocko wrote:
> [Cc Mel, David and Juergen]
> 
> On Tue 16-08-22 20:42:50, Patrick Daly wrote:
> > System: arm64 with 5.15 based kernel. CONFIG_NUMA=n.
> > 
> > NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
> > [0] - ZONE_MOVABLE
> > [1] - ZONE_NORMAL
> > [2] - NULL
> > 
> > For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the offset of
> > ZONE_NORMAL in ac->preferred_zoneref. If a concurrent memory_offline operation
> > removes the last page from ZONE_MOVABLE, build_all_zonelists() &
> > build_zonerefs_node() will update node_zonelists as shown below. Only
> > populated zones are added.
> >
> > NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
> > [0] - ZONE_NORMAL
> > [1] - NULL
> > [2] - NULL  
> > 
> > The thread in alloc_pages_slowpath() will call get_page_from_freelist()
> > repeatedly to allocate from the zones in zonelist beginning from
> > preferred_zoneref. Since this is now NULL, it will never succeed, and OOM
> > killer will kill all killable processes.
> >
> > I noticed a comment on a recent change bb7645c33869 ("mm, page_alloc:
> > fix build_zonerefs_node()") which appeared to be relevant, but later
> > replies indicated concerns with performance implications.
> > https://lore.kernel.org/linux-mm/Yk7NqTlw7lmFzpKb@dhcp22.suse.cz/
> 
> I guess you mean e553f62f10d9 here. After re-reading the discussion I
> seem to remember. We've decided to go with a simple fix (the said
> commit) but I do not think we have realized this side effect of the
> zonelists index invalidating.
> 
> In order to address that, we should either have to call first_zones_zonelist
> inside get_page_from_freelist if the zoneref doesn't correspond to a
> real zone in the zonelist or we should revisit my older approach
> referenced above.

Would this work? It is not really great to pay an overhead for unlikely
event in the hot path but we might use a similar trick to check_retry_cpuset
in the slowpath to detect this situation. 

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b0bcab50f0a3..bce786d7fcb4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
 	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
+
+	/*
+	 * A race with memory offlining could alter zones on the zonelist
+	 * e.g. dropping the top (movable) zone if it gets unpopulated
+	 * and so preferred_zoneref is not valid anymore
+	 */
+	if (unlikely(!ac->preferred_zoneref->zone))
+		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
+					ac->highest_zoneidx, ac->nodemask);
 	z = ac->preferred_zoneref;
+
 	for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
 					ac->nodemask) {
 		struct page *page;
-- 
Michal Hocko
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-17  6:59   ` Michal Hocko
@ 2022-08-17 10:40     ` Mel Gorman
  2022-08-17 10:54       ` Michal Hocko
  2022-08-23  6:36       ` David Hildenbrand
  0 siblings, 2 replies; 24+ messages in thread
From: Mel Gorman @ 2022-08-17 10:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Patrick Daly, linux-mm, linux-arm-kernel, David Hildenbrand,
	Juergen Gross

On Wed, Aug 17, 2022 at 08:59:11AM +0200, Michal Hocko wrote:
> > In order to address that, we should either have to call first_zones_zonelist
> > inside get_page_from_freelist if the zoneref doesn't correspond to a
> > real zone in the zonelist or we should revisit my older approach
> > referenced above.
> 
> Would this work? It is not really great to pay an overhead for unlikely
> event in the hot path but we might use a similar trick to check_retry_cpuset
> in the slowpath to detect this situation. 
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b0bcab50f0a3..bce786d7fcb4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
>  	 */
>  	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
> +
> +	/*
> +	 * A race with memory offlining could alter zones on the zonelist
> +	 * e.g. dropping the top (movable) zone if it gets unpopulated
> +	 * and so preferred_zoneref is not valid anymore
> +	 */
> +	if (unlikely(!ac->preferred_zoneref->zone))
> +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> +					ac->highest_zoneidx, ac->nodemask);
>  	z = ac->preferred_zoneref;
> +

ac->preferred_zoneref->zone could still be a valid pointer to a zone,
but an empty one so that would imply

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..38ce123af543 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5191,6 +5191,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (check_retry_cpuset(cpuset_mems_cookie, ac))
 		goto retry_cpuset;
 
+	/* Hotplug could have drained the preferred zone. */
+	if (!populated_zone(ac->preferred_zoneref->zone))
+		goto retry_cpuset;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)

But even that is fragile. If there were multiple zones in the zonelist
and the preferred zone was further down the list, the zone could still
be populated but a different zone than expected. It may be better to have
the same type of seq counter that restarts the allocation attempt if the
zonelist changes.

So.... this? It is seqcount only with a basic lock as there already is a
full lock on the writer side and it would appear to be overkill to protect
the reader side with read_seqbegin_or_lock as it complicates the writer side.

(boot tested only)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..158954b10724 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,6 +4708,22 @@ void fs_reclaim_release(gfp_t gfp_mask)
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
 
+/*
+ * Zonelists may change due to hotplug during allocation. Detect when zonelists
+ * have been rebuilt so allocation retries.
+ */
+static seqcount_t zonelist_update_seq = SEQCNT_ZERO(zonelist_update_seq);
+
+static unsigned int zonelist_update_begin(void)
+{
+	return read_seqcount_begin(&zonelist_update_seq);
+}
+
+static unsigned int zonelist_update_retry(unsigned int seq)
+{
+	return read_seqcount_retry(&zonelist_update_seq, seq);
+}
+
 /* Perform direct synchronous page reclaim */
 static unsigned long
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -5001,6 +5017,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int compaction_retries;
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
+	unsigned int zonelist_update_cookie;
 	int reserve_flags;
 
 	/*
@@ -5016,6 +5033,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
+	zonelist_update_cookie = zonelist_update_begin();
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -5191,6 +5209,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (check_retry_cpuset(cpuset_mems_cookie, ac))
 		goto retry_cpuset;
 
+	if (zonelist_update_retry(zonelist_update_cookie))
+		goto retry_cpuset;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
@@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
 	static DEFINE_SPINLOCK(lock);
 
 	spin_lock(&lock);
+	write_seqcount_begin(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
+	write_seqcount_end(&zonelist_update_seq);
 	spin_unlock(&lock);
 }
 
-- 
Mel Gorman
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-17 10:40     ` Mel Gorman
@ 2022-08-17 10:54       ` Michal Hocko
  2022-08-17 11:26         ` Mel Gorman
  2022-08-23  6:36       ` David Hildenbrand
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2022-08-17 10:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Patrick Daly, linux-mm, linux-arm-kernel, David Hildenbrand,
	Juergen Gross

On Wed 17-08-22 11:40:28, Mel Gorman wrote:
> On Wed, Aug 17, 2022 at 08:59:11AM +0200, Michal Hocko wrote:
> > > In order to address that, we should either have to call first_zones_zonelist
> > > inside get_page_from_freelist if the zoneref doesn't correspond to a
> > > real zone in the zonelist or we should revisit my older approach
> > > referenced above.
> > 
> > Would this work? It is not really great to pay an overhead for unlikely
> > event in the hot path but we might use a similar trick to check_retry_cpuset
> > in the slowpath to detect this situation. 
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b0bcab50f0a3..bce786d7fcb4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
> >  	 */
> >  	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
> > +
> > +	/*
> > +	 * A race with memory offlining could alter zones on the zonelist
> > +	 * e.g. dropping the top (movable) zone if it gets unpopulated
> > +	 * and so preferred_zoneref is not valid anymore
> > +	 */
> > +	if (unlikely(!ac->preferred_zoneref->zone))
> > +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > +					ac->highest_zoneidx, ac->nodemask);
> >  	z = ac->preferred_zoneref;
> > +
> 
> ac->preferred_zoneref->zone could still be a valid pointer to a zone,
> but an empty one so that would imply

OK, I managed to confuse myself. I thought the zoneref always points to
a specific zone in the zonelist.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..38ce123af543 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5191,6 +5191,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (check_retry_cpuset(cpuset_mems_cookie, ac))
>  		goto retry_cpuset;
>  
> +	/* Hotplug could have drained the preferred zone. */
> +	if (!populated_zone(ac->preferred_zoneref->zone))
> +		goto retry_cpuset;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)

Would it be possible to have preferred_zoneref always unpopulated?
 
> But even that is fragile. If there were multiple zones in the zonelist
> and the preferred zone was further down the list, the zone could still
> be populated but a different zone than expected. It may be better to have
> the same type of seq counter that restarts the allocation attempt if the
> zonelist changes.
> 
> So.... this? It is seqcount only with a basic lock as there already is a
> full lock on the writer side and it would appear to be overkill to protect
> the reader side with read_seqbegin_or_lock as it complicates the writer side.
> 
> (boot tested only)

yes, from a quick look it seems ok and it matches handling of a race for
cpuset which is not the same thing but very similar in principle. So
conceptually looks good to me.
-- 
Michal Hocko
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-17 10:54       ` Michal Hocko
@ 2022-08-17 11:26         ` Mel Gorman
  2022-08-19  2:11           ` Patrick Daly
  2022-08-22 20:18           ` Patrick Daly
  0 siblings, 2 replies; 24+ messages in thread
From: Mel Gorman @ 2022-08-17 11:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Patrick Daly, linux-mm, linux-arm-kernel, David Hildenbrand,
	Juergen Gross

On Wed, Aug 17, 2022 at 12:54:08PM +0200, Michal Hocko wrote:
> On Wed 17-08-22 11:40:28, Mel Gorman wrote:
> > On Wed, Aug 17, 2022 at 08:59:11AM +0200, Michal Hocko wrote:
> > > > In order to address that, we should either have to call first_zones_zonelist
> > > > inside get_page_from_freelist if the zoneref doesn't correspond to a
> > > > real zone in the zonelist or we should revisit my older approach
> > > > referenced above.
> > > 
> > > Would this work? It is not really great to pay an overhead for unlikely
> > > event in the hot path but we might use a similar trick to check_retry_cpuset
> > > in the slowpath to detect this situation. 
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index b0bcab50f0a3..bce786d7fcb4 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> > >  	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
> > >  	 */
> > >  	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
> > > +
> > > +	/*
> > > +	 * A race with memory offlining could alter zones on the zonelist
> > > +	 * e.g. dropping the top (movable) zone if it gets unpopulated
> > > +	 * and so preferred_zoneref is not valid anymore
> > > +	 */
> > > +	if (unlikely(!ac->preferred_zoneref->zone))
> > > +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > > +					ac->highest_zoneidx, ac->nodemask);
> > >  	z = ac->preferred_zoneref;
> > > +
> > 
> > ac->preferred_zoneref->zone could still be a valid pointer to a zone,
> > but an empty one so that would imply
> 
> OK, I managed to confuse myself. I thought the zoneref always points to
> a specific zone in the zonelist.

It's a bit confusing.

> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index e5486d47406e..38ce123af543 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5191,6 +5191,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	if (check_retry_cpuset(cpuset_mems_cookie, ac))
> >  		goto retry_cpuset;
> >  
> > +	/* Hotplug could have drained the preferred zone. */
> > +	if (!populated_zone(ac->preferred_zoneref->zone))
> > +		goto retry_cpuset;
> > +
> >  	/* Reclaim has failed us, start killing things */
> >  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> >  	if (page)
> 
> Would it be possible to have preferred_zoneref always unpopulated?
>  

I don't think so but this is not my preferred approach anyway.

> > But even that is fragile. If there were multiple zones in the zonelist
> > and the preferred zone was further down the list, the zone could still
> > be populated but a different zone than expected. It may be better to have
> > the same type of seq counter that restarts the allocation attempt if the
> > zonelist changes.
> > 
> > So.... this? It is seqcount only with a basic lock as there already is a
> > full lock on the writer side and it would appear to be overkill to protect
> > the reader side with read_seqbegin_or_lock as it complicates the writer side.
> > 
> > (boot tested only)
> 
> yes, from a quick look it seems ok and it matches handling of a race for
> cpuset which is not the same thing but very similar in principle. So
> conceptually looks good to me.

Patrick, as you have a hotplug test that removes a full zone, would you
like to test? I haven't boot tested it on v5.15 but it applies cleanly
and compiles at least.

-- 
Mel Gorman
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-17 11:26         ` Mel Gorman
@ 2022-08-19  2:11           ` Patrick Daly
  2022-08-22 20:18           ` Patrick Daly
  1 sibling, 0 replies; 24+ messages in thread
From: Patrick Daly @ 2022-08-19  2:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, linux-mm, linux-arm-kernel, David Hildenbrand,
	Juergen Gross

On Wed, Aug 17, 2022 at 12:26:47PM +0100, Mel Gorman wrote:
> 
> Patrick, as you have a hotplug test that removes a full zone, would you
> like to test? I haven't boot tested it on v5.15 but it applies cleanly
> and compiles at least.

Certainly. I'll let you know when we've finished our test run.


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-17 11:26         ` Mel Gorman
  2022-08-19  2:11           ` Patrick Daly
@ 2022-08-22 20:18           ` Patrick Daly
  1 sibling, 0 replies; 24+ messages in thread
From: Patrick Daly @ 2022-08-22 20:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, linux-mm, linux-arm-kernel, David Hildenbrand,
	Juergen Gross

On Wed, Aug 17, 2022 at 12:26:47PM +0100, Mel Gorman wrote:
> Patrick, as you have a hotplug test that removes a full zone, would you
> like to test? I haven't boot tested it on v5.15 but it applies cleanly
> and compiles at least.
> 

Hi Mel/Michal

Thanks for the patch. I did not see any recurrence of the original issue,
or any stability issues linked to the new code.

--Patrick


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-17 10:40     ` Mel Gorman
  2022-08-17 10:54       ` Michal Hocko
@ 2022-08-23  6:36       ` David Hildenbrand
  2022-08-23  8:33         ` Mel Gorman
  1 sibling, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2022-08-23  6:36 UTC (permalink / raw)
  To: Mel Gorman, Michal Hocko
  Cc: Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On 17.08.22 12:40, Mel Gorman wrote:
> On Wed, Aug 17, 2022 at 08:59:11AM +0200, Michal Hocko wrote:
>>> In order to address that, we should either have to call first_zones_zonelist
>>> inside get_page_from_freelist if the zoneref doesn't correspond to a
>>> real zone in the zonelist or we should revisit my older approach
>>> referenced above.
>>
>> Would this work? It is not really great to pay an overhead for unlikely
>> event in the hot path but we might use a similar trick to check_retry_cpuset
>> in the slowpath to detect this situation. 
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b0bcab50f0a3..bce786d7fcb4 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>>  	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
>>  	 */
>>  	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
>> +
>> +	/*
>> +	 * A race with memory offlining could alter zones on the zonelist
>> +	 * e.g. dropping the top (movable) zone if it gets unpopulated
>> +	 * and so preferred_zoneref is not valid anymore
>> +	 */
>> +	if (unlikely(!ac->preferred_zoneref->zone))
>> +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
>> +					ac->highest_zoneidx, ac->nodemask);
>>  	z = ac->preferred_zoneref;
>> +
> 
> ac->preferred_zoneref->zone could still be a valid pointer to a zone,
> but an empty one so that would imply
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..38ce123af543 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5191,6 +5191,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (check_retry_cpuset(cpuset_mems_cookie, ac))
>  		goto retry_cpuset;
>  
> +	/* Hotplug could have drained the preferred zone. */
> +	if (!populated_zone(ac->preferred_zoneref->zone))
> +		goto retry_cpuset;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> 
> But even that is fragile. If there were multiple zones in the zonelist
> and the preferred zone was further down the list, the zone could still
> be populated but a different zone than expected. It may be better to have
> the same type of seq counter that restarts the allocation attempt if the
> zonelist changes.
> 
> So.... this? It is seqcount only with a basic lock as there already is a
> full lock on the writer side and it would appear to be overkill to protect
> the reader side with read_seqbegin_or_lock as it complicates the writer side.
> 
> (boot tested only)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..158954b10724 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,6 +4708,22 @@ void fs_reclaim_release(gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(fs_reclaim_release);
>  #endif
>  
> +/*
> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> + * have been rebuilt so allocation retries.
> + */
> +static seqcount_t zonelist_update_seq = SEQCNT_ZERO(zonelist_update_seq);
> +
> +static unsigned int zonelist_update_begin(void)
> +{
> +	return read_seqcount_begin(&zonelist_update_seq);
> +}
> +
> +static unsigned int zonelist_update_retry(unsigned int seq)
> +{
> +	return read_seqcount_retry(&zonelist_update_seq, seq);
> +}
> +
>  /* Perform direct synchronous page reclaim */
>  static unsigned long
>  __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -5001,6 +5017,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int compaction_retries;
>  	int no_progress_loops;
>  	unsigned int cpuset_mems_cookie;
> +	unsigned int zonelist_update_cookie;
>  	int reserve_flags;
>  
>  	/*
> @@ -5016,6 +5033,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	no_progress_loops = 0;
>  	compact_priority = DEF_COMPACT_PRIORITY;
>  	cpuset_mems_cookie = read_mems_allowed_begin();
> +	zonelist_update_cookie = zonelist_update_begin();
>  
>  	/*
>  	 * The fast path uses conservative alloc_flags to succeed only until
> @@ -5191,6 +5209,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (check_retry_cpuset(cpuset_mems_cookie, ac))
>  		goto retry_cpuset;
>  
> +	if (zonelist_update_retry(zonelist_update_cookie))
> +		goto retry_cpuset;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
>  	static DEFINE_SPINLOCK(lock);
>  
>  	spin_lock(&lock);
> +	write_seqcount_begin(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>  
> +	write_seqcount_end(&zonelist_update_seq);
>  	spin_unlock(&lock);

Do we want to get rid of the static lock by using a seqlock_t instead of
a seqcount_t?


-- 
Thanks,

David / dhildenb



* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23  6:36       ` David Hildenbrand
@ 2022-08-23  8:33         ` Mel Gorman
  2022-08-23  8:52           ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2022-08-23  8:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On Tue, Aug 23, 2022 at 08:36:34AM +0200, David Hildenbrand wrote:
> > @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
> >  	static DEFINE_SPINLOCK(lock);
> >  
> >  	spin_lock(&lock);
> > +	write_seqcount_begin(&zonelist_update_seq);
> >  
> >  #ifdef CONFIG_NUMA
> >  	memset(node_load, 0, sizeof(node_load));
> > @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
> >  #endif
> >  	}
> >  
> > +	write_seqcount_end(&zonelist_update_seq);
> >  	spin_unlock(&lock);
> 
> Do we want to get rid of the static lock by using a seqlock_t instead of
> a seqcount_t?
> 

I do not think so because it's a relatively heavy lock compared to the
counter and the read side.

As the read-side can be called from hardirq or softirq context, the
write-side needs to disable irqs or bottom halves as well according to the
documentation. That is relatively minor as the write side is rare but it's
trickier because the calling context can be either IRQ or softirq so the IRQ
protection would have to be used.

The read-side also gets more complicated. The read side becomes
either read_seqlock_excl() (or bh or IRQ as appropriate) or
read_seqbegin_or_lock. The read_seqlock_excl acts like a spinlock blocking
out other readers and writers, we definitely do not want to block out other
readers in the allocator path because .... that is crazy, it's basically a
global memory allocator lock. There is not an obvious option of limiting
the scope of that lock such as a single zone because it's the zonelists
we care about, not an individual zone. I guess it could be done on a
pgdat basis selected by the preferred zone but it's also unnecessarily
complicated and a relatively heavy lock.

The other obvious choice is read_seqbegin_or_lock to locklessly try and
then retry if necessary. This has better semantics as a lockless version
exists with the caller tracking more state but again, the retry step is
heavy and acts as a global lock.

In this case, the seqcounter or seqlock is protecting relatively simple
data -- the zonelist pointing to struct zones that never disappear (the
zone might be unpopulated but the struct zone still exists). The critical
data being protected in this context is either the PCP lists or the buddy
lists, each which has separate locking. The zonelist needs less protection
although RCU protection would be a potential, if somewhat complicated
option, as even if the zonelist itself is valid after an RCU update,
the zones listed are not necessarily useful any more.

There is no real advantage to using seqcount_spinlock_t either as the
associated lock would be a global lock and there isn't any lockdep advantage
to doing that either.

As the alternatives have side effects, I would prefer to see any proposed
conversion on top of the fix with review determining if there is any
unnecessary additional serialisation.

-- 
Mel Gorman
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23  8:33         ` Mel Gorman
@ 2022-08-23  8:52           ` David Hildenbrand
  2022-08-23  9:49             ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2022-08-23  8:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On 23.08.22 10:33, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 08:36:34AM +0200, David Hildenbrand wrote:
>>> @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
>>>  	static DEFINE_SPINLOCK(lock);
>>>  
>>>  	spin_lock(&lock);
>>> +	write_seqcount_begin(&zonelist_update_seq);
>>>  
>>>  #ifdef CONFIG_NUMA
>>>  	memset(node_load, 0, sizeof(node_load));
>>> @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
>>>  #endif
>>>  	}
>>>  
>>> +	write_seqcount_end(&zonelist_update_seq);
>>>  	spin_unlock(&lock);
>>
>> Do we want to get rid of the static lock by using a seqlock_t instead of
>> a seqcount_t?
>>
> 
> I do not think so because it's a relatively heavy lock compared to the
> counter and the read side.

I was primarily asking because seqlock.h states: "Sequential locks
(seqlock_t):  Sequence counters with an embedded spinlock for writer
serialization and non-preemptibility." seems to be precisely what we are
doing here.

> 
> As the read-side can be called from hardirq or softirq context, the
> write-side needs to disable irqs or bottom halves as well according to the
> documentation. That is relatively minor as the write side is rare but it's
> trickier because the calling context can be either IRQ or softirq so the IRQ
> protection would have to be used.

Naive me would just have used write_seqlock()/write_sequnlock() and
read_seqbegin()/read_seqretry() to result in almost the same code as
with your change -- but having both mechanisms encapsulated.


Yeah, there are special write_seqlock_bh()/write_sequnlock_bh(),
write_sequnlock_irq() ... but IIRC we don't have to care about that at
all when just using the primitives as above. But most probably I am
missing something important.

> 
> The read-side also gets more complicated. The read side becomes
> either read_seqlock_excl() (or bh or IRQ as appropriate) or
> read_seqbegin_or_lock. The read_seqlock_excl acts like a spinlock blocking
> out other readers and writers, we definitely do not want to block out other
> readers in the allocator path because .... that is crazy, it's basically a
> global memory allocator lock. There is not an obvious option of limiting

Yes.

> the scope of that lock such as a single zone because it's the zonelists
> we care about, not an individual zone. I guess it could be done on a
> pgdat basis selected by the preferred zone but it's also unnecessarily
> complicated and a relatively heavy lock.
> 
> The other obvious choice is read_seqbegin_or_lock to locklessly try and
> then retry if necessary. This has better semantics as a lockless version
> exists with the caller tracking more state but again, the retry step is
> heavy and acts as a global lock.

Documentation/locking/seqlock.rst points out read path #1, which is just
for "Normal Sequence readers which never block a writer but they must
retry if a writer is in progress by detecting change in the sequence
number."

It's a simple use of read_seqbegin()/read_seqretry(), which translate
essentially to read_seqcount_begin()/read_seqcount_retry(), apart from
some KCSAN checks.
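
I.e. the generic pattern would be something like the below (just a
sketch, assuming zonelist_update_seq were declared with DEFINE_SEQLOCK(),
not wired into the actual allocator code):

	unsigned int seq;

	/* reader: lockless, simply retries if a zonelist rebuild raced */
	do {
		seq = read_seqbegin(&zonelist_update_seq);
		/* ... walk the zonelist / attempt the allocation ... */
	} while (read_seqretry(&zonelist_update_seq, seq));

	/* writer: serialized by the spinlock embedded in the seqlock_t */
	write_seqlock(&zonelist_update_seq);
	/* ... rebuild the zonelists ... */
	write_sequnlock(&zonelist_update_seq);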

> 
> In this case, the seqcounter or seqlock is protecting relatively simple
> data -- the zonelist pointing to struct zones that never disappear (the
> zone might be unpopulated but the struct zone still exists). The critical
> data being protected in this context is either the PCP lists or the buddy
> lists, each which has separate locking. The zonelist needs less protection
> although RCU protection would be a potential, if somewhat complicated
> option, as even if the zonelist itself is valid after an RCU update,
> the zones listed are not necessarily useful any more.
> 
> There is no real advantage to using seqcount_spinlock_t either as the
> associated lock would be a global lock and there isn't any lockdep advantage
> to doing that either.
> 
> As the alternatives have side effects, I would prefer to see any proposed
> conversion on top of the fix with review determining if there is any
> unnecessary additional serialisation.
> 

Agreed regarding more complicated conversions. Using a seqlock_t to
replace the spinlock and avoid introducing the seqcount here would have
been easy and straightforward, though.

-- 
Thanks,

David / dhildenb



* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23  8:52           ` David Hildenbrand
@ 2022-08-23  9:49             ` Mel Gorman
  2022-08-23 10:34               ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2022-08-23  9:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On Tue, Aug 23, 2022 at 10:52:34AM +0200, David Hildenbrand wrote:
> On 23.08.22 10:33, Mel Gorman wrote:
> > On Tue, Aug 23, 2022 at 08:36:34AM +0200, David Hildenbrand wrote:
> >>> @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
> >>>  	static DEFINE_SPINLOCK(lock);
> >>>  
> >>>  	spin_lock(&lock);
> >>> +	write_seqcount_begin(&zonelist_update_seq);
> >>>  
> >>>  #ifdef CONFIG_NUMA
> >>>  	memset(node_load, 0, sizeof(node_load));
> >>> @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
> >>>  #endif
> >>>  	}
> >>>  
> >>> +	write_seqcount_end(&zonelist_update_seq);
> >>>  	spin_unlock(&lock);
> >>
> >> Do we want to get rid of the static lock by using a seqlock_t instead of
> >> a seqcount_t?
> >>
> > 
> > I do not think so because it's a relatively heavy lock compared to the
> > counter and the read side.
> 
> I was primarily asking because seqlock.h states: "Sequential locks
> (seqlock_t):  Sequence counters with an embedded spinlock for writer
> serialization and non-preemptibility." seems to be precisely what we are
> doing here.
> 
> > 
> > As the read-side can be called from hardirq or softirq context, the
> > write-side needs to disable irqs or bottom halves as well according to the
> > documentation. That is relatively minor as the write side is rare but it's
> > trickier because the calling context can be either IRQ or softirq so the IRQ
> > protection would have to be used.
> 
> Naive me would just have used write_seqlock()/write_sequnlock() and
> read_seqbegin()/read_seqretry() to result in almost the same code as
> with your change -- but having both mechanisms encapsulated.
> 
> 
> Yeah, there are special write_seqlock_bh()/write_sequnlock_bh(),
> write_sequnlock_irq() ... but IIRC we don't have to care about that at
> all when just using the primitives as above. But most probably I am
> missing something important.
> 

You're not missing anything important, I'm just not a massive fan of the
API naming because it's unclear from the context if it's a plain counter
or a locked counter and felt it was better to keep the locking explicit.

A seqlock version is below. I updated the comments and naming to make it
clear the read-side is for iteration, what the locking protocol is and
match the retry naming with the cpuset equivalent. It boots on KVM but
would need another test from Patrick to be certain it still works. Patrick,
would you mind testing this version please?

---8<---
 mm/page_alloc.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..a644c7b638a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
 
+/*
+ * Zonelists may change due to hotplug during allocation. Detect when zonelists
+ * have been rebuilt so allocation retries. Reader side does not lock and
+ * retries the allocation if zonelist changes. Writer side is protected by the
+ * embedded spin_lock.
+ */
+DEFINE_SEQLOCK(zonelist_update_seq);
+
+static unsigned int zonelist_iter_begin(void)
+{
+	return read_seqbegin(&zonelist_update_seq);
+}
+
+static unsigned int check_retry_zonelist(unsigned int seq)
+{
+	return read_seqretry(&zonelist_update_seq, seq);
+}
+
 /* Perform direct synchronous page reclaim */
 static unsigned long
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int compaction_retries;
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
+	unsigned int zonelist_iter_cookie;
 	int reserve_flags;
 
 	/*
@@ -5016,6 +5035,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
+	zonelist_iter_cookie = zonelist_iter_begin();
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -5187,8 +5207,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry;
 
 
-	/* Deal with possible cpuset update races before we start OOM killing */
-	if (check_retry_cpuset(cpuset_mems_cookie, ac))
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * an unnecessary OOM kill.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
 		goto retry_cpuset;
 
 	/* Reclaim has failed us, start killing things */
@@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
-	static DEFINE_SPINLOCK(lock);
 
-	spin_lock(&lock);
+	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	spin_unlock(&lock);
+	write_sequnlock(&zonelist_update_seq);
 }
 
 static noinline void __init


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23  9:49             ` Mel Gorman
@ 2022-08-23 10:34               ` David Hildenbrand
  2022-08-23 10:57                 ` Michal Hocko
  2022-08-23 11:09                 ` Mel Gorman
  0 siblings, 2 replies; 24+ messages in thread
From: David Hildenbrand @ 2022-08-23 10:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On 23.08.22 11:49, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 10:52:34AM +0200, David Hildenbrand wrote:
>> On 23.08.22 10:33, Mel Gorman wrote:
>>> On Tue, Aug 23, 2022 at 08:36:34AM +0200, David Hildenbrand wrote:
>>>>> @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
>>>>>  	static DEFINE_SPINLOCK(lock);
>>>>>  
>>>>>  	spin_lock(&lock);
>>>>> +	write_seqcount_begin(&zonelist_update_seq);
>>>>>  
>>>>>  #ifdef CONFIG_NUMA
>>>>>  	memset(node_load, 0, sizeof(node_load));
>>>>> @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
>>>>>  #endif
>>>>>  	}
>>>>>  
>>>>> +	write_seqcount_end(&zonelist_update_seq);
>>>>>  	spin_unlock(&lock);
>>>>
>>>> Do we want to get rid of the static lock by using a seqlock_t instead of
>>>> a seqcount_t?
>>>>
>>>
>>> I do not think so because it's a relatively heavy lock compared to the
>>> counter and the read side.
>>
>> I was primarily asking because seqlock.h states: "Sequential locks
>> (seqlock_t):  Sequence counters with an embedded spinlock for writer
>> serialization and non-preemptibility." seems to be precisely what we are
>> doing here.
>>
>>>
>>> As the read-side can be called from hardirq or softirq context, the
>>> write-side needs to disable irqs or bottom halves as well according to the
>>> documentation. That is relatively minor as the write side is rare but it's
>>> trickier because the calling context can be either IRQ or softirq so the IRQ
>>> protection would have to be used.
>>
>> Naive me would just have used write_seqlock()/write_sequnlock() and
>> read_seqbegin()/read_seqretry() to result in almost the same code as
>> with your change -- but having both mechanisms encapsulated.
>>
>>
>> Yeah, there are special write_seqlock_bh()/write_sequnlock_bh(),
>> write_sequnlock_irq() ... but IIRC we don't have to care about that at
>> all when just using the primitives as above. But most probably I am
>> missing something important.
>>
> 
> You're not missing anything important, I'm just not a massive fan of the
> API naming because it's unclear from the context if it's a plain counter
> or a locked counter and felt it was better to keep the locking explicit.
> 
> A seqlock version is below. I updated the comments and naming to make it
> clear the read-side is for iteration, what the locking protocol is and
> match the retry naming with the cpuset equivalent. It boots on KVM but
> would need another test from Patrick to be certain it still works. Patrick,
> would you mind testing this version please?
> 
> ---8<---
>  mm/page_alloc.c | 33 ++++++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..a644c7b638a3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(fs_reclaim_release);
>  #endif
>  
> +/*
> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> + * have been rebuilt so allocation retries. Reader side does not lock and
> + * retries the allocation if zonelist changes. Writer side is protected by the
> + * embedded spin_lock.
> + */
> +DEFINE_SEQLOCK(zonelist_update_seq);
> +
> +static unsigned int zonelist_iter_begin(void)
> +{
> +	return read_seqbegin(&zonelist_update_seq);
> +}
> +
> +static unsigned int check_retry_zonelist(unsigned int seq)
> +{
> +	return read_seqretry(&zonelist_update_seq, seq);
> +}
> +
>  /* Perform direct synchronous page reclaim */
>  static unsigned long
>  __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int compaction_retries;
>  	int no_progress_loops;
>  	unsigned int cpuset_mems_cookie;
> +	unsigned int zonelist_iter_cookie;
>  	int reserve_flags;
>  
>  	/*
> @@ -5016,6 +5035,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	no_progress_loops = 0;
>  	compact_priority = DEF_COMPACT_PRIORITY;
>  	cpuset_mems_cookie = read_mems_allowed_begin();
> +	zonelist_iter_cookie = zonelist_iter_begin();
>  
>  	/*
>  	 * The fast path uses conservative alloc_flags to succeed only until
> @@ -5187,8 +5207,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto retry;
>  
>  
> -	/* Deal with possible cpuset update races before we start OOM killing */
> -	if (check_retry_cpuset(cpuset_mems_cookie, ac))
> +	/*
> +	 * Deal with possible cpuset update races or zonelist updates to avoid
> +	 * an unnecessary OOM kill.
> +	 */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> +	    check_retry_zonelist(zonelist_iter_cookie))
>  		goto retry_cpuset;
>  
>  	/* Reclaim has failed us, start killing things */
> @@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> -	static DEFINE_SPINLOCK(lock);
>  
> -	spin_lock(&lock);
> +	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>  
> -	spin_unlock(&lock);
> +	write_sequnlock(&zonelist_update_seq);
>  }
>  
>  static noinline void __init
> 

LGTM. The "retry_cpuset" label might deserve a better name now.

Would

Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
with pages managed by the buddy allocator")

be correct?

-- 
Thanks,

David / dhildenb



* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 10:34               ` David Hildenbrand
@ 2022-08-23 10:57                 ` Michal Hocko
  2022-08-23 11:09                 ` Mel Gorman
  1 sibling, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2022-08-23 10:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mel Gorman, Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On Tue 23-08-22 12:34:09, David Hildenbrand wrote:
> On 23.08.22 11:49, Mel Gorman wrote:
> > On Tue, Aug 23, 2022 at 10:52:34AM +0200, David Hildenbrand wrote:
> >> On 23.08.22 10:33, Mel Gorman wrote:
> >>> On Tue, Aug 23, 2022 at 08:36:34AM +0200, David Hildenbrand wrote:
> >>>>> @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
> >>>>>  	static DEFINE_SPINLOCK(lock);
> >>>>>  
> >>>>>  	spin_lock(&lock);
> >>>>> +	write_seqcount_begin(&zonelist_update_seq);
> >>>>>  
> >>>>>  #ifdef CONFIG_NUMA
> >>>>>  	memset(node_load, 0, sizeof(node_load));
> >>>>> @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
> >>>>>  #endif
> >>>>>  	}
> >>>>>  
> >>>>> +	write_seqcount_end(&zonelist_update_seq);
> >>>>>  	spin_unlock(&lock);
> >>>>
> >>>> Do we want to get rid of the static lock by using a seqlock_t instead of
> >>>> a seqcount_t?
> >>>>
> >>>
> >>> I do not think so because it's a relatively heavy lock compared to the
> >>> counter and the read side.
> >>
> >> I was primarily asking because seqlock.h states: "Sequential locks
> >> (seqlock_t):  Sequence counters with an embedded spinlock for writer
> >> serialization and non-preemptibility." seems to be precisely what we are
> >> doing here.
> >>
> >>>
> >>> As the read-side can be called from hardirq or softirq context, the
> >>> write-side needs to disable irqs or bottom halves as well according to the
> >>> documentation. That is relatively minor as the write side is rare but it's
> >>> trickier because the calling context can be either IRQ or softirq so the IRQ
> >>> protection would have to be used.
> >>
> >> Naive me would just have used write_seqlock()/write_sequnlock() and
> >> read_seqbegin()/read_seqretry() to result in almost the same code as
> >> with your change -- but having both mechanisms encapsulated.
> >>
> >>
> >> Yeah, there are special write_seqlock_bh()/write_sequnlock_bh(),
> >> write_sequnlock_irq() ... but IIRC we don't have to care about that at
> >> all when just using the primitives as above. But most probably I am
> >> missing something important.
> >>
> > 
> > You're not missing anything important, I'm just not a massive fan of the
> > API naming because it's unclear from the context if it's a plain counter
> > or a locked counter and felt it was better to keep the locking explicit.
> > 
> > A seqlock version is below. I updated the comments and naming to make it
> > clear the read-side is for iteration, what the locking protocol is and
> > match the retry naming with the cpuset equivalent. It boots on KVM but
> > would need another test from Patrick to be certain it still works. Patrick,
> > would you mind testing this version please?
> > 
> > ---8<---
> >  mm/page_alloc.c | 33 ++++++++++++++++++++++++++++-----
> >  1 file changed, 28 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index e5486d47406e..a644c7b638a3 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
> >  EXPORT_SYMBOL_GPL(fs_reclaim_release);
> >  #endif
> >  
> > +/*
> > + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> > + * have been rebuilt so allocation retries. Reader side does not lock and
> > + * retries the allocation if zonelist changes. Writer side is protected by the
> > + * embedded spin_lock.
> > + */
> > +DEFINE_SEQLOCK(zonelist_update_seq);
> > +
> > +static unsigned int zonelist_iter_begin(void)
> > +{
> > +	return read_seqbegin(&zonelist_update_seq);
> > +}
> > +
> > +static unsigned int check_retry_zonelist(unsigned int seq)
> > +{
> > +	return read_seqretry(&zonelist_update_seq, seq);
> > +}
> > +
> >  /* Perform direct synchronous page reclaim */
> >  static unsigned long
> >  __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> > @@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	int compaction_retries;
> >  	int no_progress_loops;
> >  	unsigned int cpuset_mems_cookie;
> > +	unsigned int zonelist_iter_cookie;
> >  	int reserve_flags;
> >  
> >  	/*
> > @@ -5016,6 +5035,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	no_progress_loops = 0;
> >  	compact_priority = DEF_COMPACT_PRIORITY;
> >  	cpuset_mems_cookie = read_mems_allowed_begin();
> > +	zonelist_iter_cookie = zonelist_iter_begin();
> >  
> >  	/*
> >  	 * The fast path uses conservative alloc_flags to succeed only until
> > @@ -5187,8 +5207,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		goto retry;
> >  
> >  
> > -	/* Deal with possible cpuset update races before we start OOM killing */
> > -	if (check_retry_cpuset(cpuset_mems_cookie, ac))
> > +	/*
> > +	 * Deal with possible cpuset update races or zonelist updates to avoid
> > +	 * an unnecessary OOM kill.
> > +	 */
> > +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> > +	    check_retry_zonelist(zonelist_iter_cookie))
> >  		goto retry_cpuset;
> >  
> >  	/* Reclaim has failed us, start killing things */
> > @@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
> >  	int nid;
> >  	int __maybe_unused cpu;
> >  	pg_data_t *self = data;
> > -	static DEFINE_SPINLOCK(lock);
> >  
> > -	spin_lock(&lock);
> > +	write_seqlock(&zonelist_update_seq);
> >  
> >  #ifdef CONFIG_NUMA
> >  	memset(node_load, 0, sizeof(node_load));
> > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> >  #endif
> >  	}
> >  
> > -	spin_unlock(&lock);
> > +	write_sequnlock(&zonelist_update_seq);
> >  }
> >  
> >  static noinline void __init
> > 
> 
> LGTM. The "retry_cpuset" label might deserve a better name now.

I will not object but I liked the previous version which was already in
line with the cpuset retry (read_mems_allowed_retry).
 
> Would
> 
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> with pages managed by the buddy allocator")
> 
> be correct?

Yes.

-- 
Michal Hocko
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 10:34               ` David Hildenbrand
  2022-08-23 10:57                 ` Michal Hocko
@ 2022-08-23 11:09                 ` Mel Gorman
  2022-08-23 12:18                   ` Michal Hocko
  1 sibling, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2022-08-23 11:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> >  #endif
> >  	}
> >  
> > -	spin_unlock(&lock);
> > +	write_sequnlock(&zonelist_update_seq);
> >  }
> >  
> >  static noinline void __init
> > 
> 
> LGTM. The "retry_cpuset" label might deserve a better name now.
> 

Good point ...  "restart"?

> Would
> 
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> with pages managed by the buddy allocator")
> 
> be correct?
> 

Not specifically because the bug is due to a zone being completely removed
resulting in a rebuild. This race probably existed ever since memory
hotremove could theoretically remove a complete zone. A Cc: Stable would
be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
that, it should be driven by a specific bug report showing that hot-remove
of a full zone was possible and triggered the race.

-- 
Mel Gorman
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 11:09                 ` Mel Gorman
@ 2022-08-23 12:18                   ` Michal Hocko
  2022-08-23 12:58                     ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2022-08-23 12:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
	Juergen Gross

On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> > > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> > >  #endif
> > >  	}
> > >  
> > > -	spin_unlock(&lock);
> > > +	write_sequnlock(&zonelist_update_seq);
> > >  }
> > >  
> > >  static noinline void __init
> > > 
> > 
> > LGTM. The "retry_cpuset" label might deserve a better name now.
> > 
> 
> Good point ...  "restart"?
> 
> > Would
> > 
> > Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> > with pages managed by the buddy allocator")
> > 
> > be correct?
> > 
> 
> Not specifically because the bug is due to a zone being completely removed
> resulting in a rebuild. This race probably existed ever since memory
> hotremove could theoretically remove a complete zone. A Cc: Stable would
> be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> that, it should be driven by a specific bug report showing that hot-remove
> of a full zone was possible and triggered the race.

I do not think so. 6aa303defb74 has changed the zonelist building and
changed the check from pfn range (populated) to managed (with a memory).
-- 
Michal Hocko
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 12:18                   ` Michal Hocko
@ 2022-08-23 12:58                     ` Mel Gorman
  2022-08-23 13:25                       ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2022-08-23 12:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
	Juergen Gross

On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
> On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> > On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> > > > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> > > >  #endif
> > > >  	}
> > > >  
> > > > -	spin_unlock(&lock);
> > > > +	write_sequnlock(&zonelist_update_seq);
> > > >  }
> > > >  
> > > >  static noinline void __init
> > > > 
> > > 
> > > LGTM. The "retry_cpuset" label might deserve a better name now.
> > > 
> > 
> > Good point ...  "restart"?
> > 
> > > Would
> > > 
> > > Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> > > with pages managed by the buddy allocator")
> > > 
> > > be correct?
> > > 
> > 
> > Not specifically because the bug is due to a zone being completely removed
> > resulting in a rebuild. This race probably existed ever since memory
> > hotremove could theoretically remove a complete zone. A Cc: Stable would
> > be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> > that, it should be driven by a specific bug report showing that hot-remove
> > of a full zone was possible and triggered the race.
> 
> I do not think so. 6aa303defb74 has changed the zonelist building and
> changed the check from pfn range (populated) to managed (with a memory).

I'm not 100% convinced. The present_pages should have been the spanned range
minus any holes that exist in the zone. If the zone is completely removed,
the span should be zero meaning present and managed are both zero. No? 

The patch *would* have made the situation worse because, if a zone was
hot-removed to the point where only reserved pages were present, then the
bug would still trigger. On that basis, I'm OK with adding the Fixes: tag, as
it's at least partially true and it covers the range of -stable kernels that
are trivial to backport. Beyond that, it would need greater care and testing.

-- 
Mel Gorman
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 12:58                     ` Mel Gorman
@ 2022-08-23 13:25                       ` Michal Hocko
  2022-08-23 13:50                         ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2022-08-23 13:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
	Juergen Gross

On Tue 23-08-22 13:58:50, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
> > On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> > > On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> > > > > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> > > > >  #endif
> > > > >  	}
> > > > >  
> > > > > -	spin_unlock(&lock);
> > > > > +	write_sequnlock(&zonelist_update_seq);
> > > > >  }
> > > > >  
> > > > >  static noinline void __init
> > > > > 
> > > > 
> > > > LGTM. The "retry_cpuset" label might deserve a better name now.
> > > > 
> > > 
> > > Good point ...  "restart"?
> > > 
> > > > Would
> > > > 
> > > > Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> > > > with pages managed by the buddy allocator")
> > > > 
> > > > be correct?
> > > > 
> > > 
> > > Not specifically because the bug is due to a zone being completely removed
> > > resulting in a rebuild. This race probably existed ever since memory
> > > hotremove could theoretically remove a complete zone. A Cc: Stable would
> > > be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> > > that, it should be driven by a specific bug report showing that hot-remove
> > > of a full zone was possible and triggered the race.
> > 
> > I do not think so. 6aa303defb74 has changed the zonelist building and
> > changed the check from pfn range (populated) to managed (with a memory).
> 
> I'm not 100% convinced. The present_pages should have been the spanned range
> minus any holes that exist in the zone. If the zone is completely removed,
> the span should be zero meaning present and managed are both zero. No? 

IIRC, and David will correct me if I am mixing this up. The difference
is that zonelists are rebuilt during memory offlining and that is when
managed pages are removed from the allocator. Zone itself still has that
physical range populated and so this patch would have made a difference.

Now, you are right that this is likely possible even without that commit
but it is highly unlikely because physical hotremove is a very rare
operation and the race window would be so large that it would be likely
unfeasible.

-- 
Michal Hocko
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 13:25                       ` Michal Hocko
@ 2022-08-23 13:50                         ` David Hildenbrand
  2022-08-23 13:57                           ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2022-08-23 13:50 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On 23.08.22 15:25, Michal Hocko wrote:
> On Tue 23-08-22 13:58:50, Mel Gorman wrote:
>> On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
>>> On Tue 23-08-22 12:09:46, Mel Gorman wrote:
>>>> On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
>>>>>> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
>>>>>>  #endif
>>>>>>  	}
>>>>>>  
>>>>>> -	spin_unlock(&lock);
>>>>>> +	write_sequnlock(&zonelist_update_seq);
>>>>>>  }
>>>>>>  
>>>>>>  static noinline void __init
>>>>>>
>>>>>
>>>>> LGTM. The "retry_cpuset" label might deserve a better name now.
>>>>>
>>>>
>>>> Good point ...  "restart"?
>>>>
>>>>> Would
>>>>>
>>>>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
>>>>> with pages managed by the buddy allocator")
>>>>>
>>>>> be correct?
>>>>>
>>>>
>>>> Not specifically because the bug is due to a zone being completely removed
>>>> resulting in a rebuild. This race probably existed ever since memory
>>>> hotremove could theoretically remove a complete zone. A Cc: Stable would
>>>> be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
>>>> that, it should be driven by a specific bug report showing that hot-remove
>>>> of a full zone was possible and triggered the race.
>>>
>>> I do not think so. 6aa303defb74 has changed the zonelist building and
>>> changed the check from pfn range (populated) to managed (with a memory).
>>
>> I'm not 100% convinced. The present_pages should have been the spanned range
>> minus any holes that exist in the zone. If the zone is completely removed,
>> the span should be zero meaning present and managed are both zero. No? 
> 
> IIRC, and David will correct me if I am mixing this up. The difference
> is that zonelists are rebuilt during memory offlining and that is when
> managed pages are removed from the allocator. Zone itself still has that
> physical range populated and so this patch would have made a difference.

To recap, memory offlining adjusts managed+present pages of the zone
essentially in one go. If after the adjustments, the zone is no longer
populated (present==0), we rebuild the zone lists.

Once that's done, we try shrinking the zone (start+spanned pages) --
which results in zone_start_pfn == 0 if there are no more pages. That
happens *after* rebuilding the zonelists via remove_pfn_range_from_zone().
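
In pseudocode, the tail of offline_pages() does roughly the following once
offlining has succeeded (simplified; the exact helpers and their signatures
differ between kernel versions):

	/* managed and present pages are dropped essentially together */
	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
	zone->present_pages -= nr_pages;

	if (!populated_zone(zone))
		build_all_zonelists(NULL);	/* zonelists rebuilt here */

	/* the zone span is only shrunk afterwards */
	remove_pfn_range_from_zone(zone, start_pfn, nr_pages);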


Note that populated_zone() checks for present_pages. The actual zone
span (e.g., spanned_pages) is a different story and not of interest when
building zones or wanting to allocate memory.
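
For reference, the two checks look roughly like this in include/linux/mmzone.h
(paraphrased; the details vary a bit between kernel versions):

	/* Returns true if a zone has pages managed by the buddy allocator. */
	static inline bool managed_zone(struct zone *zone)
	{
		return zone_managed_pages(zone);
	}

	/* Returns true if a zone has memory, even if none of it is managed. */
	static inline bool populated_zone(struct zone *zone)
	{
		return zone->present_pages;
	}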

> 
> Now, you are right that this is likely possible even without that commit
> but it is highly unlikely because physical hotremove is a very rare
> operation and the race window would be so large that it would be likely
> unfeasible.

I think I agree that 6aa303defb74 is most likely not the origin of this.
It could only have been the origin in weird corner cases where we
actually succeed offlining one memory block (adjust present+managed) and
end up with managed=0 and present!=0 -- which barely happens in
practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
that adjusts managed pages dynamically and might provoke such a
situation on ZONE_MOVABLE)

-- 
Thanks,

David / dhildenb



* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 13:50                         ` David Hildenbrand
@ 2022-08-23 13:57                           ` Michal Hocko
  2022-08-23 15:14                             ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2022-08-23 13:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mel Gorman, Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On Tue 23-08-22 15:50:23, David Hildenbrand wrote:
> On 23.08.22 15:25, Michal Hocko wrote:
> > On Tue 23-08-22 13:58:50, Mel Gorman wrote:
> >> On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
> >>> On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> >>>> On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> >>>>>> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> >>>>>>  #endif
> >>>>>>  	}
> >>>>>>  
> >>>>>> -	spin_unlock(&lock);
> >>>>>> +	write_sequnlock(&zonelist_update_seq);
> >>>>>>  }
> >>>>>>  
> >>>>>>  static noinline void __init
> >>>>>>
> >>>>>
> >>>>> LGTM. The "retry_cpuset" label might deserve a better name now.
> >>>>>
> >>>>
> >>>> Good point ...  "restart"?
> >>>>
> >>>>> Would
> >>>>>
> >>>>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> >>>>> with pages managed by the buddy allocator")
> >>>>>
> >>>>> be correct?
> >>>>>
> >>>>
> >>>> Not specifically because the bug is due to a zone being completely removed
> >>>> resulting in a rebuild. This race probably existed ever since memory
> >>>> hotremove could theoretically remove a complete zone. A Cc: Stable would
> >>>> be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> >>>> that, it should be driven by a specific bug report showing that hot-remove
> >>>> of a full zone was possible and triggered the race.
> >>>
> >>> I do not think so. 6aa303defb74 has changed the zonelist building and
> >>> changed the check from pfn range (populated) to managed (with a memory).
> >>
> >> I'm not 100% convinced. The present_pages should have been the spanned range
> >> minus any holes that exist in the zone. If the zone is completely removed,
> >> the span should be zero meaning present and managed are both zero. No? 
> > 
> > IIRC, and David will correct me if I am mixing this up. The difference
> > is that zonelists are rebuilt during memory offlining and that is when
> > managed pages are removed from the allocator. Zone itself still has that
> > physical range populated and so this patch would have made a difference.
> 
> To recap, memory offlining adjusts managed+present pages of the zone
> essentially in one go. If after the adjustments, the zone is no longer
> populated (present==0), we rebuild the zone lists.
> 
> Once that's done, we try shrinking the zone (start+spanned pages) --
> which results in zone_start_pfn == 0 if there are no more pages. That
> happens *after* rebuilding the zonelists via remove_pfn_range_from_zone().
> 
> 
> Note that populated_zone() checks for present_pages. The actual zone
> span (e.g., spanned_pages) is a different story and not of interest when
> building zones or wanting to allocate memory.
> 
> > 
> > Now, you are right that this is likely possible even without that commit
> > but it is highly unlikely because physical hotremove is a very rare
> > operation and the race window would be so large that it would be likely
> > unfeasible.
> 
> I think I agree that 6aa303defb74 is most likely not the origin of this.
> It could only have been the origin in weird corner cases where we
> actually succeed offlining one memory block (adjust present+managed) and
> end up with managed=0 and present!=0 -- which barely happens in
> practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
> that adjusts managed pages dynamically and might provoke such a
> situation on ZONE_MOVABLE)

OK, thanks for the correction David. Then I would agree that Fixes tag
could be more confusing than helpful and your above summary would be a
great part of the changelog.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 13:57                           ` Michal Hocko
@ 2022-08-23 15:14                             ` Mel Gorman
  2022-08-23 15:38                               ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2022-08-23 15:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
	Juergen Gross

On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
> > I think I agree that 6aa303defb74 is most likely not the origin of this.
> > It could only have been the origin in weird corner cases where we
> > actually succeed offlining one memory block (adjust present+managed) and
> > end up with managed=0 and present!=0 -- which barely happens in
> > practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
> > that adjusts managed pages dynamically and might provoke such a
> > situation on ZONE_MOVABLE)
> 
> OK, thanks for the correction David. Then I would agree that Fixes tag
> could be more confusing than helpful and your above summary would be a
> great part of the changelog.
> 

Given that 6aa303defb74 still makes it potentially worse, it's as good a
Fixes-by as any given that anything prior to that commit would need careful
examination. The race changes shape going further back in time until memory
hot-remove was initially added and if someone needs to go that far back,
they'll also need to check if the ZLC needs special treatment.

Provisional patch and changelog is below. I'd still like to get a Tested-by
from Patrick to confirm it still fixes the problem before posting formally.

--8<--
mm/page_alloc: Fix race condition between build_all_zonelists and page allocation

Patrick Daly reported the following problem;

	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
	[0] - ZONE_MOVABLE
	[1] - ZONE_NORMAL
	[2] - NULL

	For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
	offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
	memory_offline operation removes the last page from ZONE_MOVABLE,
	build_all_zonelists() & build_zonerefs_node() will update
	node_zonelists as shown below. Only populated zones are added.

	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
	[0] - ZONE_NORMAL
	[1] - NULL
	[2] - NULL

The race is simple -- page allocation could be in progress when a memory
hot-remove operation triggers a zonelist rebuild that removes zones.
The allocation request will still have a valid ac->preferred_zoneref that
is now pointing to NULL and triggers an OOM kill.

This problem probably always existed but may be slighly easier to trigger
due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
with pages managed by the buddy allocator") which distinguishes between
zones that are completely unpopulated versus zones that have valid pages
but they are all reserved. Memory hotplug had multiple stages with
timing considerations around managed/present page updates, the zonelist
rebuild and the zone span updates. As David Hildenbrand puts it

	memory offlining adjusts managed+present pages of the zone
	essentially in one go. If after the adjustments, the zone is no
	longer populated (present==0), we rebuild the zone lists.

	Once that's done, we try shrinking the zone (start+spanned
	pages) -- which results in zone_start_pfn == 0 if there are no
	more pages. That happens *after* rebuilding the zonelists via
	remove_pfn_range_from_zone().

The only requirement to fix the race is that a page allocation request
identifies when a zonelist rebuild has happened since the allocation
request started and no page has yet been allocated. Use a seqlock_t to track
zonelist updates with a lockless read-side of the zonelist and protecting
the rebuild and update of the counter with a spinlock.

Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
---
 mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..216e21048ddf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
 
+/*
+ * Zonelists may change due to hotplug during allocation. Detect when zonelists
+ * have been rebuilt so allocation retries. Reader side does not lock and
+ * retries the allocation if zonelist changes. Writer side is protected by the
+ * embedded spin_lock.
+ */
+DEFINE_SEQLOCK(zonelist_update_seq);
+
+static unsigned int zonelist_iter_begin(void)
+{
+	return read_seqbegin(&zonelist_update_seq);
+}
+
+static unsigned int check_retry_zonelist(unsigned int seq)
+{
+	return read_seqretry(&zonelist_update_seq, seq);
+}
+
 /* Perform direct synchronous page reclaim */
 static unsigned long
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int compaction_retries;
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
+	unsigned int zonelist_iter_cookie;
 	int reserve_flags;
 
 	/*
@@ -5011,11 +5030,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
 
-retry_cpuset:
+restart:
 	compaction_retries = 0;
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
+	zonelist_iter_cookie = zonelist_iter_begin();
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -5187,9 +5207,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry;
 
 
-	/* Deal with possible cpuset update races before we start OOM killing */
-	if (check_retry_cpuset(cpuset_mems_cookie, ac))
-		goto retry_cpuset;
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * an unnecessary OOM kill.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
+		goto restart;
 
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
@@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
-	static DEFINE_SPINLOCK(lock);
 
-	spin_lock(&lock);
+	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	spin_unlock(&lock);
+	write_sequnlock(&zonelist_update_seq);
 }
 
 static noinline void __init


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 15:14                             ` Mel Gorman
@ 2022-08-23 15:38                               ` Michal Hocko
  2022-08-23 15:51                                 ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2022-08-23 15:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
	Juergen Gross

On Tue 23-08-22 16:14:15, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
> > > I think I agree that 6aa303defb74 is most likely not the origin of this.
> > > It could only have been the origin in weird corner cases where we
> > > actually succeed offlining one memory block (adjust present+managed) and
> > > end up with managed=0 and present!=0 -- which barely happens in
> > > practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
> > > that adjusts managed pages dynamically and might provoke such a
> > > situation on ZONE_MOVABLE)
> > 
> > OK, thanks for the correction David. Then I would agree that Fixes tag
> > could be more confusing than helpful and your above summary would be a
> > great part of the changelog.
> > 
> 
> Given that 6aa303defb74 still makes it potentially worse, it's as good a
> Fixes-by as any given that anything prior to that commit would need careful
> examination. The race changes shape going further back in time until memory
> hot-remove was initially added and if someone needs to go that far back,
> they'll also need to check if the ZLC needs special treatment.
> 
> Provisional patch and changelog is below. I'd still like to get a Tested-by
> from Patrick to confirm it still fixes the problem before posting formally.
> 
> --8<--
> mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
> 
> Patrick Daly reported the following problem;
> 
> 	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
> 	[0] - ZONE_MOVABLE
> 	[1] - ZONE_NORMAL
> 	[2] - NULL
> 
> 	For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
> 	offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
> 	memory_offline operation removes the last page from ZONE_MOVABLE,
> 	build_all_zonelists() & build_zonerefs_node() will update
> 	node_zonelists as shown below. Only populated zones are added.
> 
> 	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
> 	[0] - ZONE_NORMAL
> 	[1] - NULL
> 	[2] - NULL
> 
> The race is simple -- page allocation could be in progress when a memory
> hot-remove operation triggers a zonelist rebuild that removes zones.
> The allocation request will still have a valid ac->preferred_zoneref that
> is now pointing to NULL and triggers an OOM kill.
> 
> This problem probably always existed but may be slighly easier to trigger
> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> with pages managed by the buddy allocator") which distinguishes between
> zones that are completely unpopulated versus zones that have valid pages
> but they are all reserved. Memory hotplug had multiple stages with
> timing considerations around managed/present page updates, the zonelist
> rebuild and the zone span updates. As David Hildenbrand puts it
> 
> 	memory offlining adjusts managed+present pages of the zone
> 	essentially in one go. If after the adjustments, the zone is no
> 	longer populated (present==0), we rebuild the zone lists.
> 
> 	Once that's done, we try shrinking the zone (start+spanned
> 	pages) -- which results in zone_start_pfn == 0 if there are no
> 	more pages. That happens *after* rebuilding the zonelists via
> 	remove_pfn_range_from_zone().
> 
> The only requirement to fix the race is that a page allocation request
> identifies when a zonelist rebuild has happened since the allocation
> request started and no page has yet been allocated. Use a seqlock_t to track
> zonelist updates with a lockless read-side of the zonelist and protecting
> the rebuild and update of the counter with a spinlock.
> 
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
> Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
>  1 file changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..216e21048ddf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(fs_reclaim_release);
>  #endif
>  
> +/*
> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> + * have been rebuilt so allocation retries. Reader side does not lock and
> + * retries the allocation if zonelist changes. Writer side is protected by the
> + * embedded spin_lock.
> + */
> +DEFINE_SEQLOCK(zonelist_update_seq);
> +
> +static unsigned int zonelist_iter_begin(void)
> +{

You likely want something like
	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
		return read_seqbegin(&zonelist_update_seq);
	return 0;

> +	return read_seqbegin(&zonelist_update_seq);
> +}
> +
> +static unsigned int check_retry_zonelist(unsigned int seq)
> +{
	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
		return read_seqretry(&zonelist_update_seq, seq);
	return seq;

> +	return read_seqretry(&zonelist_update_seq, seq);
> +}
> +

to avoid overhead on systems without HOTREMOVE configured.
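
Folded together, the two helpers would then look something like this (just a
sketch of the suggestion above, not the final patch):

	static unsigned int zonelist_iter_begin(void)
	{
		if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
			return read_seqbegin(&zonelist_update_seq);
		return 0;
	}

	static unsigned int check_retry_zonelist(unsigned int seq)
	{
		if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
			return read_seqretry(&zonelist_update_seq, seq);
		return seq;
	}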

Other than that LGTM.
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

>  /* Perform direct synchronous page reclaim */
>  static unsigned long
>  __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int compaction_retries;
>  	int no_progress_loops;
>  	unsigned int cpuset_mems_cookie;
> +	unsigned int zonelist_iter_cookie;
>  	int reserve_flags;
>  
>  	/*
> @@ -5011,11 +5030,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>  		gfp_mask &= ~__GFP_ATOMIC;
>  
> -retry_cpuset:
> +restart:
>  	compaction_retries = 0;
>  	no_progress_loops = 0;
>  	compact_priority = DEF_COMPACT_PRIORITY;
>  	cpuset_mems_cookie = read_mems_allowed_begin();
> +	zonelist_iter_cookie = zonelist_iter_begin();
>  
>  	/*
>  	 * The fast path uses conservative alloc_flags to succeed only until
> @@ -5187,9 +5207,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto retry;
>  
>  
> -	/* Deal with possible cpuset update races before we start OOM killing */
> -	if (check_retry_cpuset(cpuset_mems_cookie, ac))
> -		goto retry_cpuset;
> +	/*
> +	 * Deal with possible cpuset update races or zonelist updates to avoid
> +	 * an unnecessary OOM kill.
> +	 */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> +	    check_retry_zonelist(zonelist_iter_cookie))
> +		goto restart;
>  
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> @@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
>  	int nid;
>  	int __maybe_unused cpu;
>  	pg_data_t *self = data;
> -	static DEFINE_SPINLOCK(lock);
>  
> -	spin_lock(&lock);
> +	write_seqlock(&zonelist_update_seq);
>  
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
>  
> -	spin_unlock(&lock);
> +	write_sequnlock(&zonelist_update_seq);
>  }
>  
>  static noinline void __init

-- 
Michal Hocko
SUSE Labs


* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 15:38                               ` Michal Hocko
@ 2022-08-23 15:51                                 ` David Hildenbrand
  2022-08-24  9:45                                   ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2022-08-23 15:51 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On 23.08.22 17:38, Michal Hocko wrote:
> On Tue 23-08-22 16:14:15, Mel Gorman wrote:
>> On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
>>>> I think I agree that 6aa303defb74 is most likely not the origin of this.
>>>> It could only have been the origin in weird corner cases where we
>>>> actually succeed offlining one memory block (adjust present+managed) and
>>>> end up with managed=0 and present!=0 -- which barely happens in
>>>> practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
>>>> that adjusts managed pages dynamically and might provoke such a
>>>> situation on ZONE_MOVABLE)
>>>
>>> OK, thanks for the correction David. Then I would agree that Fixes tag
>>> could be more confusing than helpful and your above summary would be a
>>> great part of the changelog.
>>>
>>
>> Given that 6aa303defb74 still makes it potentially worse, it's as good a
>> Fixes-by as any given that anything prior to that commit would need careful
>> examination. The race changes shape going further back in time until memory
>> hot-remove was initially added and if someone needs to go that far back,
>> they'll also need to check if the ZLC needs special treatment.
>>
>> Provisional patch and changelog is below. I'd still like to get a Tested-by
>> from Patrick to confirm it still fixes the problem before posting formally.
>>
>> --8<--
>> mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
>>
>> Patrick Daly reported the following problem;
>>
>> 	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
>> 	[0] - ZONE_MOVABLE
>> 	[1] - ZONE_NORMAL
>> 	[2] - NULL
>>
>> 	For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
>> 	offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
>> 	memory_offline operation removes the last page from ZONE_MOVABLE,
>> 	build_all_zonelists() & build_zonerefs_node() will update
>> 	node_zonelists as shown below. Only populated zones are added.
>>
>> 	NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
>> 	[0] - ZONE_NORMAL
>> 	[1] - NULL
>> 	[2] - NULL
>>
>> The race is simple -- page allocation could be in progress when a memory
>> hot-remove operation triggers a zonelist rebuild that removes zones.
>> The allocation request will still have a valid ac->preferred_zoneref that
>> is now pointing to NULL and triggers an OOM kill.
>>
>> This problem probably always existed but may be slighly easier to trigger

s/slighly/slightly/

>> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
>> with pages managed by the buddy allocator") which distinguishes between
>> zones that are completely unpopulated versus zones that have valid pages
>> but they are all reserved. Memory hotplug had multiple stages with

Not necessarily reserved, simply not managed by the buddy (e.g., early
allocations, memory ballooning / virtio-mem).

>> timing considerations around managed/present page updates, the zonelist
>> rebuild and the zone span updates. As David Hildenbrand puts it
>>
>> 	memory offlining adjusts managed+present pages of the zone
>> 	essentially in one go. If after the adjustments, the zone is no
>> 	longer populated (present==0), we rebuild the zone lists.
>>
>> 	Once that's done, we try shrinking the zone (start+spanned
>> 	pages) -- which results in zone_start_pfn == 0 if there are no
>> 	more pages. That happens *after* rebuilding the zonelists via
>> 	remove_pfn_range_from_zone().
>>
>> The only requirement to fix the race is that a page allocation request
>> identifies when a zonelist rebuild has happened since the allocation
>> request started and no page has yet been allocated. Use a seqlock_t to track
>> zonelist updates with a lockless read-side of the zonelist and protecting
>> the rebuild and update of the counter with a spinlock.
>>
>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
>> Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> Cc: <stable@vger.kernel.org>
>> ---
>>  mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
>>  1 file changed, 30 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e5486d47406e..216e21048ddf 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
>>  EXPORT_SYMBOL_GPL(fs_reclaim_release);
>>  #endif
>>  
>> +/*
>> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
>> + * have been rebuilt so allocation retries. Reader side does not lock and
>> + * retries the allocation if zonelist changes. Writer side is protected by the
>> + * embedded spin_lock.
>> + */
>> +DEFINE_SEQLOCK(zonelist_update_seq);
>> +
>> +static unsigned int zonelist_iter_begin(void)
>> +{
> 
> You likely want something like
> 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> 		return read_seqbegin(&zonelist_update_seq);
> 	return 0;
> 
>> +	return read_seqbegin(&zonelist_update_seq);
>> +}
>> +
>> +static unsigned int check_retry_zonelist(unsigned int seq)
>> +{
> 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> 		return read_seqretry(&zonelist_update_seq, seq);
> 	return seq;
> 
>> +	return read_seqretry(&zonelist_update_seq, seq);
>> +}
>> +
> 
> to avoid overhead on systems without HOTREMOVE configured.
> 
> Other than that LGTM.
> Acked-by: Michal Hocko <mhocko@suse.com>


Makes sense to me, although I wonder how much it will matter in practice.

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: Race condition in build_all_zonelists() when offlining movable zone
  2022-08-23 15:51                                 ` David Hildenbrand
@ 2022-08-24  9:45                                   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2022-08-24  9:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross

On Tue, Aug 23, 2022 at 05:51:25PM +0200, David Hildenbrand wrote:
> >> The race is simple -- page allocation could be in progress when a memory
> >> hot-remove operation triggers a zonelist rebuild that removes zones.
> >> The allocation request will still have a valid ac->preferred_zoneref that
> >> is now pointing to NULL and triggers an OOM kill.
> >>
> >> This problem probably always existed but may be slighly easier to trigger
> 
> s/slighly/slightly/
> 

Fixed.

> >> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> >> with pages managed by the buddy allocator") which distinguishes between
> >> zones that are completely unpopulated versus zones that have valid pages
> >> but they are all reserved. Memory hotplug had multiple stages with
> 
> Not necessarily reserved, simply not managed by the buddy (e.g., early
> allocations, memory ballooning / virtio-mem).
> 

Fair point; I filed all that under "reserved", but you're right, this is
clearer.

> >> +/*
> >> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> >> + * have been rebuilt so allocation retries. Reader side does not lock and
> >> + * retries the allocation if zonelist changes. Writer side is protected by the
> >> + * embedded spin_lock.
> >> + */
> >> +DEFINE_SEQLOCK(zonelist_update_seq);
> >> +
> >> +static unsigned int zonelist_iter_begin(void)
> >> +{
> > 
> > You likely want something like
> > 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> > 		return read_seqbegin(&zonelist_update_seq);
> > 	return 0;
> > 
> >> +	return read_seqbegin(&zonelist_update_seq);
> >> +}
> >> +
> >> +static unsigned int check_retry_zonelist(unsigned int seq)
> >> +{
> > 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> > 		return read_seqretry(&zonelist_update_seq, seq);
> > 	return seq;
> > 
> >> +	return read_seqretry(&zonelist_update_seq, seq);
> >> +}
> >> +
> > 
> > to avoid overhead on systems without HOTREMOVE configured.
> > 
> > Other than that LGTM.
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 

It's a minor saving but fair enough. HOTREMOVE is now the only reason
zonelists can be rebuilt. While that was not always true, if it ever
changes again, it's a simple fix.

Thanks Michal.

> Makes sense to me, although I wonder how much it will matter in practice.
> 

Probably none at all as it's one branch but it's still valid.

> Reviewed-by: David Hildenbrand <david@redhat.com>
> 

Thanks David.

-- 
Mel Gorman
SUSE Labs


Thread overview: 24+ messages
2022-08-17  3:42 Race condition in build_all_zonelists() when offlining movable zone Patrick Daly
2022-08-17  6:38 ` Michal Hocko
2022-08-17  6:59   ` Michal Hocko
2022-08-17 10:40     ` Mel Gorman
2022-08-17 10:54       ` Michal Hocko
2022-08-17 11:26         ` Mel Gorman
2022-08-19  2:11           ` Patrick Daly
2022-08-22 20:18           ` Patrick Daly
2022-08-23  6:36       ` David Hildenbrand
2022-08-23  8:33         ` Mel Gorman
2022-08-23  8:52           ` David Hildenbrand
2022-08-23  9:49             ` Mel Gorman
2022-08-23 10:34               ` David Hildenbrand
2022-08-23 10:57                 ` Michal Hocko
2022-08-23 11:09                 ` Mel Gorman
2022-08-23 12:18                   ` Michal Hocko
2022-08-23 12:58                     ` Mel Gorman
2022-08-23 13:25                       ` Michal Hocko
2022-08-23 13:50                         ` David Hildenbrand
2022-08-23 13:57                           ` Michal Hocko
2022-08-23 15:14                             ` Mel Gorman
2022-08-23 15:38                               ` Michal Hocko
2022-08-23 15:51                                 ` David Hildenbrand
2022-08-24  9:45                                   ` Mel Gorman
