LKML Archive on lore.kernel.org
 help / Atom feed
* [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
@ 2018-05-25 13:08 Vlastimil Babka
  2018-05-25 19:43 ` Andrew Morton
  2018-05-28  8:49 ` Mel Gorman
  0 siblings, 2 replies; 7+ messages in thread
From: Vlastimil Babka @ 2018-05-25 13:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Mel Gorman, Michal Hocko, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, stable

In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
allocations that can ignore memory policies. The zonelist is obtained from
current CPU's node. This is a problem for __GFP_THISNODE allocations that want
to allocate on a different node, e.g. because the allocating thread has been
migrated to a different CPU.

This has been observed to break SLAB in our 4.4-based kernel, because there it
relies on __GFP_THISNODE working as intended. If a slab page is put on wrong
node's list, then further list manipulations may corrupt the list because
page_to_nid() is used to determine which node's list_lock should be locked and
thus we may take a wrong lock and race.

Current SLAB implementation seems to be immune by luck thanks to commit
511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
arbitrary node") but there may be others assuming that __GFP_THISNODE works as
promised.

We can fix it by simply removing the zonelist reset completely. There is
actually no reason to reset it, because memory policies and cpusets don't
affect the zonelist choice in the first place. This was different when commit
183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
introduced the code, as mempolicies provided their own restricted zonelists.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Cc: <stable@vger.kernel.org>
---
Hi,

we might consider this for 4.17 although I don't know if there's anything
currently broken. Stable backports should be more important, but will have to
be reviewed carefully, as the code went through many changes.
BTW I think that also the ac->preferred_zoneref reset is currently useless if
we don't also reset ac->nodemask from a mempolicy to NULL first (which we
probably should for the OOM victims etc?), but I would leave that for a
separate patch.

Vlastimil

 mm/page_alloc.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905db9d7962f..be0f0b5d3935 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4165,7 +4165,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * orientated.
 	 */
 	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
-		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);
 	}
-- 
2.17.0

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
  2018-05-25 13:08 [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset Vlastimil Babka
@ 2018-05-25 19:43 ` Andrew Morton
  2018-05-25 20:48   ` Vlastimil Babka
  2018-05-28  7:21   ` Michal Hocko
  2018-05-28  8:49 ` Mel Gorman
  1 sibling, 2 replies; 7+ messages in thread
From: Andrew Morton @ 2018-05-25 19:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Mel Gorman, Michal Hocko, David Rientjes,
	Joonsoo Kim, stable

On Fri, 25 May 2018 15:08:53 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> we might consider this for 4.17 although I don't know if there's anything
> currently broken. Stable backports should be more important, but will have to
> be reviewed carefully, as the code went through many changes.
> BTW I think that also the ac->preferred_zoneref reset is currently useless if
> we don't also reset ac->nodemask from a mempolicy to NULL first (which we
> probably should for the OOM victims etc?), but I would leave that for a
> separate patch.

Confused.  If nothing is currently broken then why is a backport
needed?  Presumably because we expect breakage in the future?  Can you
expand on this?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
  2018-05-25 19:43 ` Andrew Morton
@ 2018-05-25 20:48   ` Vlastimil Babka
  2018-05-28  9:55     ` Vlastimil Babka
  2018-05-28  7:21   ` Michal Hocko
  1 sibling, 1 reply; 7+ messages in thread
From: Vlastimil Babka @ 2018-05-25 20:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Mel Gorman, Michal Hocko, David Rientjes,
	Joonsoo Kim, stable

On 05/25/2018 09:43 PM, Andrew Morton wrote:
> On Fri, 25 May 2018 15:08:53 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> we might consider this for 4.17 although I don't know if there's anything
>> currently broken. Stable backports should be more important, but will have to
>> be reviewed carefully, as the code went through many changes.
>> BTW I think that also the ac->preferred_zoneref reset is currently useless if
>> we don't also reset ac->nodemask from a mempolicy to NULL first (which we
>> probably should for the OOM victims etc?), but I would leave that for a
>> separate patch.
> 
> Confused.  If nothing is currently broken then why is a backport
> needed?  Presumably because we expect breakage in the future?  Can you
> expand on this?

I mean that SLAB is currently not affected, but in older kernels than
4.7 that don't yet have 511e3a058812 ("mm/slab: make cache_grow() handle
the page allocated on arbitrary node") it is. That's at least 4.4 LTS.
Older ones I'll have to check.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
  2018-05-25 19:43 ` Andrew Morton
  2018-05-25 20:48   ` Vlastimil Babka
@ 2018-05-28  7:21   ` Michal Hocko
  2018-05-28 10:00     ` Vlastimil Babka
  1 sibling, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2018-05-28  7:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, linux-mm, linux-kernel, Mel Gorman,
	David Rientjes, Joonsoo Kim, stable

On Fri 25-05-18 12:43:00, Andrew Morton wrote:
> On Fri, 25 May 2018 15:08:53 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> > we might consider this for 4.17 although I don't know if there's anything
> > currently broken. Stable backports should be more important, but will have to
> > be reviewed carefully, as the code went through many changes.
> > BTW I think that also the ac->preferred_zoneref reset is currently useless if
> > we don't also reset ac->nodemask from a mempolicy to NULL first (which we
> > probably should for the OOM victims etc?), but I would leave that for a
> > separate patch.
> 
> Confused.  If nothing is currently broken then why is a backport
> needed?  Presumably because we expect breakage in the future?  Can you
> expand on this?

__GFP_THISNODE is documented to _use_ the given node. Allocating from a
different one is a bug. Maybe the current code can cope with that or at
least doesn't blow up in an obvious way but the bug is still there.

I am still not sure what to do about the zonelist reset. It still seems
like an echo from the past but using numa_node_id for __GFP_THISNODE is
a clear bug because our task could have been migrated to a cpu on a
different than requested node.

Acked-by: Michal Hocko <mhocko@suse.com>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
  2018-05-25 13:08 [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset Vlastimil Babka
  2018-05-25 19:43 ` Andrew Morton
@ 2018-05-28  8:49 ` Mel Gorman
  1 sibling, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2018-05-28  8:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko,
	David Rientjes, Joonsoo Kim, stable

On Fri, May 25, 2018 at 03:08:53PM +0200, Vlastimil Babka wrote:
> In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
> allocations that can ignore memory policies. The zonelist is obtained from
> current CPU's node. This is a problem for __GFP_THISNODE allocations that want
> to allocate on a different node, e.g. because the allocating thread has been
> migrated to a different CPU.
> 
> This has been observed to break SLAB in our 4.4-based kernel, because there it
> relies on __GFP_THISNODE working as intended. If a slab page is put on wrong
> node's list, then further list manipulations may corrupt the list because
> page_to_nid() is used to determine which node's list_lock should be locked and
> thus we may take a wrong lock and race.
> 
> Current SLAB implementation seems to be immune by luck thanks to commit
> 511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
> arbitrary node") but there may be others assuming that __GFP_THISNODE works as
> promised.
> 
> We can fix it by simply removing the zonelist reset completely. There is
> actually no reason to reset it, because memory policies and cpusets don't
> affect the zonelist choice in the first place. This was different when commit
> 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
> introduced the code, as mempolicies provided their own restricted zonelists.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
  2018-05-25 20:48   ` Vlastimil Babka
@ 2018-05-28  9:55     ` Vlastimil Babka
  0 siblings, 0 replies; 7+ messages in thread
From: Vlastimil Babka @ 2018-05-28  9:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Mel Gorman, Michal Hocko, David Rientjes,
	Joonsoo Kim, stable

On 05/25/2018 10:48 PM, Vlastimil Babka wrote:
> On 05/25/2018 09:43 PM, Andrew Morton wrote:
>> On Fri, 25 May 2018 15:08:53 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>>> we might consider this for 4.17 although I don't know if there's anything
>>> currently broken. Stable backports should be more important, but will have to
>>> be reviewed carefully, as the code went through many changes.
>>> BTW I think that also the ac->preferred_zoneref reset is currently useless if
>>> we don't also reset ac->nodemask from a mempolicy to NULL first (which we
>>> probably should for the OOM victims etc?), but I would leave that for a
>>> separate patch.
>>
>> Confused.  If nothing is currently broken then why is a backport
>> needed?  Presumably because we expect breakage in the future?  Can you
>> expand on this?
> 
> I mean that SLAB is currently not affected, but in older kernels than
> 4.7 that don't yet have 511e3a058812 ("mm/slab: make cache_grow() handle
> the page allocated on arbitrary node") it is. That's at least 4.4 LTS.
> Older ones I'll have to check.

So I've checked the non-EOL LTS's at kernel.org and:

4.16, 4.14, 4.9 - same as mainline (__GFP_THISNODE broken, but SLAB is OK)
4.4, 4.1, 3.16 - SLAB potentially broken if it makes an
ALLOC_NO_WATERMARKS allocation (our 4.4 kernel has backports that extend
it to also !ALLOC_CPUSET so it's more likely).

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
  2018-05-28  7:21   ` Michal Hocko
@ 2018-05-28 10:00     ` Vlastimil Babka
  0 siblings, 0 replies; 7+ messages in thread
From: Vlastimil Babka @ 2018-05-28 10:00 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: linux-mm, linux-kernel, Mel Gorman, David Rientjes, Joonsoo Kim, stable

On 05/28/2018 09:21 AM, Michal Hocko wrote:
> On Fri 25-05-18 12:43:00, Andrew Morton wrote:
>> On Fri, 25 May 2018 15:08:53 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>>> we might consider this for 4.17 although I don't know if there's anything
>>> currently broken. Stable backports should be more important, but will have to
>>> be reviewed carefully, as the code went through many changes.
>>> BTW I think that also the ac->preferred_zoneref reset is currently useless if
>>> we don't also reset ac->nodemask from a mempolicy to NULL first (which we
>>> probably should for the OOM victims etc?), but I would leave that for a
>>> separate patch.
>>
>> Confused.  If nothing is currently broken then why is a backport
>> needed?  Presumably because we expect breakage in the future?  Can you
>> expand on this?
> 
> __GFP_THISNODE is documented to _use_ the given node. Allocating from a
> different one is a bug. Maybe the current code can cope with that or at
> least doesn't blow up in an obvious way but the bug is still there.
> 
> I am still not sure what to do about the zonelist reset. It still seems
> like an echo from the past

Hmm actually it seems that even at the time of commit 183f6371aac2
introduced the reset, the per-policy zonelists for MPOL_BIND policies
were gone for years. Mempolicy only affects which node's zonelist is
used, but that always contains all the nodes (unless __GFP_THISNODE) so
there's no reason to get another node's zonelist to escape mempolicy
restrictions.

Mempolicy restrictions are given as nodemask, so if we want to ignore
them for OOM victims etc, we have to reset nodemask instead. But again
we have to be careful in case the nodemask doesn't come from mempolicy,
but from somebody who might be broken if we ignore it.

> but using numa_node_id for __GFP_THISNODE is
> a clear bug because our task could have been migrated to a cpu on a
> different than requested node.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, back to index

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-25 13:08 [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset Vlastimil Babka
2018-05-25 19:43 ` Andrew Morton
2018-05-25 20:48   ` Vlastimil Babka
2018-05-28  9:55     ` Vlastimil Babka
2018-05-28  7:21   ` Michal Hocko
2018-05-28 10:00     ` Vlastimil Babka
2018-05-28  8:49 ` Mel Gorman

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox