linux-mm.kvack.org archive mirror
* [PATCH v3] mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task
@ 2021-08-15  6:10 yanghui
  2021-08-17  0:59 ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: yanghui @ 2021-08-15  6:10 UTC (permalink / raw)
  To: akpm, willy, songmuchun; +Cc: linux-mm, linux-kernel, yanghui

Servers running kernel 5.4.56 hit the following panic:
BUG: unable to handle page fault for address: 0000000000002c48
RIP: 0010:__next_zones_zonelist+0x1d/0x40
[264003.977696] RAX: 0000000000002c40 RBX: 0000000000100dca RCX: 0000000000000014
[264003.977872] Call Trace:
[264003.977888]  __alloc_pages_nodemask+0x277/0x310
[264003.977908]  alloc_page_interleave+0x13/0x70
[264003.977926]  handle_mm_fault+0xf99/0x1390
[264003.977951]  __do_page_fault+0x288/0x500
[264003.977979]  ? schedule+0x39/0xa0
[264003.977994]  do_page_fault+0x30/0x110
[264003.978010]  page_fault+0x3e/0x50

The panic happens because MAX_NUMNODES is passed as the third parameter
(preferred_nid) to __alloc_pages_nodemask(). Dereferencing
zonelist->zoneref->zone_idx in __next_zones_zonelist() then faults.

In offset_il_node(), first_node() returns a nid from pol->v.nodes; after
that, other threads may change pol->v.nodes before next_node() is called.
This race can make next_node() return MAX_NUMNODES, so copy pol->nodes
into a local variable.

The race condition is between offset_il_node and cpuset_change_task_nodemask:
CPU0:                                     CPU1:
alloc_pages_vma()
  interleave_nid(pol,)
    offset_il_node(pol,)
      first_node(pol->v.nodes)            cpuset_change_task_nodemask
                      //nodes==0xc          mpol_rebind_task
                                              mpol_rebind_policy
                                                mpol_rebind_nodemask(pol,nodes)
                      //nodes==0x3
      next_node(nid, pol->v.nodes)//return MAX_NUMNODES
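
For reference, a minimal sketch of the failure (based on the usual
nodemask.h semantics, not the literal implementation): next_node() is
essentially a find_next_bit() over the policy's node bitmap and returns
MAX_NUMNODES when no further bit is set, so a concurrent shrink of the
mask makes the walk run off the end:

  nid = first_node(pol->v.nodes);     /* mask 0xc -> nid = 2 */
  /* CPU1 rebinds the policy, mask becomes 0x3 */
  nid = next_node(nid, pol->v.nodes); /* no bit set above 2 -> MAX_NUMNODES */
  /* MAX_NUMNODES is then used as preferred_nid -> bogus zoneref access */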

Signed-off-by: yanghui <yanghui.def@bytedance.com>
---
Changes in v2:
	1. Drop WRITE_ONCE()/READ_ONCE(), which cannot handle data wider than sizeof(long).
Changes in v3:
	1. Fix some incorrect comments.

 mm/mempolicy.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e32360e90274..54f6eaff18c5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1965,17 +1965,26 @@ unsigned int mempolicy_slab_node(void)
  */
 static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 {
-	unsigned nnodes = nodes_weight(pol->nodes);
-	unsigned target;
+	nodemask_t nodemask = pol->nodes;
+	unsigned int target, nnodes;
 	int i;
 	int nid;
+	/*
+	 * The barrier will stabilize the nodemask in a register or on
+	 * the stack so that it will stop changing under the code.
+	 *
+	 * Between first_node() and next_node(), pol->nodes could be changed
+	 * by other threads. So we put pol->nodes in a local stack.
+	 */
+	barrier();
 
+	nnodes = nodes_weight(nodemask);
 	if (!nnodes)
 		return numa_node_id();
 	target = (unsigned int)n % nnodes;
-	nid = first_node(pol->nodes);
+	nid = first_node(nodemask);
 	for (i = 0; i < target; i++)
-		nid = next_node(nid, pol->nodes);
+		nid = next_node(nid, nodemask);
 	return nid;
 }
 
-- 
2.20.1




* Re: [PATCH v3] mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task
  2021-08-15  6:10 [PATCH v3] mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task yanghui
@ 2021-08-17  0:59 ` Andrew Morton
  2021-08-17  1:43   ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2021-08-17  0:59 UTC (permalink / raw)
  To: yanghui; +Cc: willy, songmuchun, linux-mm, linux-kernel

On Sun, 15 Aug 2021 14:10:34 +0800 yanghui <yanghui.def@bytedance.com> wrote:

> Servers happened below panic:
> Kernel version:5.4.56
> BUG: unable to handle page fault for address: 0000000000002c48
> RIP: 0010:__next_zones_zonelist+0x1d/0x40
> [264003.977696] RAX: 0000000000002c40 RBX: 0000000000100dca RCX: 0000000000000014
> [264003.977872] Call Trace:
> [264003.977888]  __alloc_pages_nodemask+0x277/0x310
> [264003.977908]  alloc_page_interleave+0x13/0x70
> [264003.977926]  handle_mm_fault+0xf99/0x1390
> [264003.977951]  __do_page_fault+0x288/0x500
> [264003.977979]  ? schedule+0x39/0xa0
> [264003.977994]  do_page_fault+0x30/0x110
> [264003.978010]  page_fault+0x3e/0x50
> 
> The reason of panic is that MAX_NUMNODES is passd in the third parameter
> in function __alloc_pages_nodemask(preferred_nid). So if to access
> zonelist->zoneref->zone_idx in __next_zones_zonelist the panic will happen.
> 
> In offset_il_node(), first_node() return nid from pol->v.nodes, after
> this other threads may changed pol->v.nodes before next_node().
> This race condition will let next_node return MAX_NUMNODES.So put
> pol->nodes in a local variable.
> 
> The race condition is between offset_il_node and cpuset_change_task_nodemask:
> CPU0:                                     CPU1:
> alloc_pages_vma()
>   interleave_nid(pol,)
>     offset_il_node(pol,)
>       first_node(pol->v.nodes)            cpuset_change_task_nodemask
>                       //nodes==0xc          mpol_rebind_task
>                                               mpol_rebind_policy
>                                                 mpol_rebind_nodemask(pol,nodes)
>                       //nodes==0x3
>       next_node(nid, pol->v.nodes)//return MAX_NUMNODES
> 
>
> ...
>
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1965,17 +1965,26 @@ unsigned int mempolicy_slab_node(void)
>   */
>  static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
>  {
> -	unsigned nnodes = nodes_weight(pol->nodes);
> -	unsigned target;
> +	nodemask_t nodemask = pol->nodes;

Ouch.  nodemask_t can be large - up to 128 bytes I think.  This looks
like an expensive thing to be adding to fast paths (alloc_pages_vma()).

Plus it consumes a lot of stack.
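
(For reference, a sketch of why the copy is that large, assuming the
definitions in include/linux/nodemask.h; the exact size depends on
CONFIG_NODES_SHIFT:)

/* nodemask_t is a bitmap of MAX_NUMNODES bits, where MAX_NUMNODES is
 * 1 << CONFIG_NODES_SHIFT.  With NODES_SHIFT = 10 that is 1024 bits,
 * i.e. 128 bytes copied to the stack per call. */
typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;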

> +	unsigned int target, nnodes;
>  	int i;
>  	int nid;
> +	/*
> +	 * The barrier will stabilize the nodemask in a register or on
> +	 * the stack so that it will stop changing under the code.
> +	 *
> +	 * Between first_node() and next_node(), pol->nodes could be changed
> +	 * by other threads. So we put pol->nodes in a local stack.
> +	 */
> +	barrier();
>  
> +	nnodes = nodes_weight(nodemask);
>  	if (!nnodes)
>  		return numa_node_id();
>  	target = (unsigned int)n % nnodes;
> -	nid = first_node(pol->nodes);
> +	nid = first_node(nodemask);
>  	for (i = 0; i < target; i++)
> -		nid = next_node(nid, pol->nodes);
> +		nid = next_node(nid, nodemask);
>  	return nid;
>  }

The whole idea seems a bit hacky and fragile to me.  We're dealing with
a potentially stale copy of the nodemask, yes?

Ordinarily this is troublesome because there could be other problems
caused by working off stale data and a better fix would be to simply
avoid using stale data!

But I guess that if the worst case is that once in a billion times,
interleaving hands out a page which isn't on the intended node then we
can live with that.

And if this guess is correct and it is indeed the case that this is the
worst case, can we please spell all this out in the changelog.




* Re: [PATCH v3] mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task
  2021-08-17  0:59 ` Andrew Morton
@ 2021-08-17  1:43   ` Matthew Wilcox
  2021-08-18 14:02     ` Muchun Song
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2021-08-17  1:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: yanghui, songmuchun, linux-mm, linux-kernel

On Mon, Aug 16, 2021 at 05:59:52PM -0700, Andrew Morton wrote:
> On Sun, 15 Aug 2021 14:10:34 +0800 yanghui <yanghui.def@bytedance.com> wrote:
> 
> > Servers happened below panic:
> > Kernel version:5.4.56
> > BUG: unable to handle page fault for address: 0000000000002c48
> > RIP: 0010:__next_zones_zonelist+0x1d/0x40
> > [264003.977696] RAX: 0000000000002c40 RBX: 0000000000100dca RCX: 0000000000000014
> > [264003.977872] Call Trace:
> > [264003.977888]  __alloc_pages_nodemask+0x277/0x310
> > [264003.977908]  alloc_page_interleave+0x13/0x70
> > [264003.977926]  handle_mm_fault+0xf99/0x1390
> > [264003.977951]  __do_page_fault+0x288/0x500
> > [264003.977979]  ? schedule+0x39/0xa0
> > [264003.977994]  do_page_fault+0x30/0x110
> > [264003.978010]  page_fault+0x3e/0x50
> > 
> > The reason of panic is that MAX_NUMNODES is passd in the third parameter
> > in function __alloc_pages_nodemask(preferred_nid). So if to access
> > zonelist->zoneref->zone_idx in __next_zones_zonelist the panic will happen.
> > 
> > In offset_il_node(), first_node() return nid from pol->v.nodes, after
> > this other threads may changed pol->v.nodes before next_node().
> > This race condition will let next_node return MAX_NUMNODES.So put
> > pol->nodes in a local variable.
> > 
> > The race condition is between offset_il_node and cpuset_change_task_nodemask:
> > CPU0:                                     CPU1:
> > alloc_pages_vma()
> >   interleave_nid(pol,)
> >     offset_il_node(pol,)
> >       first_node(pol->v.nodes)            cpuset_change_task_nodemask
> >                       //nodes==0xc          mpol_rebind_task
> >                                               mpol_rebind_policy
> >                                                 mpol_rebind_nodemask(pol,nodes)
> >                       //nodes==0x3
> >       next_node(nid, pol->v.nodes)//return MAX_NUMNODES
> > 
> >
> > ...
> >
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1965,17 +1965,26 @@ unsigned int mempolicy_slab_node(void)
> >   */
> >  static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
> >  {
> > -	unsigned nnodes = nodes_weight(pol->nodes);
> > -	unsigned target;
> > +	nodemask_t nodemask = pol->nodes;
> 
> Ouch.  nodemask_t can be large - up to 128 bytes I think.  This looks
> like an expensive thing to be adding to fast paths (alloc_pages_vma()).

Copying a fixed-size 128 bytes to the stack isn't going to be _that_
expensive.

> Plus it consumes a lot of stack.

alloc_pages_vma() tends to be a leaf function, so not that bad.

> > +	unsigned int target, nnodes;
> >  	int i;
> >  	int nid;
> > +	/*
> > +	 * The barrier will stabilize the nodemask in a register or on
> > +	 * the stack so that it will stop changing under the code.
> > +	 *
> > +	 * Between first_node() and next_node(), pol->nodes could be changed
> > +	 * by other threads. So we put pol->nodes in a local stack.
> > +	 */
> > +	barrier();

I think this could be an smp_rmb()?

> > +	nnodes = nodes_weight(nodemask);
> >  	if (!nnodes)
> >  		return numa_node_id();
> >  	target = (unsigned int)n % nnodes;
> > -	nid = first_node(pol->nodes);
> > +	nid = first_node(nodemask);
> >  	for (i = 0; i < target; i++)
> > -		nid = next_node(nid, pol->nodes);
> > +		nid = next_node(nid, nodemask);
> >  	return nid;
> >  }
> 
> The whole idea seems a bit hacky and fragile to be.  We're dealing with
> a potentially stale copy of the nodemask, yes?

Correct.  Also potentially a nodemask in the middle of being changed,
so it may be some unholy amalgam of previous and next.

> Ordinarily this is troublesome because there could be other problems
> caused by working off stale data and a better fix would be to simply
> avoid using stale data!
> 
> But I guess that if the worst case is that once in a billion times,
> interleaving hands out a page which isn't on the intended node then we
> can live with that.
> 
> And if this guess is correct and it is indeed the case that this is the
> worst case, can we please spell all this out in the changelog.

I think that taking a lock here is worse than copying to the stack.
But that seems like the kind of thing that could be measured?

I don't think that working off stale / amalgam data is a bad thing,
we only need consistency.  This is, after all, interleaved allocation.
The user has asked for us, more or less, to choose a node at random to
allocate from.

What ruffles my feathers more is that we call next_node() up to n-2 times,
and on average (n-1)/2 times (where n is the number of permitted nodes).
I can't help but feel that we could do better to randomly distribute
pages between nodes.  Even having a special case for all-bits-set or
n-contiguous-bits-set-and-all-other-bits-clear would go a long way to
speed this up.

I don't know if anyone has a real complaint about how long this takes
to choose a node, though.  I'm loath to optimise this without data.



* Re: [PATCH v3] mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task
  2021-08-17  1:43   ` Matthew Wilcox
@ 2021-08-18 14:02     ` Muchun Song
  2021-08-18 15:07       ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Muchun Song @ 2021-08-18 14:02 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, yanghui, Linux Memory Management List, LKML

On Tue, Aug 17, 2021 at 9:43 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Aug 16, 2021 at 05:59:52PM -0700, Andrew Morton wrote:
> > On Sun, 15 Aug 2021 14:10:34 +0800 yanghui <yanghui.def@bytedance.com> wrote:
> >
> > > Servers happened below panic:
> > > Kernel version:5.4.56
> > > BUG: unable to handle page fault for address: 0000000000002c48
> > > RIP: 0010:__next_zones_zonelist+0x1d/0x40
> > > [264003.977696] RAX: 0000000000002c40 RBX: 0000000000100dca RCX: 0000000000000014
> > > [264003.977872] Call Trace:
> > > [264003.977888]  __alloc_pages_nodemask+0x277/0x310
> > > [264003.977908]  alloc_page_interleave+0x13/0x70
> > > [264003.977926]  handle_mm_fault+0xf99/0x1390
> > > [264003.977951]  __do_page_fault+0x288/0x500
> > > [264003.977979]  ? schedule+0x39/0xa0
> > > [264003.977994]  do_page_fault+0x30/0x110
> > > [264003.978010]  page_fault+0x3e/0x50
> > >
> > > The reason of panic is that MAX_NUMNODES is passd in the third parameter
> > > in function __alloc_pages_nodemask(preferred_nid). So if to access
> > > zonelist->zoneref->zone_idx in __next_zones_zonelist the panic will happen.
> > >
> > > In offset_il_node(), first_node() return nid from pol->v.nodes, after
> > > this other threads may changed pol->v.nodes before next_node().
> > > This race condition will let next_node return MAX_NUMNODES.So put
> > > pol->nodes in a local variable.
> > >
> > > The race condition is between offset_il_node and cpuset_change_task_nodemask:
> > > CPU0:                                     CPU1:
> > > alloc_pages_vma()
> > >   interleave_nid(pol,)
> > >     offset_il_node(pol,)
> > >       first_node(pol->v.nodes)            cpuset_change_task_nodemask
> > >                       //nodes==0xc          mpol_rebind_task
> > >                                               mpol_rebind_policy
> > >                                                 mpol_rebind_nodemask(pol,nodes)
> > >                       //nodes==0x3
> > >       next_node(nid, pol->v.nodes)//return MAX_NUMNODES
> > >
> > >
> > > ...
> > >
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -1965,17 +1965,26 @@ unsigned int mempolicy_slab_node(void)
> > >   */
> > >  static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
> > >  {
> > > -   unsigned nnodes = nodes_weight(pol->nodes);
> > > -   unsigned target;
> > > +   nodemask_t nodemask = pol->nodes;
> >
> > Ouch.  nodemask_t can be large - up to 128 bytes I think.  This looks
> > like an expensive thing to be adding to fast paths (alloc_pages_vma()).
>
> Copying a fixed-size 128 bytes to the stack isn't going to be _that_
> expensive.
>
> > Plus it consumes a lot of stack.
>
> alloc_pages_vma() tends to be a leaf function, so not that bad.
>
> > > +   unsigned int target, nnodes;
> > >     int i;
> > >     int nid;
> > > +   /*
> > > +    * The barrier will stabilize the nodemask in a register or on
> > > +    * the stack so that it will stop changing under the code.
> > > +    *
> > > +    * Between first_node() and next_node(), pol->nodes could be changed
> > > +    * by other threads. So we put pol->nodes in a local stack.
> > > +    */
> > > +   barrier();
>
> I think this could be an smp_rmb()?

Hi Matthew,

I have a question. Why is barrier() not enough?

Thanks.

>
> > > +   nnodes = nodes_weight(nodemask);
> > >     if (!nnodes)
> > >             return numa_node_id();
> > >     target = (unsigned int)n % nnodes;
> > > -   nid = first_node(pol->nodes);
> > > +   nid = first_node(nodemask);
> > >     for (i = 0; i < target; i++)
> > > -           nid = next_node(nid, pol->nodes);
> > > +           nid = next_node(nid, nodemask);
> > >     return nid;
> > >  }
> >
> > The whole idea seems a bit hacky and fragile to be.  We're dealing with
> > a potentially stale copy of the nodemask, yes?
>
> Correct.  Also potentially a nodemask in the middle of being changed,
> so it may be some unholy amalgam of previous and next.
>
> > Ordinarily this is troublesome because there could be other problems
> > caused by working off stale data and a better fix would be to simply
> > avoid using stale data!
> >
> > But I guess that if the worst case is that once in a billion times,
> > interleaving hands out a page which isn't on the intended node then we
> > can live with that.
> >
> > And if this guess is correct and it is indeed the case that this is the
> > worst case, can we please spell all this out in the changelog.
>
> I think that taking a lock here is worse than copying to the stack.
> But that seems like the kind of thing that could be measured?
>
> I don't think that working off stale / amalgam data is a bad thing,
> we only need consistency.  This is, after all, interleaved allocation.
> The user has asked for us, more or less, to choose a node at random to
> allocate from.
>
> What ruffles my feathers more is that we call next_node() up to n-2 times,
> and on average (n-1)/2 times (where n is the number of permitted nodes).
> I can't help but feel that we could do better to randomly distribute
> pages between nodes.  Even having a special case for all-bits-set or
> n-contiguous-bits-set-and-all-other-bits-clear would go a long way to
> speed this up.
>
> I don't know if anyone has a real complaint about how long this takes
> to choose a node, though.  I'm loathe to optimise this without data.



* Re: [PATCH v3] mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task
  2021-08-18 14:02     ` Muchun Song
@ 2021-08-18 15:07       ` Matthew Wilcox
  2021-08-19  2:04         ` Muchun Song
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2021-08-18 15:07 UTC (permalink / raw)
  To: Muchun Song; +Cc: Andrew Morton, yanghui, Linux Memory Management List, LKML

On Wed, Aug 18, 2021 at 10:02:46PM +0800, Muchun Song wrote:
> On Tue, Aug 17, 2021 at 9:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > +   unsigned int target, nnodes;
> > > >     int i;
> > > >     int nid;
> > > > +   /*
> > > > +    * The barrier will stabilize the nodemask in a register or on
> > > > +    * the stack so that it will stop changing under the code.
> > > > +    *
> > > > +    * Between first_node() and next_node(), pol->nodes could be changed
> > > > +    * by other threads. So we put pol->nodes in a local stack.
> > > > +    */
> > > > +   barrier();
> >
> > I think this could be an smp_rmb()?
> 
> Hi Matthew,
> 
> I have a question. Why is barrier() not enough?

I think barrier() may be more than is necessary.  We don't need a
barrier on non-SMP systems (or do we?)  And we only need to order reads,
not writes.



* Re: [PATCH v3] mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task
  2021-08-18 15:07       ` Matthew Wilcox
@ 2021-08-19  2:04         ` Muchun Song
  0 siblings, 0 replies; 6+ messages in thread
From: Muchun Song @ 2021-08-19  2:04 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, yanghui, Linux Memory Management List, LKML

On Wed, Aug 18, 2021 at 11:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Aug 18, 2021 at 10:02:46PM +0800, Muchun Song wrote:
> > On Tue, Aug 17, 2021 at 9:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > +   unsigned int target, nnodes;
> > > > >     int i;
> > > > >     int nid;
> > > > > +   /*
> > > > > +    * The barrier will stabilize the nodemask in a register or on
> > > > > +    * the stack so that it will stop changing under the code.
> > > > > +    *
> > > > > +    * Between first_node() and next_node(), pol->nodes could be changed
> > > > > +    * by other threads. So we put pol->nodes in a local stack.
> > > > > +    */
> > > > > +   barrier();
> > >
> > > I think this could be an smp_rmb()?
> >
> > Hi Matthew,
> >
> > I have a question. Why is barrier() not enough?
>
> I think barrier() may be more than is necessary.  We don't need a
> barrier on non-SMP systems (or do we?)  And we only need to order reads,
> not writes.

Here barrier() is just a compiler barrier, which is cheaper than
smp_rmb(); smp_rmb() usually amounts to a memory-barrier instruction
plus barrier(). So I think barrier(), which stabilizes the nodemask
in a register or on the stack, is more appropriate here.
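
(For reference, a rough sketch of the two primitives; details vary by
architecture and kernel version:)

/* Compiler barrier only: prevents the compiler from reordering or
 * re-reading memory across this point; emits no CPU instruction. */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* smp_rmb() orders reads against later reads as observed by other
 * CPUs.  On strongly ordered architectures such as x86 it reduces to
 * barrier(); on weakly ordered ones it also emits a read fence. */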

Thanks.



Thread overview: 6+ messages
2021-08-15  6:10 [PATCH v3] mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task yanghui
2021-08-17  0:59 ` Andrew Morton
2021-08-17  1:43   ` Matthew Wilcox
2021-08-18 14:02     ` Muchun Song
2021-08-18 15:07       ` Matthew Wilcox
2021-08-19  2:04         ` Muchun Song
