linux-kernel.vger.kernel.org archive mirror
* futex: clarification needed with drop_futex_key_refs and memory barriers
@ 2016-03-26 15:56 Petros Koutoupis
  2016-03-27  6:25 ` Mike Galbraith
  0 siblings, 1 reply; 6+ messages in thread
From: Petros Koutoupis @ 2016-03-26 15:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: petros

I stumbled on an interesting scenario which I am unable to fully explain and I
was hoping to get some other opinions on why this would or wouldn't work.

In recent testing on a 48-core Haswell arch server, our multi-threaded user space
application was utilizing 60% to 100% more CPU than on our smaller 24-core servers
(running an identical load). After spending a considerable amount of time analyzing
stack dumps and straces it became immediately apparent that those exact threads
operating with the higher CPU utilization were off in futex land.

Shortly afterward I stumbled on commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0
(futex: Ensure get_futex_key_refs() always implies a barrier) which addressed the
handling of private futexes and prevented a race condition by completing the
function with a memory barrier. Now, I fully understand why this patch was implemented:
to have a memory barrier before checking the "waiters." It makes sense. What doesn't
make sense (so far) is when I apply the same patch to the drop counterpart,
drop_futex_key_refs(), and the problem goes away. See the change and my notes below.


--- linux/kernel/futex.c.orig   2016-03-25 19:45:08.169563263 -0500
+++ linux/kernel/futex.c        2016-03-25 19:46:06.901562211 -0500
@@ -438,11 +438,13 @@ static void drop_futex_key_refs(union fu

        switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
        case FUT_OFF_INODE:
-               iput(key->shared.inode);
+               iput(key->shared.inode); /* implies smp_mb(); (B) */
                break;
        case FUT_OFF_MMSHARED:
-               mmdrop(key->private.mm);
+               mmdrop(key->private.mm); /* implies smp_mb(); (B) */
                break;
+       default:
+               smp_mb(); /* explicit smp_mb(); (B) */
        }
 }


The iput() and mmdrop() routines in the switch statement eventually use
atomic_dec_and_test(), which according to Documentation/memory-barriers.txt
implies an smp_mb() on each side of the actual operation. Notice that private
futexes aren't handled by this switch (read below).

There is a wrapper, put_futex_key(), which is called in a few functions as a
way to clean up before retrying, but in every case, before it is
called, a check is made to see if the futex is private and, if so, the
function retries at a more appropriate point.

Now I have found two functions where this type of check/protection isn't made,
and I am curious whether I have stumbled on what could potentially lead to a race
condition in a large SMP environment. Please refer to futex_wait() (called indirectly
via unqueue_me()) and futex_requeue().

Any thoughts or opinions would be greatly appreciated. Thank you in advance.

--
Petros

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: futex: clarification needed with drop_futex_key_refs and memory barriers
  2016-03-26 15:56 futex: clarification needed with drop_futex_key_refs and memory barriers Petros Koutoupis
@ 2016-03-27  6:25 ` Mike Galbraith
  2016-03-29  9:50   ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Mike Galbraith @ 2016-03-27  6:25 UTC (permalink / raw)
  To: Petros Koutoupis, linux-kernel
  Cc: Paul E. McKenney, Thomas Gleixner, Peter Zijlstra

(futex ordering pop-flare)

On Sat, 2016-03-26 at 10:56 -0500, Petros Koutoupis wrote:
> I stumbled on an interesting scenario which I am unable to fully explain and I
> was hoping to get some other opinions on why this would or wouldn't work.
> 
> In recent testing on a 48-core Haswell arch server, our multi-threaded user space
> application was utilizing 60% to 100% more CPU than on our smaller 24-core servers
> (running an identical load). After spending a considerable amount of time analyzing
> stack dumps and straces it became immediately apparent that those exact threads
> operating with the higher CPU utilization were off in futex land.
> 
> Shortly afterward I stumbled on commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0
> (futex: Ensure get_futex_key_refs() always implies a barrier) which addressed the
> handling of private futexes and prevented a race condition by completing the
> function with a memory barrier. Now, I fully understand why this patch was implemented:
> to have a memory barrier before checking the "waiters." It makes sense. What doesn't
> make sense (so far) is when I apply the same patch to the drop counterpart,
> drop_futex_key_refs(), and the problem goes away. See the change and my notes below.
> 
> 
> --- linux/kernel/futex.c.orig   2016-03-25 19:45:08.169563263 -0500
> +++ linux/kernel/futex.c        2016-03-25 19:46:06.901562211 -0500
> @@ -438,11 +438,13 @@ static void drop_futex_key_refs(union fu
> 
>         switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
>         case FUT_OFF_INODE:
> -               iput(key->shared.inode);
> +               iput(key->shared.inode); /* implies smp_mb(); (B) */
>                 break;
>         case FUT_OFF_MMSHARED:
> -               mmdrop(key->private.mm);
> +               mmdrop(key->private.mm); /* implies smp_mb(); (B) */
>                 break;
> +       default:
> +               smp_mb(); /* explicit smp_mb(); (B) */
>         }
>  }
> 
> 
> The iput() and mmdrop() routines in the switch statement eventually use
> atomic_dec_and_test(), which according to Documentation/memory-barriers.txt
> implies an smp_mb() on each side of the actual operation. Notice that private
> futexes aren't handled by this switch (read below).
> 
> There is a wrapper, put_futex_key(), which is called in a few functions as a
> way to clean up before retrying, but in every case, before it is
> called, a check is made to see if the futex is private and, if so, the
> function retries at a more appropriate point.
> 
> Now I have found two functions where this type of check/protection isn't made,
> and I am curious whether I have stumbled on what could potentially lead to a race
> condition in a large SMP environment. Please refer to futex_wait() (called indirectly
> via unqueue_me()) and futex_requeue().
> 
> Any thoughts or opinions would be greatly appreciated. Thank you in advance.
> 
> --
> Petros
> 


* Re: futex: clarification needed with drop_futex_key_refs and memory barriers
  2016-03-27  6:25 ` Mike Galbraith
@ 2016-03-29  9:50   ` Peter Zijlstra
  2016-03-31  1:45     ` Petros Koutoupis
  2016-04-02  6:39     ` Davidlohr Bueso
  0 siblings, 2 replies; 6+ messages in thread
From: Peter Zijlstra @ 2016-03-29  9:50 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Petros Koutoupis, linux-kernel, Paul E. McKenney,
	Thomas Gleixner, Davidlohr Bueso, Linus Torvalds,
	catalin.marinas, H. Peter Anvin

On Sun, Mar 27, 2016 at 08:25:48AM +0200, Mike Galbraith wrote:
> (futex ordering pop-flare)
> 
> On Sat, 2016-03-26 at 10:56 -0500, Petros Koutoupis wrote:
> > I stumbled on an interesting scenario which I am unable to fully explain and I
> > was hoping to get some other opinions on why this would or wouldn't work.
> > 
> > In recent testing on a 48-core Haswell arch server, our multi-threaded user space
> > application was utilizing 60% to 100% more CPU than on our smaller 24-core servers
> > (running an identical load). After spending a considerable amount of time analyzing
> > stack dumps and straces it became immediately apparent that those exact threads
> > operating with the higher CPU utilization were off in futex land.

perf should be able to tell you that pretty quickly, no?

A question about your workload: are you typically 'stuck' in
futex_wait() or in some of the requeue stuff (which is more complex)? Let
me assume regular futex_wait() for now.

> > Shortly afterward I stumbled on commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0
> > (futex: Ensure get_futex_key_refs() always implies a barrier) which addressed the
> > handling of private futexes and prevented a race condition by completing the
> > function with a memory barrier. Now, I fully understand why this patch was implemented:
> > to have a memory barrier before checking the "waiters." It makes sense. 

Right; well, a barrier isn't _before_ something, it's between two things.
And I think we're ordering the user-space store to the futex value
against the waiters load, but the code could do with a comment here.

And this should be totally irrelevant for x86 because a CPL change
already implies a full barrier.

> > What doesn't
> > make sense (so far) is when I apply the same patch to the drop counterpart,
> > drop_futex_key_refs(), and the problem goes away. See the change and my notes below.
> > 
> > 
> > --- linux/kernel/futex.c.orig   2016-03-25 19:45:08.169563263 -0500
> > +++ linux/kernel/futex.c        2016-03-25 19:46:06.901562211 -0500
> > @@ -438,11 +438,13 @@ static void drop_futex_key_refs(union fu
> > 
> >         switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
> >         case FUT_OFF_INODE:
> > -               iput(key->shared.inode);
> > +               iput(key->shared.inode); /* implies smp_mb(); (B) */
> >                 break;
> >         case FUT_OFF_MMSHARED:
> > -               mmdrop(key->private.mm);
> > +               mmdrop(key->private.mm); /* implies smp_mb(); (B) */
> >                 break;
> > +       default:
> > +               smp_mb(); /* explicit smp_mb(); (B) */
> >         }
> >  }

So the patch makes sense from a symmetry POV, but I'm having a hard time
explaining why this would make any difference to your workload.

The best I can come up with is that the explicit MFENCE (as opposed to a
LOCK prefixed instruction) causes a store-buffer flush and hence the
waker more often sees !waiters and therefore tries to acquire the bucket
lock less.

But this is not a correctness (nor ordering) issue; but purely an
architectural side-effect. Furthermore; some proposed changes:

  http://marc.info/?l=linux-kernel&m=145400059704564&w=2

might change this side-effect.


In any case; the below (completely irrelevant patch for you) is
something I would propose. It gives hb_waiters_dec() RELEASE-like
semantics and ensures it cannot creep into the lock sections it's
typically behind. Although strictly speaking I think it being inside
that lock region is sufficient.

It also re-orders the increment in requeue to happen before we add to
the list (as is the proper order) and removes a totally bogus comment;
spin_lock() does _NOT_ imply a full barrier.

---
 kernel/futex.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5d6ce6413ef1..615f277f313b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -360,6 +360,7 @@ static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
 static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
 {
 #ifdef CONFIG_SMP
+	smp_mb__before_atomic();
 	atomic_dec(&hb->waiters);
 #endif
 }
@@ -1442,8 +1443,8 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 	if (likely(&hb1->chain != &hb2->chain)) {
 		plist_del(&q->list, &hb1->chain);
 		hb_waiters_dec(hb1);
-		plist_add(&q->list, &hb2->chain);
 		hb_waiters_inc(hb2);
+		plist_add(&q->list, &hb2->chain);
 		q->lock_ptr = &hb2->lock;
 	}
 	get_futex_key_refs(key2);
@@ -1864,7 +1865,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 
 	q->lock_ptr = &hb->lock;
 
-	spin_lock(&hb->lock); /* implies MB (A) */
+	spin_lock(&hb->lock);
 	return hb;
 }
 


* Re: futex: clarification needed with drop_futex_key_refs and memory barriers
  2016-03-29  9:50   ` Peter Zijlstra
@ 2016-03-31  1:45     ` Petros Koutoupis
  2016-03-31  7:36       ` Peter Zijlstra
  2016-04-02  6:39     ` Davidlohr Bueso
  1 sibling, 1 reply; 6+ messages in thread
From: Petros Koutoupis @ 2016-03-31  1:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, linux-kernel, Paul E. McKenney, Thomas Gleixner,
	Davidlohr Bueso, Linus Torvalds, catalin.marinas, H. Peter Anvin

Peter,

Thank you very much for your input. Please read my comments below...

On Tue, 2016-03-29 at 11:50 +0200, Peter Zijlstra wrote:
> On Sun, Mar 27, 2016 at 08:25:48AM +0200, Mike Galbraith wrote:
> > (futex ordering pop-flare)
> > 
> > On Sat, 2016-03-26 at 10:56 -0500, Petros Koutoupis wrote:
> > > I stumbled on an interesting scenario which I am unable to fully explain and I
> > > was hoping to get some other opinions on why this would or wouldn't work.
> > > 
> > > In recent testing on a 48-core Haswell arch server, our multi-threaded user space
> > > application was utilizing 60% to 100% more CPU than on our smaller 24-core servers
> > > (running an identical load). After spending a considerable amount of time analyzing
> > > stack dumps and straces it became immediately apparent that those exact threads
> > > operating with the higher CPU utilization were off in futex land.
> 
> perf should be able to tell you that pretty quickly, no?
> 

In an ideal scenario perf would work well, but unfortunately it isn't
possible in our custom and quite limited environment. At least, not at
the moment; we are currently working to correct this.

> A question about your workload: are you typically 'stuck' in
> futex_wait() or in some of the requeue stuff (which is more complex)? Let
> me assume regular futex_wait() for now.
> 

It is futex_wait().

> > > Shortly afterward I stumbled on commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0
> > > (futex: Ensure get_futex_key_refs() always implies a barrier) which addressed the
> > > handling of private futexes and prevented a race condition by completing the
> > > function with a memory barrier. Now, I fully understand why this patch was implemented:
> > > to have a memory barrier before checking the "waiters." It makes sense. 
> 
> Right; well, a barrier isn't _before_ something, it's between two things.
> And I think we're ordering the user-space store to the futex value
> against the waiters load, but the code could do with a comment here.
> 
> And this should be totally irrelevant for x86 because a CPL change
> already implies a full barrier.
> 
> > > What doesn't
> > > make sense (so far) is when I apply the same patch to the drop counterpart,
> > > drop_futex_key_refs(), and the problem goes away. See the change and my notes below.
> > > 
> > > 
> > > --- linux/kernel/futex.c.orig   2016-03-25 19:45:08.169563263 -0500
> > > +++ linux/kernel/futex.c        2016-03-25 19:46:06.901562211 -0500
> > > @@ -438,11 +438,13 @@ static void drop_futex_key_refs(union fu
> > > 
> > >         switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
> > >         case FUT_OFF_INODE:
> > > -               iput(key->shared.inode);
> > > +               iput(key->shared.inode); /* implies smp_mb(); (B) */
> > >                 break;
> > >         case FUT_OFF_MMSHARED:
> > > -               mmdrop(key->private.mm);
> > > +               mmdrop(key->private.mm); /* implies smp_mb(); (B) */
> > >                 break;
> > > +       default:
> > > +               smp_mb(); /* explicit smp_mb(); (B) */
> > >         }
> > >  }
> 
> So the patch makes sense from a symmetry POV, but I'm having a hard time
> explaining why this would make any difference to your workload.
> 
> The best I can come up with is that the explicit MFENCE (as opposed to a
> LOCK prefixed instruction) causes a store-buffer flush and hence the
> waker more often sees !waiters and therefore tries to acquire the bucket
> lock less.
> 
> But this is not a correctness (nor ordering) issue; but purely an
> architectural side-effect. Furthermore; some proposed changes:
> 
>   http://marc.info/?l=linux-kernel&m=145400059704564&w=2
> 
> might change this side-effect.
> 

Has there been an update to this patch? From what I can see, the
conversation ended at the end of January, and there hasn't been a change
in mainline.

> 
> In any case; the below (completely irrelevant patch for you) is
> something I would propose. It gives hb_waiters_dec() RELEASE-like
> semantics and ensures it cannot creep into the lock sections it's
> typically behind. Although strictly speaking I think it being inside
> that lock region is sufficient.
> 
> It also re-orders the increment in requeue to happen before we add to
> the list (as is the proper order) and removes a totally bogus comment;
> spin_lock() does _NOT_ imply a full barrier.
> 
> ---
>  kernel/futex.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 5d6ce6413ef1..615f277f313b 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -360,6 +360,7 @@ static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
>  static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
>  {
>  #ifdef CONFIG_SMP
> +	smp_mb__before_atomic();
>  	atomic_dec(&hb->waiters);
>  #endif
>  }
> @@ -1442,8 +1443,8 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
>  	if (likely(&hb1->chain != &hb2->chain)) {
>  		plist_del(&q->list, &hb1->chain);
>  		hb_waiters_dec(hb1);
> -		plist_add(&q->list, &hb2->chain);
>  		hb_waiters_inc(hb2);
> +		plist_add(&q->list, &hb2->chain);
>  		q->lock_ptr = &hb2->lock;
>  	}
>  	get_futex_key_refs(key2);
> @@ -1864,7 +1865,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
>  
>  	q->lock_ptr = &hb->lock;
>  
> -	spin_lock(&hb->lock); /* implies MB (A) */
> +	spin_lock(&hb->lock);
>  	return hb;
>  }
>  

Your adjustments here make complete sense. Are you preparing it for
submission in the near future?


* Re: futex: clarification needed with drop_futex_key_refs and memory barriers
  2016-03-31  1:45     ` Petros Koutoupis
@ 2016-03-31  7:36       ` Peter Zijlstra
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Zijlstra @ 2016-03-31  7:36 UTC (permalink / raw)
  To: Petros Koutoupis
  Cc: Mike Galbraith, linux-kernel, Paul E. McKenney, Thomas Gleixner,
	Davidlohr Bueso, Linus Torvalds, catalin.marinas, H. Peter Anvin

On Wed, Mar 30, 2016 at 08:45:10PM -0500, Petros Koutoupis wrote:
> > But this is not a correctness (nor ordering) issue; but purely an
> > architectural side-effect. Furthermore; some proposed changes:
> > 
> >   http://marc.info/?l=linux-kernel&m=145400059704564&w=2
> > 
> > might change this side-effect.
> > 
> 
> Has there been an update to this patch? From what I can see, the
> conversation ended at the end of January, and there hasn't been a change
> in mainline.

Peter Anvin was going to look at this with some of the Intel hardware
people to fully explore the ramifications of this change, we're waiting
on feedback from that.

> Your adjustments here make complete sense. Are you preparing it for
> submission in the near future?

I'll think about it; adding the extra barrier for the decrement is of course
not really nice if not strictly required. And while it will not impact
x86, the weakly ordered archs will be affected.


* Re: futex: clarification needed with drop_futex_key_refs and memory barriers
  2016-03-29  9:50   ` Peter Zijlstra
  2016-03-31  1:45     ` Petros Koutoupis
@ 2016-04-02  6:39     ` Davidlohr Bueso
  1 sibling, 0 replies; 6+ messages in thread
From: Davidlohr Bueso @ 2016-04-02  6:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Petros Koutoupis, linux-kernel, Paul E. McKenney,
	Thomas Gleixner, Linus Torvalds, catalin.marinas, H. Peter Anvin

On Tue, 29 Mar 2016, Peter Zijlstra wrote:

>In any case; the below (completely irrelevant patch for you) is
>something I would propose. It gives hb_waiter_dec() RELEASE like
>semantics and ensures it cannot creep into the lock sections its
>typically behind. Although strictly speaking I think it being inside
>that lock region is sufficient.

Indeed, it should be sufficient. Racing with waiter decrements
(leaking into the hb critical region) is perfectly ok as the
consequence is that the reader thread will simply go take the lock
by not seeing the 1->0 dec, and therefore no harm done. Something that
0->1 cannot afford to rely on, obviously. So I think we can save the extra
barrier for the release semantics and keep the call relaxed, as the
performance penalty would be higher. Or are you referring to something else?

>
>It also re-orders the increment in requeue to happen before we add to
>the list (as is the proper order)

ack to this part -- should be a separate patch.

Thanks,
Davidlohr

