* RFC: Ideal Adaptive Spinning Conditions
@ 2010-03-31 23:21 Darren Hart
  2010-03-31 23:35 ` Roland Dreier
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Darren Hart @ 2010-03-31 23:21 UTC (permalink / raw)
  To: lkml,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Chris Wright,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet

I'm looking at some adaptive spinning with futexes as a way to help 
reduce the dependence on sched_yield() to implement userspace spinlocks. 
Chris, I included you in the CC after reading your comments regarding 
sched_yield() at kernel summit and I thought you might be interested.

I have an experimental patchset that implements FUTEX_LOCK and 
FUTEX_LOCK_ADAPTIVE in the kernel and uses something akin to 
mutex_spin_on_owner() for the first waiter to spin. What I'm finding is 
that adaptive spinning actually hurts my particular test case, so I was 
hoping to poll people for context regarding the existing adaptive 
spinning implementations in the kernel as to where we see benefit. Under 
which conditions does adaptive spinning help?

I presume locks with a short average hold time stand to gain the most as 
the longer the lock is held the more likely the spinner will expire its 
timeslice or that the scheduling gain becomes noise in the acquisition 
time. My test case simply calls "lock();unlock()" for a fixed number of 
iterations and reports the iterations per second at the end of the run. 
It can run with an arbitrary number of threads as well. I typically run 
with 256 threads for 10M iterations.

          futex_lock: Result: 635 Kiter/s
futex_lock_adaptive: Result: 542 Kiter/s

I've limited the number of spinners to 1 but feel that perhaps this 
should be configurable as locks with very short hold times could benefit 
from up to NR_CPUS-1 spinners.

I'd really appreciate any data, or just general insight, you may have 
acquired while implementing adaptive spinning for rt-mutexes and 
mutexes. Open questions for me regarding conditions where adaptive 
spinning helps are:

o What type of lock hold times do we expect to benefit?
o How much contention is a good match for adaptive spinning?
   - this is related to the number of threads to run in the test
o How many spinners should be allowed?

I can share the kernel patches if people are interested, but they are 
really early, and I'm not sure they are of much value until I better 
understand the conditions where this is expected to be useful.

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-03-31 23:21 RFC: Ideal Adaptive Spinning Conditions Darren Hart
@ 2010-03-31 23:35 ` Roland Dreier
  2010-04-01  2:03   ` Darren Hart
  2010-03-31 23:38 ` Steven Rostedt
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Roland Dreier @ 2010-03-31 23:35 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Chris Wright,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet

 > I'm looking at some adaptive spinning with futexes as a way to help
 > reduce the dependence on sched_yield() to implement userspace
 > spinlocks. Chris, I included you in the CC after reading your comments
 > regarding sched_yield() at kernel summit and I thought you might be
 > interested.

I think you may have the wrong Chris.  Chris Wright wasn't at the most
recent kernel summit and I do have a vague recollection of Chris Mason
mentioning that Oracle developers found that sched_yield() performed
better than futexes + kernel scheduling.
-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-03-31 23:21 RFC: Ideal Adaptive Spinning Conditions Darren Hart
  2010-03-31 23:35 ` Roland Dreier
@ 2010-03-31 23:38 ` Steven Rostedt
  2010-04-01  0:17   ` Peter W. Morreale
  2010-04-01  2:13   ` Darren Hart
  2010-04-01  2:10 ` Darren Hart
  2010-04-01 14:20 ` Avi Kivity
  3 siblings, 2 replies; 24+ messages in thread
From: Steven Rostedt @ 2010-03-31 23:38 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Peter Morreale, Chris Wright, Thomas Gleixner, Ingo Molnar,
	Eric Dumazet

On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:

> o What type of lock hold times do we expect to benefit?

0 (that's a zero) :-p

I haven't seen your patches but you are not doing a heuristic approach,
are you? That is, do not "spin" hoping the lock will suddenly become
free. I was against that for -rt and I would be against that for futex
too.

> o How much contention is a good match for adaptive spinning?
>    - this is related to the number of threads to run in the test
> o How many spinners should be allowed?
> 
> I can share the kernel patches if people are interested, but they are 
> really early, and I'm not sure they are of much value until I better 
> understand the conditions where this is expected to be useful.

Again, I don't know how you implemented your adaptive spinners, but the
trick to it in -rt was that it would only spin while the owner of the
lock was actually running. If it was not running, it would sleep. No
point waiting for a sleeping task to release its lock.

Is this what you did? Because, IIRC, this only benefited spinlocks
converted to mutexes. It did not help with semaphores, because
semaphores could be held for a long time. Thus, it was good for short
held locks, but hurt performance on long held locks.

If userspace is going to do this, I guess the blocked task would need to
go into kernel, and spin there (with preempt enabled) if the task is
still active and holding the lock.

Then the application would need to determine which to use. An adaptive
spinner for short held locks, and a normal futex for long held locks.
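The rule described here can be sketched in userspace C. This is a hedged illustration, not the actual patch: task_is_running() is a hypothetical stand-in for the kernel's check of the owner's state (roughly what mutex_spin_on_owner() does by watching whether the owner is on a CPU):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct futex_lock {
	atomic_int owner_tid;              /* 0 when unlocked */
};

static bool owner_on_cpu = true;       /* stand-in for the owner's on-CPU state */

static bool task_is_running(int tid)
{
	(void)tid;
	return owner_on_cpu;
}

/* Spin only while the lock is held AND the holder is actually running;
 * return true if we acquired the lock, false if the caller should fall
 * back to sleeping (futex wait in the real implementation). */
static bool adaptive_spin(struct futex_lock *l, int self)
{
	for (;;) {
		int owner = atomic_load(&l->owner_tid);
		int zero = 0;

		if (owner == 0 &&
		    atomic_compare_exchange_strong(&l->owner_tid, &zero, self))
			return true;           /* lock acquired */
		if (owner != 0 && !task_is_running(owner))
			return false;          /* owner asleep: no point spinning */
		/* owner still on CPU: keep spinning (cpu_relax() here in
		 * kernel code) */
	}
}
```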

-- Steve



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-03-31 23:38 ` Steven Rostedt
@ 2010-04-01  0:17   ` Peter W. Morreale
  2010-04-01  2:25     ` Darren Hart
  2010-04-03 17:51     ` john cooper
  2010-04-01  2:13   ` Darren Hart
  1 sibling, 2 replies; 24+ messages in thread
From: Peter W. Morreale @ 2010-04-01  0:17 UTC (permalink / raw)
  To: rostedt
  Cc: Darren Hart, lkml,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Chris Wright, Thomas Gleixner, Ingo Molnar, Eric Dumazet

On Wed, 2010-03-31 at 19:38 -0400, Steven Rostedt wrote:
> On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:
> 
> > o What type of lock hold times do we expect to benefit?
> 
> 0 (that's a zero) :-p
> 
> I haven't seen your patches but you are not doing a heuristic approach,
> are you? That is, do not "spin" hoping the lock will suddenly become
> free. I was against that for -rt and I would be against that for futex
> too.
> 
> > o How much contention is a good match for adaptive spinning?
> >    - this is related to the number of threads to run in the test
> > o How many spinners should be allowed?
> > 
> > I can share the kernel patches if people are interested, but they are 
> > really early, and I'm not sure they are of much value until I better 
> > understand the conditions where this is expected to be useful.
> 
> Again, I don't know how you implemented your adaptive spinners, but the
> trick to it in -rt was that it would only spin while the owner of the
> lock was actually running. If it was not running, it would sleep. No
> point waiting for a sleeping task to release its lock.

Right.  This was *critical* for the adaptive rtmutex.   Note in the RT
patch, everybody spins as long as the current owner is on CPU.   

FWIW, IIRC, Solaris has a heuristic approach where incoming tasks spin
for a period of time before going to sleep.  (Cray UNICOS did the same) 

> 
> Is this what you did? Because, IIRC, this only benefited spinlocks
> converted to mutexes. It did not help with semaphores, because
> semaphores could be held for a long time. Thus, it was good for short
> held locks, but hurt performance on long held locks.
> 

nod.  The entire premise was based on the fact that we were converting
spinlocks, which by definition were short held locks.  What I found
during early development was that the sleep/wakeup cycle was more
intrusive for RT than spinning.  

IIRC, I measured something like 380k context switches/second prior to
the adaptive patches for a dbench test and we cut this down to somewhere
around 50k, with a corresponding increase in throughput.  (I can't
remember specific numbers any more, it was a while ago... ;-)

When applied to semaphores, the benefit was in the noise range as I
recall.

(dbench was chosen due to the heavy contention on the dcache spinlock) 


Best,
-PWM


> If userspace is going to do this, I guess the blocked task would need to
> go into kernel, and spin there (with preempt enabled) if the task is
> still active and holding the lock.
> 
> Then the application would need to determine which to use. An adaptive
> spinner for short held locks, and a normal futex for long held locks.
> 
> -- Steve
> 
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-03-31 23:35 ` Roland Dreier
@ 2010-04-01  2:03   ` Darren Hart
  2010-04-01 17:02     ` Chris Wright
  0 siblings, 1 reply; 24+ messages in thread
From: Darren Hart @ 2010-04-01  2:03 UTC (permalink / raw)
  To: Roland Dreier
  Cc: lkml,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Chris Wright,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet

Roland Dreier wrote:
>  > I'm looking at some adaptive spinning with futexes as a way to help
>  > reduce the dependence on sched_yield() to implement userspace
>  > spinlocks. Chris, I included you in the CC after reading your comments
>  > regarding sched_yield() at kernel summit and I thought you might be
>  > interested.
> 
> I think you may have the wrong Chris.  Chris Wright wasn't at the most
> recent kernel summit and I do have a vague recollection of Chris Mason
> mentioning that Oracle developers found that sched_yield() performed
> better than futexes + kernel scheduling.

Oh bother. Sorry, Chris! I even talked to Chris Wright about this in 
person, and he told me I was probably looking for Chris Mason, and I 
still got the last names crossed up!

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-03-31 23:21 RFC: Ideal Adaptive Spinning Conditions Darren Hart
  2010-03-31 23:35 ` Roland Dreier
  2010-03-31 23:38 ` Steven Rostedt
@ 2010-04-01  2:10 ` Darren Hart
  2010-04-01 14:04   ` Chris Mason
  2010-04-01 14:20 ` Avi Kivity
  3 siblings, 1 reply; 24+ messages in thread
From: Darren Hart @ 2010-04-01  2:10 UTC (permalink / raw)
  To: lkml,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Thomas Gleixner,
	Ingo Molnar, Eric Dumazet, Chris Mason

CC'ing the right Chris this time.

Darren Hart wrote:
> I'm looking at some adaptive spinning with futexes as a way to help 
> reduce the dependence on sched_yield() to implement userspace spinlocks. 
> Chris, I included you in the CC after reading your comments regarding 
> sched_yield() at kernel summit and I thought you might be interested.
> 
> I have an experimental patchset that implements FUTEX_LOCK and 
> FUTEX_LOCK_ADAPTIVE in the kernel and uses something akin to 
> mutex_spin_on_owner() for the first waiter to spin. What I'm finding is 
> that adaptive spinning actually hurts my particular test case, so I was 
> hoping to poll people for context regarding the existing adaptive 
> spinning implementations in the kernel as to where we see benefit. Under 
> which conditions does adaptive spinning help?
> 
> I presume locks with a short average hold time stand to gain the most as 
> the longer the lock is held the more likely the spinner will expire its 
> timeslice or that the scheduling gain becomes noise in the acquisition 
> time. My test case simply calls "lock();unlock()" for a fixed number of 
> iterations and reports the iterations per second at the end of the run. 
> It can run with an arbitrary number of threads as well. I typically run 
> with 256 threads for 10M iterations.
> 
>          futex_lock: Result: 635 Kiter/s
> futex_lock_adaptive: Result: 542 Kiter/s
> 
> I've limited the number of spinners to 1 but feel that perhaps this 
> should be configurable as locks with very short hold times could benefit 
> from up to NR_CPUS-1 spinners.
> 
> I'd really appreciate any data, or just general insight, you may have 
> acquired while implementing adaptive spinning for rt-mutexes and 
> mutexes. Open questions for me regarding conditions where adaptive 
> spinning helps are:
> 
> o What type of lock hold times do we expect to benefit?
> o How much contention is a good match for adaptive spinning?
>   - this is related to the number of threads to run in the test
> o How many spinners should be allowed?
> 
> I can share the kernel patches if people are interested, but they are 
> really early, and I'm not sure they are of much value until I better 
> understand the conditions where this is expected to be useful.
> 
> Thanks,
> 


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-03-31 23:38 ` Steven Rostedt
  2010-04-01  0:17   ` Peter W. Morreale
@ 2010-04-01  2:13   ` Darren Hart
  2010-04-01  2:25     ` Steven Rostedt
  1 sibling, 1 reply; 24+ messages in thread
From: Darren Hart @ 2010-04-01  2:13 UTC (permalink / raw)
  To: rostedt
  Cc: lkml,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Peter Morreale, Thomas Gleixner, Ingo Molnar, Eric Dumazet,
	Chris Mason

Steven Rostedt wrote:
> On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:
> 
>> o What type of lock hold times do we expect to benefit?
> 
> 0 (that's a zero) :-p
> 
> I haven't seen your patches but you are not doing a heuristic approach,
> are you? That is, do not "spin" hoping the lock will suddenly become
> free. I was against that for -rt and I would be against that for futex
> too.

I'm not sure what you're getting at here. Adaptive spinning is indeed 
hoping the lock will become free while you are spinning and checking 
its owner...

> 
>> o How much contention is a good match for adaptive spinning?
>>    - this is related to the number of threads to run in the test
>> o How many spinners should be allowed?
>>
>> I can share the kernel patches if people are interested, but they are 
>> really early, and I'm not sure they are of much value until I better 
>> understand the conditions where this is expected to be useful.
> 
> Again, I don't know how you implemented your adaptive spinners, but the
> trick to it in -rt was that it would only spin while the owner of the
> lock was actually running. If it was not running, it would sleep. No
> point waiting for a sleeping task to release its lock.

It does exactly this.

> Is this what you did? Because, IIRC, this only benefited spinlocks
> converted to mutexes. It did not help with semaphores, because
> semaphores could be held for a long time. Thus, it was good for short
> held locks, but hurt performance on long held locks.

Trouble is, I'm still seeing performance penalties even on the shortest 
critical section possible (lock();unlock();)

> If userspace is going to do this, I guess the blocked task would need to
> go into kernel, and spin there (with preempt enabled) if the task is
> still active and holding the lock.

It is currently under preempt_disable() just like mutexes. I asked Peter 
why it was done that way for mutexes, but didn't really get an answer. 
He did point out that since we check need_resched() at every iteration 
that we won't run longer than our timeslice anyway, so it shouldn't be a 
problem.
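That bound can be sketched as follows; need_resched_model() is a hypothetical stand-in (a simple countdown) for the real need_resched() check, which is what keeps the spin from outrunning the spinner's own timeslice:

```c
#include <stdatomic.h>

static atomic_int lock_held;           /* 1 while the owner holds the lock */

/* Stand-in for need_resched(): pretend the timeslice allows only
 * `*budget` more spin iterations. */
static int need_resched_model(long *budget)
{
	return --(*budget) <= 0;
}

/* Spin until the lock is seen free (return 1) or we hit the modeled
 * resched point and must stop spinning (return 0). */
static int bounded_spin(long budget)
{
	while (atomic_load(&lock_held)) {
		if (need_resched_model(&budget))
			return 0;
	}
	return 1;
}
```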

> Then the application would need to determine which to use. An adaptive
> spinner for short held locks, and a normal futex for long held locks.

Yes, this was intended to be an optional thing (and certainly not the 
default).


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01  0:17   ` Peter W. Morreale
@ 2010-04-01  2:25     ` Darren Hart
  2010-04-03 18:00       ` john cooper
  2010-04-03 17:51     ` john cooper
  1 sibling, 1 reply; 24+ messages in thread
From: Darren Hart @ 2010-04-01  2:25 UTC (permalink / raw)
  To: Peter W. Morreale
  Cc: rostedt, lkml,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet, Chris Mason

Peter W. Morreale wrote:
> On Wed, 2010-03-31 at 19:38 -0400, Steven Rostedt wrote:
>> On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:
>>
>>> o What type of lock hold times do we expect to benefit?
>> 0 (that's a zero) :-p
>>
>> I haven't seen your patches but you are not doing a heuristic approach,
>> are you? That is, do not "spin" hoping the lock will suddenly become
>> free. I was against that for -rt and I would be against that for futex
>> too.
>>
>>> o How much contention is a good match for adaptive spinning?
>>>    - this is related to the number of threads to run in the test
>>> o How many spinners should be allowed?
>>>
>>> I can share the kernel patches if people are interested, but they are 
>>> really early, and I'm not sure they are of much value until I better 
>>> understand the conditions where this is expected to be useful.
>> Again, I don't know how you implemented your adaptive spinners, but the
>> trick to it in -rt was that it would only spin while the owner of the
>> lock was actually running. If it was not running, it would sleep. No
>> point waiting for a sleeping task to release its lock.
> 
> Right.  This was *critical* for the adaptive rtmutex.   Note in the RT
> patch, everybody spins as long as the current owner is on CPU.

Everybody spins? Really? For RT tasks I suppose that makes sense, as 
they will sort out the priority for themselves, and if they preempt the 
owner they will all immediately schedule out and boost the priority of 
the owner... but then we lose the benefit of spinning, since we just put 
everyone to sleep. I'll have to take a look at that and see what I'm 
missing.


> FWIW, IIRC, Solaris has a heuristic approach where incoming tasks spin
> for a period of time before going to sleep.  (Cray UNICOS did the same) 

I suppose a heuristic approach could still be used so long as continued 
spinning was conditional on the owner continuing to run on a CPU.


>> Is this what you did? Because, IIRC, this only benefited spinlocks
>> converted to mutexes. It did not help with semaphores, because
>> semaphores could be held for a long time. Thus, it was good for short
>> held locks, but hurt performance on long held locks.
>>
> 
> nod.  The entire premise was based on the fact that we were converting
> spinlocks, which by definition were short held locks.  What I found
> during early development was that the sleep/wakeup cycle was more
> intrusive for RT than spinning.  

Right, and I'm looking to provide some kernel assistance for userspace 
spinlocks here, and am targeting short-lived critical sections as well.

> 
> IIRC, I measured something like 380k context switches/second prior to
> the adaptive patches for a dbench test and we cut this down to somewhere
> around 50k, with a corresponding increase in throughput.  (I can't
> remember specific numbers any more, it was a while ago... ;-)
> 
> When applied to semaphores, the benefit was in the noise range as I
> recall..
> 
> (dbench was chosen due to the heavy contention on the dcache spinlock) 

Interesting, thanks for the input.

--
Darren

> 
> 
> Best,
> -PWM
> 
> 
>> If userspace is going to do this, I guess the blocked task would need to
>> go into kernel, and spin there (with preempt enabled) if the task is
>> still active and holding the lock.
>>
>> Then the application would need to determine which to use. An adaptive
>> spinner for short held locks, and a normal futex for long held locks.
>>
>> -- Steve
>>
>>
> 
> 


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01  2:13   ` Darren Hart
@ 2010-04-01  2:25     ` Steven Rostedt
  2010-04-01  5:15       ` Darren Hart
  2010-04-04  1:50       ` Rik van Riel
  0 siblings, 2 replies; 24+ messages in thread
From: Steven Rostedt @ 2010-04-01  2:25 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Peter Morreale, Thomas Gleixner, Ingo Molnar, Eric Dumazet,
	Chris Mason

On Wed, 2010-03-31 at 19:13 -0700, Darren Hart wrote:
> Steven Rostedt wrote:
> > On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:
> > 
> >> o What type of lock hold times do we expect to benefit?
> > 
> > 0 (that's a zero) :-p
> > 
> > I haven't seen your patches but you are not doing a heuristic approach,
> > are you? That is, do not "spin" hoping the lock will suddenly become
> > free. I was against that for -rt and I would be against that for futex
> > too.
> 
> I'm not sure what you're getting at here. Adaptive spinning is indeed 
> hoping the lock will become free while you are spinning and checking 
> its owner...

I'm talking about the original idea people had of "let's spin for 50us
and hope it is unlocked before then", which I thought was not a good
idea.


> 
> > 
> >> o How much contention is a good match for adaptive spinning?
> >>    - this is related to the number of threads to run in the test
> >> o How many spinners should be allowed?
> >>
> >> I can share the kernel patches if people are interested, but they are 
> >> really early, and I'm not sure they are of much value until I better 
> >> understand the conditions where this is expected to be useful.
> > 
> > Again, I don't know how you implemented your adaptive spinners, but the
> > trick to it in -rt was that it would only spin while the owner of the
> > lock was actually running. If it was not running, it would sleep. No
> > point waiting for a sleeping task to release its lock.
> 
> It does exactly this.

OK, that's good.

> 
> > Is this what you did? Because, IIRC, this only benefited spinlocks
> > converted to mutexes. It did not help with semaphores, because
> > semaphores could be held for a long time. Thus, it was good for short
> > held locks, but hurt performance on long held locks.
> 
> Trouble is, I'm still seeing performance penalties even on the shortest 
> critical section possible (lock();unlock();)

performance penalties compared to what? not having adaptive at all?

> 
> > If userspace is going to do this, I guess the blocked task would need to
> > go into kernel, and spin there (with preempt enabled) if the task is
> > still active and holding the lock.
> 
> It is currently under preempt_disable() just like mutexes. I asked Peter 
> why it was done that way for mutexes, but didn't really get an answer. 
> He did point out that since we check need_resched() at every iteration 
> that we won't run longer than our timeslice anyway, so it shouldn't be a 
> problem.

Sure it's not a problem ;-)

-- Steve



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01  2:25     ` Steven Rostedt
@ 2010-04-01  5:15       ` Darren Hart
  2010-04-01 12:46         ` Gregory Haskins
  2010-04-04  1:50       ` Rik van Riel
  1 sibling, 1 reply; 24+ messages in thread
From: Darren Hart @ 2010-04-01  5:15 UTC (permalink / raw)
  To: rostedt
  Cc: lkml,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Peter Morreale, Thomas Gleixner, Ingo Molnar, Eric Dumazet,
	Chris Mason

Steven Rostedt wrote:
> On Wed, 2010-03-31 at 19:13 -0700, Darren Hart wrote:
>> Steven Rostedt wrote:
>>> On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:
>>>
>>>> o What type of lock hold times do we expect to benefit?
>>> 0 (that's a zero) :-p
>>>
>>> I haven't seen your patches but you are not doing a heuristic approach,
>>> are you? That is, do not "spin" hoping the lock will suddenly become
>>> free. I was against that for -rt and I would be against that for futex
>>> too.
>> I'm not sure what you're getting at here. Adaptive spinning is indeed 
>> hoping the lock will become free while you are spinning and checking 
>> its owner...
> 
> I'm talking about the original idea people had of "lets spin for 50us
> and hope it is unlocked before then", which I thought was not a good
> idea.
> 
> 
>>>> o How much contention is a good match for adaptive spinning?
>>>>    - this is related to the number of threads to run in the test
>>>> o How many spinners should be allowed?
>>>>
>>>> I can share the kernel patches if people are interested, but they are 
>>>> really early, and I'm not sure they are of much value until I better 
>>>> understand the conditions where this is expected to be useful.
>>> Again, I don't know how you implemented your adaptive spinners, but the
>>> trick to it in -rt was that it would only spin while the owner of the
>>> lock was actually running. If it was not running, it would sleep. No
>>> point waiting for a sleeping task to release its lock.
>> It does exactly this.
> 
> OK, that's good.
> 
>>> Is this what you did? Because, IIRC, this only benefited spinlocks
>>> converted to mutexes. It did not help with semaphores, because
>>> semaphores could be held for a long time. Thus, it was good for short
>>> held locks, but hurt performance on long held locks.
>> Trouble is, I'm still seeing performance penalties even on the shortest 
>> critical section possible (lock();unlock();)
> 
> performance penalties compared to what? not having adaptive at all?

Right. See the data in the original mail:


          futex_lock: Result: 635 Kiter/s
futex_lock_adaptive: Result: 542 Kiter/s

So 15% fewer lock/unlock iterations per second with in-kernel adaptive 
spinning enabled for a critical section approaching 0 in length. But if 
we agree I'm taking the right approach, then it's time for me to polish 
things up a bit and send them out for review.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01  5:15       ` Darren Hart
@ 2010-04-01 12:46         ` Gregory Haskins
  0 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2010-04-01 12:46 UTC (permalink / raw)
  To: rostedt, Darren Hart
  Cc: Ingo Molnar, Eric Dumazet, Peter Zijlstra, Thomas Gleixner,
	Peter Morreale, Sven Dietrich, Chris Mason, lkml

>>> On 4/1/2010 at 01:15 AM, in message <4BB42C04.7090308@us.ibm.com>, Darren Hart
<dvhltc@us.ibm.com> wrote: 
> Steven Rostedt wrote:
>> On Wed, 2010-03-31 at 19:13 -0700, Darren Hart wrote:
>>> Steven Rostedt wrote:
>>>> On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:
>>>>
>>>>> o What type of lock hold times do we expect to benefit?
>>>> 0 (that's a zero) :-p
>>>>
>>>> I haven't seen your patches but you are not doing a heuristic approach,
>>>> are you? That is, do not "spin" hoping the lock will suddenly become
>>>> free. I was against that for -rt and I would be against that for futex
>>>> too.
>>> I'm not sure what you're getting at here. Adaptive spinning is indeed 
>>> hoping the lock will become free while you are spinning and checking 
>>> its owner...
>> 
>> I'm talking about the original idea people had of "lets spin for 50us
>> and hope it is unlocked before then", which I thought was not a good
>> idea.
>> 
>> 
>>>>> o How much contention is a good match for adaptive spinning?
>>>>>    - this is related to the number of threads to run in the test
>>>>> o How many spinners should be allowed?
>>>>>
>>>>> I can share the kernel patches if people are interested, but they are 
>>>>> really early, and I'm not sure they are of much value until I better 
>>>>> understand the conditions where this is expected to be useful.
>>>> Again, I don't know how you implemented your adaptive spinners, but the
>>>> trick to it in -rt was that it would only spin while the owner of the
>>>> lock was actually running. If it was not running, it would sleep. No
>>>> point waiting for a sleeping task to release its lock.
>>> It does exactly this.
>> 
>> OK, that's good.
>> 
>>>> Is this what you did? Because, IIRC, this only benefited spinlocks
>>>> converted to mutexes. It did not help with semaphores, because
>>>> semaphores could be held for a long time. Thus, it was good for short
>>>> held locks, but hurt performance on long held locks.
>>> Trouble is, I'm still seeing performance penalties even on the shortest 
>>> critical section possible (lock();unlock();)
>> 
>> performance penalties compared to what? not having adaptive at all?
> 
> Right. See the data in the original mail:
> 
> 
>           futex_lock: Result: 635 Kiter/s
> futex_lock_adaptive: Result: 542 Kiter/s
> 
> So 15% fewer lock/unlock iterations per second with in-kernel adaptive 
> spinning enabled for a critical section approaching 0 in length. But if 
> we agree I'm taking the right approach, then it's time for me to polish 
> things up a bit and send them out for review.

Hi Darren,

Part of the magic of adaptive locks is the avoidance of the sleep path, and part of it is using the SMP resources in what is effectively a distributed search (e.g. having N cpus actively looking for when they can acquire, versus going to sleep and having the scheduler wake them up later).

I haven't seen your algorithm, but if it's not simply trading the sleep path for searching for acquisition+lockowner changes, that may perhaps be a reason for an observed regression.  For instance, if you have to do substantial work to get to the adaptive part of the algorithm, there could be an unbalanced amount of overhead in selecting this path.

The whole premise is predicated on the following sequence chart:

non-adaptive:

time |      threadA       |       threadB
-----+--------------------+---------------------
     | lock(X) (granted)  |
 t0  |                    | lock(X) (contended)
     |                    | add_wait_list(X, B)
     | unlock(X)          |
     | grant(X, B)        |
     | wake_up(B)         |
     |                    | schedule()
     |                    |     |
     |                    |     |
     |                    |     V
 t1  |                    | schedule() returns
     |                    | (returns from lock())

adaptive:

time |      threadA       |       threadB
-----+--------------------+---------------------
     | lock(X) (granted)  |
 t0  |                    | lock(X) (contended)
     | unlock(X)          |
     | grant(X, B)        |
     |                    | while (!is_granted(X, B) && is_running(lock_owner(X)))
 t1  |                    | (returns from lock())

The idea is that the time interval t0-t1 is shorter in the adaptive case (at least most of the time), and is why the spinning doesn't end up hurting us (the cpu is also busy in the schedule() code otherwise).  This is the win-win scenario for adaptive.  This is generally the case with short-hold locks (as the spinlocks that were converted to mutexes in -rt tend to be).

For cases where the lock is held longer than the scheduling overhead, we will undoubtedly see more cpu utilization than in the non-adaptive case.  However, we may _still_ see performance improvements in some scenarios because the "grant" operation has less overhead (thread A can skip the "wakeup", so thread B still comes out of the loop with less latency than it would otherwise).  In this scenario you are trading cpu cycles for reduced latency, though the performance gains (if any) are not as profound as in the ideal case. 

..and then of course there are cpu-bound workloads with long(er) held locks.   They don't like adaptive-locks much at all ;)

Long story short, I would start by trying to quantify what your "t1-t0" delta actually is for the workload you are observing.  That will give you a ballpark of whether you should expect any gains or not. 

-Greg 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01  2:10 ` Darren Hart
@ 2010-04-01 14:04   ` Chris Mason
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Mason @ 2010-04-01 14:04 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml, ,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Thomas Gleixner,
	Ingo Molnar, Eric Dumazet

On Wed, Mar 31, 2010 at 07:10:50PM -0700, Darren Hart wrote:
> CC'ing the right Chris this time.
> 
> Darren Hart wrote:
> >I'm looking at some adaptive spinning with futexes as a way to
> >help reduce the dependence on sched_yield() to implement userspace
> >spinlocks. Chris, I included you in the CC after reading your
> >comments regarding sched_yield() at kernel summit and I thought
> >you might be interested.
> >
> >I have an experimental patchset that implements FUTEX_LOCK and
> >FUTEX_LOCK_ADAPTIVE in the kernel and use something akin to
> >mutex_spin_on_owner() for the first waiter to spin. What I'm
> >finding is that adaptive spinning actually hurts my particular
> >test case, so I was hoping to poll people for context regarding
> >the existing adaptive spinning implementations in the kernel as to
> >where we see benefit. Under which conditions does adaptive
> >spinning help?
> >
> >I presume locks with a short average hold time stand to gain the
> >most as the longer the lock is held the more likely the spinner
> >will expire its timeslice or that the scheduling gain becomes
> >noise in the acquisition time. My test case simple calls
> >"lock();unlock()" for a fixed number of iterations and reports the
> >iterations per second at the end of the run. It can run with an
> >arbitrary number of threads as well. I typically run with 256
> >threads for 10M iterations.
> >
> >         futex_lock: Result: 635 Kiter/s
> >futex_lock_adaptive: Result: 542 Kiter/s
> >
> >I've limited the number of spinners to 1 but feel that perhaps
> >this should be configurable as locks with very short hold times
> >could benefit from up to NR_CPUS-1 spinners.

We tried something similar in the original adaptive mutex
implementation.  I just went back and reread the threads and the biggest
boost in performance came when we:

1) didn't limit the number of spinners
2) didn't try to be fair to waiters

So, let's say we've spun for a while and given up and tossed a process
onto a wait queue.  One of the mutex iterations would see the process on
the wait queue and nicely hop on behind it.

We ended up changing things to spin regardless of what other processes
were doing, and that made a big difference.  The spinning loops have
cond_resched() sprinkled in important places to make sure we don't keep
the CPU away from the process that actually owns the mutex.

> >
> >I'd really appreciate any data, just general insight, you may have
> >acquired while implementing adaptive spinning for rt-mutexes and
> >mutexes. Open questions for me regarding conditions where adaptive
> >spinning helps are:
> >
> >o What type of lock hold times do we expect to benefit?
> >o How much contention is a good match for adaptive spinning?
> >  - this is related to the number of threads to run in the test
> >o How many spinners should be allowed?
> >

The btrfs benchmarks I was doing on the mutexes had 50 processes on a 4
CPU system, and no limits on the number of spinning processes.  The
locks they were hitting were btree locks that were heavily contended for
each operation.

Most of the time, btrfs is able to take the mutex, do a short operation
and release the mutex without scheduling.

-chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-03-31 23:21 RFC: Ideal Adaptive Spinning Conditions Darren Hart
                   ` (2 preceding siblings ...)
  2010-04-01  2:10 ` Darren Hart
@ 2010-04-01 14:20 ` Avi Kivity
  2010-04-01 15:54   ` Darren Hart
  3 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2010-04-01 14:20 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml, ,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Chris Wright,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet

On 04/01/2010 02:21 AM, Darren Hart wrote:
> I'm looking at some adaptive spinning with futexes as a way to help 
> reduce the dependence on sched_yield() to implement userspace 
> spinlocks. Chris, I included you in the CC after reading your comments 
> regarding sched_yield() at kernel summit and I thought you might be 
> interested.
>
> I have an experimental patchset that implements FUTEX_LOCK and 
> FUTEX_LOCK_ADAPTIVE in the kernel and use something akin to 
> mutex_spin_on_owner() for the first waiter to spin. What I'm finding 
> is that adaptive spinning actually hurts my particular test case, so I 
> was hoping to poll people for context regarding the existing adaptive 
> spinning implementations in the kernel as to where we see benefit. 
> Under which conditions does adaptive spinning help?
>
> I presume locks with a short average hold time stand to gain the most 
> as the longer the lock is held the more likely the spinner will expire 
> its timeslice or that the scheduling gain becomes noise in the 
> acquisition time. My test case simple calls "lock();unlock()" for a 
> fixed number of iterations and reports the iterations per second at 
> the end of the run. It can run with an arbitrary number of threads as 
> well. I typically run with 256 threads for 10M iterations.
>
>          futex_lock: Result: 635 Kiter/s
> futex_lock_adaptive: Result: 542 Kiter/s

A lock(); unlock(); loop spends most of its time with the lock held or 
contended.  Can you try something like this:


    lock();
    for (i = 0; i < 1000; ++i)
         asm volatile ("" : : : "memory");
    unlock();
    for (i = 0; i < 10000; ++i)
         asm volatile ("" : : : "memory");

This simulates a lock hold ratio of 10% with the lock hold time 
exceeding the acquisition time.  It will also be interesting to lower 
both loop bounds.
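Dropped into a minimal self-contained harness, the suggested duty cycle looks roughly like this.  This is a sketch only: a pthread mutex stands in for the futex-based lock under test, `volatile` approximates the asm barrier, the per-second reporting from the real testcase is omitted, and the thread/iteration counts are illustrative.

```c
#include <pthread.h>

#define THREADS     4
#define ITERATIONS  1000
#define HOLD_LOOPS  1000    /* busy work inside the critical section  */
#define REST_LOOPS  10000   /* busy work outside the critical section */

static pthread_mutex_t bench_lock = PTHREAD_MUTEX_INITIALIZER;
static long bench_total;            /* protected by bench_lock */

static void *bench_worker(void *arg)
{
    volatile int i;                 /* volatile stands in for the asm barrier */
    int n;

    (void)arg;
    for (n = 0; n < ITERATIONS; n++) {
        pthread_mutex_lock(&bench_lock);
        for (i = 0; i < HOLD_LOOPS; ++i)
            ;                       /* ~10% of the cycle spent holding */
        bench_total++;
        pthread_mutex_unlock(&bench_lock);
        for (i = 0; i < REST_LOOPS; ++i)
            ;                       /* ~90% spent uncontended */
    }
    return NULL;
}

/* Runs the workload and returns the total iteration count. */
static long run_benchmark(void)
{
    pthread_t t[THREADS];
    int n;

    bench_total = 0;
    for (n = 0; n < THREADS; n++)
        pthread_create(&t[n], NULL, bench_worker, NULL);
    for (n = 0; n < THREADS; n++)
        pthread_join(t[n], NULL);
    return bench_total;
}
```

Timing the run and dividing the returned total by elapsed seconds gives the Kiter/s figure the testcase reports.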

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01 14:20 ` Avi Kivity
@ 2010-04-01 15:54   ` Darren Hart
  2010-04-01 16:10     ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Darren Hart @ 2010-04-01 15:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: lkml, ,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Chris Wright,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet, Chris Mason

Avi Kivity wrote:
> On 04/01/2010 02:21 AM, Darren Hart wrote:
>> I'm looking at some adaptive spinning with futexes as a way to help 
>> reduce the dependence on sched_yield() to implement userspace 
>> spinlocks. Chris, I included you in the CC after reading your comments 
>> regarding sched_yield() at kernel summit and I thought you might be 
>> interested.
>>
>> I have an experimental patchset that implements FUTEX_LOCK and 
>> FUTEX_LOCK_ADAPTIVE in the kernel and use something akin to 
>> mutex_spin_on_owner() for the first waiter to spin. What I'm finding 
>> is that adaptive spinning actually hurts my particular test case, so I 
>> was hoping to poll people for context regarding the existing adaptive 
>> spinning implementations in the kernel as to where we see benefit. 
>> Under which conditions does adaptive spinning help?
>>
>> I presume locks with a short average hold time stand to gain the most 
>> as the longer the lock is held the more likely the spinner will expire 
>> its timeslice or that the scheduling gain becomes noise in the 
>> acquisition time. My test case simple calls "lock();unlock()" for a 
>> fixed number of iterations and reports the iterations per second at 
>> the end of the run. It can run with an arbitrary number of threads as 
>> well. I typically run with 256 threads for 10M iterations.
>>
>>          futex_lock: Result: 635 Kiter/s
>> futex_lock_adaptive: Result: 542 Kiter/s
> 
> A lock(); unlock(); loop spends most of its time with the lock held or 
> contended.  Can you something like this:
> 
> 
>    lock();
>    for (i = 0; i < 1000; ++i)
>         asm volatile ("" : : : "memory");
>    unlock();
>    for (i = 0; i < 10000; ++i)
>         asm volatile ("" : : : "memory");


Great idea. I'll be doing a more rigorous investigation on this of 
course, but I thought I'd share the results of just dumping this into 
the testcase:

# ./futex_lock -i10000000
futex_lock: Measure FUTEX_LOCK operations per second
	Arguments: iterations=10000000 threads=256 adaptive=0
Result: 420 Kiter/s
lock calls:      9999872
lock syscalls:   665824 (6.66%)
unlock calls:    9999872
unlock syscalls: 861240 (8.61%)

# ./futex_lock -a -i10000000
futex_lock: Measure FUTEX_LOCK operations per second
	Arguments: iterations=10000000 threads=256 adaptive=1
Result: 426 Kiter/s
lock calls:      9999872
lock syscalls:   558787 (5.59%)
unlock calls:    9999872
unlock syscalls: 603412 (6.03%)

This is the first time I've seen adaptive locking have an advantage! The 
second set of runs showed a slightly greater advantage. Note that this 
was still with spinners being limited to one.

My thanks to everyone for their insight. I'll be preparing some result 
matrices and will share the patches and testcases here shortly.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01 15:54   ` Darren Hart
@ 2010-04-01 16:10     ` Avi Kivity
  2010-04-01 17:10       ` Darren Hart
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2010-04-01 16:10 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml, ,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Chris Wright,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet, Chris Mason

On 04/01/2010 06:54 PM, Darren Hart wrote:
>> A lock(); unlock(); loop spends most of its time with the lock held 
>> or contended.  Can you something like this:
>>
>>
>>    lock();
>>    for (i = 0; i < 1000; ++i)
>>         asm volatile ("" : : : "memory");
>>    unlock();
>>    for (i = 0; i < 10000; ++i)
>>         asm volatile ("" : : : "memory");
>
>
>
> Great idea. I'll be doing a more rigorous investigation on this of 
> course, but I thought I'd share the results of just dumping this into 
> the testcase:
>
> # ./futex_lock -i10000000
> futex_lock: Measure FUTEX_LOCK operations per second
>     Arguments: iterations=10000000 threads=256 adaptive=0
> Result: 420 Kiter/s
> lock calls:      9999872
> lock syscalls:   665824 (6.66%)
> unlock calls:    9999872
> unlock syscalls: 861240 (8.61%)
>
> # ./futex_lock -a -i10000000
> futex_lock: Measure FUTEX_LOCK operations per second
>     Arguments: iterations=10000000 threads=256 adaptive=1
> Result: 426 Kiter/s
> lock calls:      9999872
> lock syscalls:   558787 (5.59%)
> unlock calls:    9999872
> unlock syscalls: 603412 (6.03%)
>
> This is the first time I've seen adaptive locking have an advantage! 
> The second set of runs showed a slightly greater advantage. Note that 
> this was still with spinners being limited to one.

Question - do all threads finish at the same time, or wildly different 
times?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01  2:03   ` Darren Hart
@ 2010-04-01 17:02     ` Chris Wright
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Wright @ 2010-04-01 17:02 UTC (permalink / raw)
  To: Darren Hart
  Cc: Roland Dreier, lkml, ,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Chris Wright,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet

* Darren Hart (dvhltc@us.ibm.com) wrote:
> Roland Dreier wrote:
>>  > I'm looking at some adaptive spinning with futexes as a way to help
>>  > reduce the dependence on sched_yield() to implement userspace
>>  > spinlocks. Chris, I included you in the CC after reading your comments
>>  > regarding sched_yield() at kernel summit and I thought you might be
>>  > interested.
>>
>> I think you may have the wrong Chris.  Chris Wright wasn't at the most
>> recent kernel summit and I do have a vague recollection of Chris Mason
>> mentioning that Oracle developers found that sched_yield() performed
>> better than futexes + kernel scheduling.
>
> Oh bother, Sorry Chris! I even talked to Chris Wright about this already  
> in person and he told me I probably was looking for Chris Mason, and I  
> still got the last names crossed up!

No worries Darren, I happen to be interested in the topic too ;-)
(from KVM point of view, which has a slightly different take)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01 16:10     ` Avi Kivity
@ 2010-04-01 17:10       ` Darren Hart
  2010-04-01 17:15         ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Darren Hart @ 2010-04-01 17:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: lkml, ,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Chris Wright,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet, Chris Mason

Avi Kivity wrote:
> On 04/01/2010 06:54 PM, Darren Hart wrote:
>>> A lock(); unlock(); loop spends most of its time with the lock held 
>>> or contended.  Can you something like this:
>>>
>>>
>>>    lock();
>>>    for (i = 0; i < 1000; ++i)
>>>         asm volatile ("" : : : "memory");
>>>    unlock();
>>>    for (i = 0; i < 10000; ++i)
>>>         asm volatile ("" : : : "memory");
>>
>>
>>
>> Great idea. I'll be doing a more rigorous investigation on this of 
>> course, but I thought I'd share the results of just dumping this into 
>> the testcase:
>>
>> # ./futex_lock -i10000000
>> futex_lock: Measure FUTEX_LOCK operations per second
>>     Arguments: iterations=10000000 threads=256 adaptive=0
>> Result: 420 Kiter/s
>> lock calls:      9999872
>> lock syscalls:   665824 (6.66%)
>> unlock calls:    9999872
>> unlock syscalls: 861240 (8.61%)
>>
>> # ./futex_lock -a -i10000000
>> futex_lock: Measure FUTEX_LOCK operations per second
>>     Arguments: iterations=10000000 threads=256 adaptive=1
>> Result: 426 Kiter/s
>> lock calls:      9999872
>> lock syscalls:   558787 (5.59%)
>> unlock calls:    9999872
>> unlock syscalls: 603412 (6.03%)
>>
>> This is the first time I've seen adaptive locking have an advantage! 
>> The second set of runs showed a slightly greater advantage. Note that 
>> this was still with spinners being limited to one.
> 
> Question - do all threads finish at the same time, or wildly different 
> times?

I'm not sure; I can add some fairness metrics to the test that will help 
characterize how that's working. My suspicion is that there will be 
several threads that don't make any progress until the very end, since 
adaptive spinning is an "unfair" locking technique.
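One cheap fairness metric the test could report is Jain's fairness index over the per-thread iteration counts: 1.0 means perfectly even progress, 1/N means one thread did all the work.  This is an illustrative choice on my part, not something the testcase currently prints.

```c
#include <stddef.h>

/* Jain's fairness index over per-thread iteration counts:
 * (sum x)^2 / (n * sum x^2).  Ranges from 1/n (one thread made all
 * the progress) to 1.0 (perfectly even progress). */
static double fairness_index(const unsigned long *iters, size_t n)
{
    double sum = 0.0, sum_sq = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++) {
        sum += (double)iters[i];
        sum_sq += (double)iters[i] * (double)iters[i];
    }
    if (sum_sq == 0.0)
        return 0.0;
    return (sum * sum) / ((double)n * sum_sq);
}
```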

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01 17:10       ` Darren Hart
@ 2010-04-01 17:15         ` Avi Kivity
  0 siblings, 0 replies; 24+ messages in thread
From: Avi Kivity @ 2010-04-01 17:15 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml, ,
	Steven Rostedt, Peter Zijlstra, Gregory Haskins,
	Sven-Thorsten Dietrich, Peter Morreale, Chris Wright,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet, Chris Mason

On 04/01/2010 08:10 PM, Darren Hart wrote:
> Avi Kivity wrote:
>> On 04/01/2010 06:54 PM, Darren Hart wrote:
>>>> A lock(); unlock(); loop spends most of its time with the lock held 
>>>> or contended.  Can you something like this:
>>>>
>>>>
>>>>    lock();
>>>>    for (i = 0; i < 1000; ++i)
>>>>         asm volatile ("" : : : "memory");
>>>>    unlock();
>>>>    for (i = 0; i < 10000; ++i)
>>>>         asm volatile ("" : : : "memory");
>>>
>>>
>>>
>>> Great idea. I'll be doing a more rigorous investigation on this of 
>>> course, but I thought I'd share the results of just dumping this 
>>> into the testcase:
>>>
>>> # ./futex_lock -i10000000
>>> futex_lock: Measure FUTEX_LOCK operations per second
>>>     Arguments: iterations=10000000 threads=256 adaptive=0
>>> Result: 420 Kiter/s
>>> lock calls:      9999872
>>> lock syscalls:   665824 (6.66%)
>>> unlock calls:    9999872
>>> unlock syscalls: 861240 (8.61%)
>>>
>>> # ./futex_lock -a -i10000000
>>> futex_lock: Measure FUTEX_LOCK operations per second
>>>     Arguments: iterations=10000000 threads=256 adaptive=1
>>> Result: 426 Kiter/s
>>> lock calls:      9999872
>>> lock syscalls:   558787 (5.59%)
>>> unlock calls:    9999872
>>> unlock syscalls: 603412 (6.03%)
>>>
>>> This is the first time I've seen adaptive locking have an advantage! 
>>> The second set of runs showed a slightly greater advantage. Note 
>>> that this was still with spinners being limited to one.
>>
>> Question - do all threads finish at the same time, or wildly 
>> different times?
>
> I'm not sure, I can add some fairness metrics to the test that will 
> help characterize how that's working. My suspicion is that there will 
> be several threads that don't make any progress until the very end - 
> since adaptive spinning is an "unfair" locking technique.
>

Well, if the amount of unfairness differs between the tests (unfair 
unfairness?) then you may see results that are not directly related to 
spin vs yield.  You need to make the test more self-regulating so the 
results are more repeatable (yet not make it a round-robin test).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01  0:17   ` Peter W. Morreale
  2010-04-01  2:25     ` Darren Hart
@ 2010-04-03 17:51     ` john cooper
  1 sibling, 0 replies; 24+ messages in thread
From: john cooper @ 2010-04-03 17:51 UTC (permalink / raw)
  To: Peter W. Morreale
  Cc: rostedt, Darren Hart, lkml,,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Chris Wright, Thomas Gleixner, Ingo Molnar, Eric Dumazet,
	john cooper

Peter W. Morreale wrote:

> Right.  This was *critical* for the adaptive rtmutex.   Note in the RT
> patch, everybody spins as long as the current owner is on CPU.   
> 
> FWIW, IIRC, Solaris has a heuristic approach where incoming tasks spin
> for a period of time before going to sleep.  (Cray UINCOS did the same) 

IIRC Solaris mutexes are declared either simple spin or adaptive,
an acquisition attempt of the latter only checking the hold
status of the mutex and if held the owner's run status before
making the spin vs. block decision.

I don't believe mixing of the two mutex types within a given path
was permissible as a previously acquired simple spin mutex could
remain held when a subsequent adaptive mutex decided to block.
Although presumably an elevated IPL could have been sufficient to
flag that scenario.

-john
  
-- 
john.cooper@third-harmonic.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01  2:25     ` Darren Hart
@ 2010-04-03 18:00       ` john cooper
  2010-04-05 14:06         ` Darren Hart
  0 siblings, 1 reply; 24+ messages in thread
From: john cooper @ 2010-04-03 18:00 UTC (permalink / raw)
  To: Darren Hart
  Cc: Peter W. Morreale, rostedt, lkml,,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet, Chris Mason,
	john cooper

Darren Hart wrote:
> Right, and I'm looking to provide some kernel assistance for userspace
> spinlocks here, and am targeting short lived critical sections as well.

What did you have in mind beyond existing mechanisms
which address sibling contention?

One scenario which AFAICT isn't yet addressed is that
of a userspace spin lock holder taking a scheduling
preemption which may result in other threads piling up
on the lock orders of magnitude beyond normal wait times,
until the lock holder is rescheduled.  

-john

-- 
john.cooper@third-harmonic.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-01  2:25     ` Steven Rostedt
  2010-04-01  5:15       ` Darren Hart
@ 2010-04-04  1:50       ` Rik van Riel
  2010-04-04 15:06         ` Peter W. Morreale
  2010-04-05 14:10         ` Darren Hart
  1 sibling, 2 replies; 24+ messages in thread
From: Rik van Riel @ 2010-04-04  1:50 UTC (permalink / raw)
  To: rostedt
  Cc: Darren Hart, lkml,,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Peter Morreale, Thomas Gleixner, Ingo Molnar, Eric Dumazet,
	Chris Mason, john cooper

On 03/31/2010 10:25 PM, Steven Rostedt wrote:
> On Wed, 2010-03-31 at 19:13 -0700, Darren Hart wrote:
>> Steven Rostedt wrote:
>>> On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:
>>>
>>>> o What type of lock hold times do we expect to benefit?
>>>
>>> 0 (that's a zero) :-p
>>>
>>> I haven't seen your patches but you are not doing a heuristic approach,
>>> are you? That is, do not "spin" hoping the lock will suddenly become
>>> free. I was against that for -rt and I would be against that for futex
>>> too.
>>
>> I'm not sure what you're getting at here. Adaptive spinning is indeed
>> hoping the lock will become free while you are spinning and checking
>> it's owner...
>
> I'm talking about the original idea people had of "lets spin for 50us
> and hope it is unlocked before then", which I thought was not a good
> idea.

Maybe not a good idea when running on bare metal, but it
could be a big help when running virtualized.

A lock with a short hold time can, occasionally, have a
very long hold time, when the VCPU holding the lock is
preempted by the host/hypervisor.

An adaptive lock would spin-and-acquire if the lock holder
is running, while turning into a sleep lock when the lock
holder has been preempted.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-04  1:50       ` Rik van Riel
@ 2010-04-04 15:06         ` Peter W. Morreale
  2010-04-05 14:10         ` Darren Hart
  1 sibling, 0 replies; 24+ messages in thread
From: Peter W. Morreale @ 2010-04-04 15:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: rostedt, Darren Hart, lkml,,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet, Chris Mason,
	john cooper

On Sat, 2010-04-03 at 21:50 -0400, Rik van Riel wrote:
> On 03/31/2010 10:25 PM, Steven Rostedt wrote:
> > On Wed, 2010-03-31 at 19:13 -0700, Darren Hart wrote:
> >> Steven Rostedt wrote:
> >>> On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:
> >>>
> >>>> o What type of lock hold times do we expect to benefit?
> >>>
> >>> 0 (that's a zero) :-p
> >>>
> >>> I haven't seen your patches but you are not doing a heuristic approach,
> >>> are you? That is, do not "spin" hoping the lock will suddenly become
> >>> free. I was against that for -rt and I would be against that for futex
> >>> too.
> >>
> >> I'm not sure what you're getting at here. Adaptive spinning is indeed
> >> hoping the lock will become free while you are spinning and checking
> >> it's owner...
> >
> > I'm talking about the original idea people had of "lets spin for 50us
> > and hope it is unlocked before then", which I thought was not a good
> > idea.
> 
> Maybe not a good idea when running on bare metal, but it
> could be a big help when running virtualized.
> 
> A lock with a short hold time can, occasionally, have a
> very long hold time, when the VCPU holding the lock is
> preempted by the host/hypervisor.
> 
> An adaptive lock would spin-and-acquire if the lock holder
> is running, while turning into a sleep lock when the lock
> holder has been preempted.
> 

Which is precisely what the RT variant does.  Each iteration of the
'spin' loop verifies that the lock owner is on CPU.   If the owner is
not, all other tasks stop spinning and sleep.

So how does a timeout help in the VCPU case?  

Best,
-PWM






^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-03 18:00       ` john cooper
@ 2010-04-05 14:06         ` Darren Hart
  0 siblings, 0 replies; 24+ messages in thread
From: Darren Hart @ 2010-04-05 14:06 UTC (permalink / raw)
  To: john cooper
  Cc: Peter W. Morreale, rostedt, lkml,,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Thomas Gleixner, Ingo Molnar, Eric Dumazet, Chris Mason,
	john cooper

john cooper wrote:
> Darren Hart wrote:
>> Right, and I'm looking to provide some kernel assistance for userspace
>> spinlocks here, and am targeting short lived critical sections as well.
> 
> What did you have in mind beyond existing mechanisms
> which address sibling contention?
> 
> One scenario which AFAICT isn't yet addressed is that
> of a userspace spin lock holder taking a scheduling
> preemption which may result in other threads piling up
> on the lock orders of magnitude beyond normal wait times,
> until the lock holder is rescheduled.  

That is an excellent example.

Another is the highly fragile performance characteristics that spinlocks 
implemented with sched_yield() lead to. As sched_yield() 
implementations change, the scheduling behavior of the spinning tasks 
also changes. As the number of cores grows, more performance tuning is 
required. sched_yield() essentially allows the spinner to spin for a 
time and then get off the cpu for a time - but it has no idea about the 
state of the lock owner, and it is pure chance whether the spinning 
task will schedule back in at an opportune time or just add to the 
scheduling overhead and consume CPU resources the owner needs to make 
progress.

The idea here is to leverage the additional information we have in the 
kernel to make more intelligent decisions about how long to spin (as 
well as how many tasks should spin).

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: RFC: Ideal Adaptive Spinning Conditions
  2010-04-04  1:50       ` Rik van Riel
  2010-04-04 15:06         ` Peter W. Morreale
@ 2010-04-05 14:10         ` Darren Hart
  1 sibling, 0 replies; 24+ messages in thread
From: Darren Hart @ 2010-04-05 14:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: rostedt, lkml,,
	Peter Zijlstra, Gregory Haskins, Sven-Thorsten Dietrich,
	Peter Morreale, Thomas Gleixner, Ingo Molnar, Eric Dumazet,
	Chris Mason, john cooper

Rik van Riel wrote:
> On 03/31/2010 10:25 PM, Steven Rostedt wrote:
>> On Wed, 2010-03-31 at 19:13 -0700, Darren Hart wrote:
>>> Steven Rostedt wrote:
>>>> On Wed, 2010-03-31 at 16:21 -0700, Darren Hart wrote:
>>>>
>>>>> o What type of lock hold times do we expect to benefit?
>>>>
>>>> 0 (that's a zero) :-p
>>>>
>>>> I haven't seen your patches but you are not doing a heuristic approach,
>>>> are you? That is, do not "spin" hoping the lock will suddenly become
>>>> free. I was against that for -rt and I would be against that for futex
>>>> too.
>>>
>>> I'm not sure what you're getting at here. Adaptive spinning is indeed
>>> hoping the lock will become free while you are spinning and checking
>>> it's owner...
>>
>> I'm talking about the original idea people had of "lets spin for 50us
>> and hope it is unlocked before then", which I thought was not a good
>> idea.
> 
> Maybe not a good idea when running on bare metal, but it
> could be a big help when running virtualized.
> 
> A lock with a short hold time can, occasionally, have a
> very long hold time, when the VCPU holding the lock is
> preempted by the host/hypervisor.
> 
> An adaptive lock would spin-and-acquire if the lock holder
> is running, while turning into a sleep lock when the lock
> holder has been preempted.

Right, Steven was referring to a braindead 50us spin, regardless of the 
state of the owner. This is the kind of spinning userspace spinlocks 
currently implement because they have no information regarding the state 
of the owner. By adding adaptive spinning to the kernel, these userspace 
locks can make more intelligent spinning decisions.


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2010-04-05 14:11 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-31 23:21 RFC: Ideal Adaptive Spinning Conditions Darren Hart
2010-03-31 23:35 ` Roland Dreier
2010-04-01  2:03   ` Darren Hart
2010-04-01 17:02     ` Chris Wright
2010-03-31 23:38 ` Steven Rostedt
2010-04-01  0:17   ` Peter W. Morreale
2010-04-01  2:25     ` Darren Hart
2010-04-03 18:00       ` john cooper
2010-04-05 14:06         ` Darren Hart
2010-04-03 17:51     ` john cooper
2010-04-01  2:13   ` Darren Hart
2010-04-01  2:25     ` Steven Rostedt
2010-04-01  5:15       ` Darren Hart
2010-04-01 12:46         ` Gregory Haskins
2010-04-04  1:50       ` Rik van Riel
2010-04-04 15:06         ` Peter W. Morreale
2010-04-05 14:10         ` Darren Hart
2010-04-01  2:10 ` Darren Hart
2010-04-01 14:04   ` Chris Mason
2010-04-01 14:20 ` Avi Kivity
2010-04-01 15:54   ` Darren Hart
2010-04-01 16:10     ` Avi Kivity
2010-04-01 17:10       ` Darren Hart
2010-04-01 17:15         ` Avi Kivity
