* semaphore and mutex in current Linux kernel (3.2.2)
@ 2012-04-01  9:56 Chen, Dennis (SRDC SW)
  2012-04-01 12:19 ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Dennis (SRDC SW) @ 2012-04-01  9:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo

Documentation/mutex-design.txt:

"- 'struct mutex' is smaller on most architectures: E.g. on x86,
   'struct semaphore' is 20 bytes, 'struct mutex' is 16 bytes.
   A smaller structure size means less RAM footprint, and better
   CPU-cache utilization."
================================================================
Now in my x86-64 32-bit Linux environment, 'struct semaphore' is 16 bytes and
'struct mutex' is 20 bytes. So it seems the RAM footprint advantage is no longer there...

As for the performance advantages that follow, I don't have the ./test-mutex tool, and 
perhaps not the right testing environment either, so I have no first-hand data on this item...





^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-01  9:56 semaphore and mutex in current Linux kernel (3.2.2) Chen, Dennis (SRDC SW)
@ 2012-04-01 12:19 ` Ingo Molnar
  2012-04-02 15:28   ` Chen, Dennis (SRDC SW)
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2012-04-01 12:19 UTC (permalink / raw)
  To: Chen, Dennis (SRDC SW); +Cc: linux-kernel, mingo


* Chen, Dennis (SRDC SW) <Dennis1.Chen@amd.com> wrote:

> Documentation/mutex-design.txt:
> 
> "- 'struct mutex' is smaller on most architectures: E.g. on x86,
>    'struct semaphore' is 20 bytes, 'struct mutex' is 16 bytes.
>    A smaller structure size means less RAM footprint, and better
>    CPU-cache utilization."
> ================================================================
>
> Now in my x86-64 32-bit Linux environment, 'struct semaphore' 
> is 16 bytes and 'struct mutex' is 20 bytes. So it seems the RAM 
> footprint advantage is no longer there...

It got larger due to the adaptive spin-mutex performance 
optimization.

> As for the performance advantages that follow, I don't have the 
> ./test-mutex tool, and perhaps not the right testing environment 
> either, so I have no first-hand data on this item...

Well, a way to reproduce that would be to find a lock_mutex 
intense workload ('perf top -g', etc.), and then changing back 
the underlying mutex to a semaphore, and measure the performance 
of the two primitives.

Thanks,

	Ingo


* RE: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-01 12:19 ` Ingo Molnar
@ 2012-04-02 15:28   ` Chen, Dennis (SRDC SW)
  2012-04-03  7:52     ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Dennis (SRDC SW) @ 2012-04-02 15:28 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, mingo

On Sun, Apr 1, 2012 at 8:19 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Chen, Dennis (SRDC SW) <Dennis1.Chen@amd.com> wrote:
>
>> Documentation/mutex-design.txt:
>>
>> "- 'struct mutex' is smaller on most architectures: E.g. on x86,
>>    'struct semaphore' is 20 bytes, 'struct mutex' is 16 bytes.
>>    A smaller structure size means less RAM footprint, and better
>>    CPU-cache utilization."
>> ================================================================
>>
>> Now in my x86-64 32-bit Linux environment, 'struct semaphore'
>> is 16 bytes and 'struct mutex' is 20 bytes. So it seems the RAM
>> footprint advantage is no longer there...
>
> It got larger due to the adaptive spin-mutex performance
> optimization.
>

Thanks Ingo. So that part of the document is obsolete: "less RAM footprint" and "tighter code"
are no longer advantages of the mutex but have become a disadvantage, and the spin-mutex
performance optimization is expected to make up for it...

>> As for the performance advantages that follow, I don't have the
>> ./test-mutex tool, and perhaps not the right testing environment
>> either, so I have no first-hand data on this item...
>
> Well, a way to reproduce that would be to find a lock_mutex
> intense workload ('perf top -g', etc.), and then changing back
> the underlying mutex to a semaphore, and measure the performance
> of the two primitives.
>
Why not use the 'test-mutex' tool mentioned in the document? 
I guess it is a private tool of yours; if you can share it with me, I can run
a new round of performance checks of the two primitives on the latest kernel... make a deal?
 



* Re: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-02 15:28   ` Chen, Dennis (SRDC SW)
@ 2012-04-03  7:52     ` Ingo Molnar
  2012-04-05  8:37       ` Chen, Dennis (SRDC SW)
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2012-04-03  7:52 UTC (permalink / raw)
  To: Chen, Dennis (SRDC SW); +Cc: linux-kernel, mingo


* Chen, Dennis (SRDC SW) <Dennis1.Chen@amd.com> wrote:

> > Well, a way to reproduce that would be to find a lock_mutex
> > intense workload ('perf top -g', etc.), and then changing back
> > the underlying mutex to a semaphore, and measure the performance
> > of the two primitives.
>
> Why not use the 'test-mutex' tool mentioned in the document? 
>
> I guess it is a private tool of yours; if you can share it 
> with me, I can run a new round of performance checks of the 
> two primitives on the latest kernel... make a deal?

I think I posted it back then - but IIRC it was really just 
open-coded mutex fastpath executed in user-space by a couple of 
threads. To do that today you'd have to create it anew: just 
copy the current mutex fast-path to user-space and measure it.

I'm not sure what the point of comparative measurements with 
semaphores would be: for example we don't have per architecture 
optimized semaphores anymore, we switched the legacy semaphores 
to a generic version and are phasing them out.

Mutexes have various advantages (such as lockdep coverage and in 
general tighter semantics that makes their usage more robust) 
and we aren't going back to semaphores.

What would make a ton of sense would be to create a 'perf bench' 
module that would use the kernel's mutex code and would measure 
it in user-space. 'perf bench mem' already does a simplified 
form of that: it measures the kernel's memcpy and memset 
routines:

$ perf bench mem memcpy -r help
# Running mem/memcpy benchmark...
Unknown routine:help
Available routines...
	default ... Default memcpy() provided by glibc
	x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
	x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
	x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S

$ perf bench mem memcpy -r x86-64-movsq
# Running mem/memcpy benchmark...
# Copying 1MB Bytes ...

       2.229595 GB/Sec
      10.850694 GB/Sec (with prefault)

$ perf bench mem memcpy -r x86-64-movsb
# Running mem/memcpy benchmark...
# Copying 1MB Bytes ...

       2.055921 GB/Sec
       2.447525 GB/Sec (with prefault)

So what could be done is to add something like:

  perf bench locking mutex

Which would, similarly to tools/perf/bench/mem-memcpy-x86-64-asm.S
et al inline the mutex routines, would build user-space glue 
around them and would allow them to be measured.

Such a new feature could then be used to improve mutex 
performance in the future. Likewise:

  perf bench locking spinlock

could be used to do something similar to spinlocks - measuring 
them contended and uncontended, cache-cold and cache-hot, etc.

This would then be used by all future kernel developer 
generations to improve these locking primitives - avoiding the 
test-mutex kind of obsolescence and bitrot :-)

So if you'd be interested in writing that brand new benchmarking 
feature and need help then let the perf people know.

Thanks,

	Ingo


* RE: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-03  7:52     ` Ingo Molnar
@ 2012-04-05  8:37       ` Chen, Dennis (SRDC SW)
  2012-04-05 14:15         ` Clemens Ladisch
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Dennis (SRDC SW) @ 2012-04-05  8:37 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, mingo

On Tue, Apr 3, 2012 at 3:52 PM, Ingo Molnar <mingo@kernel.org> wrote:
> I'm not sure what the point of comparative measurements with
> semaphores would be: for example we don't have per architecture
> optimized semaphores anymore, we switched the legacy semaphores
> to a generic version and are phasing them out.
>

About the point: very simply, I am very curious about the mutex performance 
optimization (actually I am curious about almost everything in the kernel :)
I know that the rationale of the mutex optimization is that if the lock owner is 
running, it is likely to release the lock soon, so making the waiter spin for a 
short time waiting for the lock to be released is reasonable given the cost 
of a process switch.

But what if the running lock owner doesn't release the lock soon? Will that
degrade mutex performance compared with a semaphore?
Look at the code below from the mutex slow path:

int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
	if (!sched_feat(OWNER_SPIN))
		return 0;

	rcu_read_lock();
	while (owner_running(lock, owner)) {
		if (need_resched())
			break;

		arch_mutex_cpu_relax();
	}
	rcu_read_unlock();

	/*
	 * We break out the loop above on need_resched() and when the
	 * owner changed, which is a sign for heavy contention. Return
	 * success only when lock->owner is NULL.
	 */
	return lock->owner == NULL;
}

According to this code, I guess the waiter will probably busy-wait in the while loop. 
So I made an experiment:
1. Write a simple character device kernel module that busy-waits for 2 minutes 
between mutex_lock/mutex_unlock in its read function, like this:

static ssize_t xxx_read(struct file *file, char __user *buf,
                        size_t count, loff_t *ppos)
{
    unsigned long j = jiffies + 120 * HZ;   /* busy wait for 2 minutes */

    mutex_lock(&c_mutex);
    while (time_before(jiffies, j))
        cpu_relax();
    mutex_unlock(&c_mutex);
    return 0;
}

2. Write an application to open and read this device, then start the App on 2 
different CPUs at almost the same time:
# taskset 0x00000001 ./main
# taskset 0x00000002 ./main
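The './main' app itself was not posted; a minimal sketch of what it presumably does, with a hypothetical device node name, is:

```c
/* Hedged sketch of the './main' test app: open the character device and
 * issue one read(), which blocks inside xxx_read() while the mutex is
 * held.  "/dev/xxx" is a made-up node name, not from the original post. */
#include <fcntl.h>
#include <unistd.h>

/* Open the device and read once; returns 0 on success, -1 on error. */
static int read_once(const char *dev)
{
    char buf[4];
    int fd = open(dev, O_RDONLY);

    if (fd < 0)
        return -1;
    if (read(fd, buf, sizeof(buf)) < 0) {   /* blocks in xxx_read() */
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```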

The App on CPU0 will get the mutex lock and run for about 2 minutes before releasing it; 
according to mutex_spin_on_owner(), I guessed the App on CPU1 would spin in the while loop, 
but the ps command output:

root     30197  0.0  0.0   4024   324 pts/2    R+   11:30   0:00 ./main
root     30198  0.0  0.0   4024   324 pts/0    D+   11:30   0:00 ./main

D+ means the App on CPU1 is sleeping in an UNINTERRUPTIBLE state. This is very interesting;
how does this happen? I checked my kernel config: CONFIG_MUTEX_SPIN_ON_OWNER is 
set, and '/sys/kernel/debug# cat sched_features' outputs:
... ICK LB_BIAS OWNER_SPIN NONTASK_POWER TTWU_QUEUE NO_FORCE_SD_OVERLAP

I know I must have some misunderstanding of the code, but I don't know where it is...

> Mutexes have various advantages (such as lockdep coverage and in
> general tighter semantics that makes their usage more robust)

Yes, I agree.

> and we aren't going back to semaphores.
>
> What would make a ton of sense would be to create a 'perf bench'
> module that would use the kernel's mutex code and would measure
> it in user-space. 'perf bench mem' already does a simplified
>
> So if you'd be interested in writing that brand new benchmarking
> feature and need help then let the perf people know.

Thanks Ingo for the info; the benchmarking feature for some kernel features is
already on my TODO list...




* Re: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-05  8:37       ` Chen, Dennis (SRDC SW)
@ 2012-04-05 14:15         ` Clemens Ladisch
  2012-04-06  9:45           ` Chen, Dennis (SRDC SW)
  0 siblings, 1 reply; 19+ messages in thread
From: Clemens Ladisch @ 2012-04-05 14:15 UTC (permalink / raw)
  To: Chen, Dennis (SRDC SW); +Cc: Ingo Molnar, linux-kernel

Chen, Dennis (SRDC SW) wrote:
> I know that the rationale of the mutex optimization is that if the lock owner is
> running, it is likely to release the lock soon, so making the waiter spin for a
> short time waiting for the lock to be released is reasonable given the cost
> of a process switch.
>
> But what if the running lock owner doesn't release the lock soon?

It would not make sense to spin too long, especially if some other
process wants to run on the same CPU.

> int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
> {
> 	...
> 	while (owner_running(lock, owner)) {
> 		if (need_resched())
> 			break;
>
> 		arch_mutex_cpu_relax();
> 	}
> 	...
> }
>
> [experiment]
>
> D+ means the App on CPU1 is sleeping in an UNINTERRUPTIBLE state. This is very interesting;
> how does this happen?

Your experiment shows that there must be some condition that makes the
code break out of the spin loop ...


Regards,
Clemens


* RE: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-05 14:15         ` Clemens Ladisch
@ 2012-04-06  9:45           ` Chen, Dennis (SRDC SW)
  2012-04-06 10:10             ` Clemens Ladisch
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Dennis (SRDC SW) @ 2012-04-06  9:45 UTC (permalink / raw)
  To: Clemens Ladisch; +Cc: Ingo Molnar, linux-kernel

On Thu, Apr 5, 2012 at 10:15 PM, Clemens Ladisch <clemens@ladisch.de> wrote:
> Chen, Dennis (SRDC SW) wrote:
>> I know that the rationale of the mutex optimization is that if the lock owner is
>> running, it is likely to release the lock soon, so making the waiter spin for a
>> short time waiting for the lock to be released is reasonable given the cost
>> of a process switch.
>>
>> But what if the running lock owner doesn't release the lock soon?
>
> It would not make sense to spin too long, especially if some other
> process wants to run on the same CPU.

Yes, you're right. If the waiter spins too long waiting for the owner to release the 
lock, system performance will degrade...

>> int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
>> {
>>       ...
>>       while (owner_running(lock, owner)) {
>>               if (need_resched())
>>                       break;
>>
>>               arch_mutex_cpu_relax();
>>       }
>>       ...
>> }
>>
>> [experiment]
>>
>> D+ means the App on CPU1 is sleeping in an UNINTERRUPTIBLE state. This is very interesting;
>> how does this happen?
>
> Your experiment shows that there must be some condition that makes the
> code break out of the spin loop ...

I guess this is related to the RCU component, but I haven't found where the relevant code 
is located yet. So Ingo's comments would be very helpful, even just a little bit... :-)



* Re: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-06  9:45           ` Chen, Dennis (SRDC SW)
@ 2012-04-06 10:10             ` Clemens Ladisch
  2012-04-06 17:47               ` Chen, Dennis (SRDC SW)
  0 siblings, 1 reply; 19+ messages in thread
From: Clemens Ladisch @ 2012-04-06 10:10 UTC (permalink / raw)
  To: Chen, Dennis (SRDC SW); +Cc: Ingo Molnar, linux-kernel

Chen, Dennis (SRDC SW) wrote:
> On Thu, Apr 5, 2012 at 10:15 PM, Clemens Ladisch <clemens@ladisch.de> wrote:
>> It would not make sense to spin too long, especially if some other
>> process wants to run on the same CPU.
>>
>>> int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
>>> {
>>>       ...
>>>       while (owner_running(lock, owner)) {
>>>               if (need_resched())
>>>                       break;
>>>
>>>               arch_mutex_cpu_relax();
>>>       }
>>>       ...
>>> }
>>>
>>> D+ means the App on CPU1 is sleeping in an UNINTERRUPTIBLE state. This is very interesting;
>>> how does this happen?
>>
>> Your experiment shows that there must be some condition that makes the
>> code break out of the spin loop ...
>
> I guess this is related to the RCU component, but I haven't found where the relevant code
> is located yet.

"On the internet, nobody can hear you being subtle."

If some other process wants to run on the same CPU, need_resched() is set.
(This might happen to make the cursor blink, for keyboard input, or when
somebody starts a rogue process like ps.)
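That control flow can be modeled in userspace with two plain atomics standing in for owner_running() and TIF_NEED_RESCHED. This is a toy model, not kernel API, and all names are invented:

```c
/* Toy model of the mutex_spin_on_owner() loop: spin while the owner
 * runs, but give up as soon as the resched flag is raised.  The two
 * flags stand in for owner_running() and TIF_NEED_RESCHED. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool owner_is_running = true;
static atomic_bool resched_flag;

/* Returns true when the spin ended because the owner stopped running
 * (the lock is probably about to become free), false when someone else
 * wants this CPU and we should stop spinning and go to sleep instead. */
static bool spin_on_owner_model(void)
{
    while (atomic_load(&owner_is_running)) {
        if (atomic_load(&resched_flag))
            return false;       /* need_resched(): bail out and sleep */
        /* cpu_relax() equivalent would go here */
    }
    return true;
}
```

In the 2-minute experiment above, any event that sets the resched flag (a blinking cursor, a keystroke, ps itself) pushes the spinner down the false path, which is why the second App ends up in D state.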


Regards,
Clemens


* RE: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-06 10:10             ` Clemens Ladisch
@ 2012-04-06 17:47               ` Chen, Dennis (SRDC SW)
  2012-04-09 18:45                 ` Paul E. McKenney
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Dennis (SRDC SW) @ 2012-04-06 17:47 UTC (permalink / raw)
  To: Clemens Ladisch; +Cc: Ingo Molnar, linux-kernel

On Fri, Apr 6, 2012 at 6:10 PM, Clemens Ladisch <clemens@ladisch.de> wrote:
> Chen, Dennis (SRDC SW) wrote:
>> On Thu, Apr 5, 2012 at 10:15 PM, Clemens Ladisch <clemens@ladisch.de> wrote:
>>
>> I guess this is related to the RCU component, but I haven't found where the relevant code
>> is located yet.
>
> "On the internet, nobody can hear you being subtle."
>
> If some other process wants to run on the same CPU, need_resched() is set.
> (This might happen to make the cursor blink, for keyboard input, or when
> somebody starts a rogue process like ps.)
>

Hmm, I forgot that __rcu_pending() is called in each timer interrupt, and under some
conditions it will call set_need_resched() to set TIF_NEED_RESCHED...
The mutex optimization works closely with RCU, fantastic!



* Re: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-06 17:47               ` Chen, Dennis (SRDC SW)
@ 2012-04-09 18:45                 ` Paul E. McKenney
  2012-04-11  5:04                   ` Chen, Dennis (SRDC SW)
  0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2012-04-09 18:45 UTC (permalink / raw)
  To: Chen, Dennis (SRDC SW); +Cc: Clemens Ladisch, Ingo Molnar, linux-kernel

On Fri, Apr 06, 2012 at 05:47:28PM +0000, Chen, Dennis (SRDC SW) wrote:
> On Fri, Apr 6, 2012 at 6:10 PM, Clemens Ladisch <clemens@ladisch.de> wrote:
> > Chen, Dennis (SRDC SW) wrote:
> >> On Thu, Apr 5, 2012 at 10:15 PM, Clemens Ladisch <clemens@ladisch.de> wrote:
> >>
> >> I guess this is related to the RCU component, but I haven't found where the relevant
> >> code is located yet.
> >
> > "On the internet, nobody can hear you being subtle."
> >
> > If some other process wants to run on the same CPU, need_resched() is set.
> > (This might happen to make the cursor blink, for keyboard input, or when
> > somebody starts a rogue process like ps.)
> >
> 
> Hmm, I forgot that __rcu_pending() is called in each timer interrupt, and under some
> conditions it will call set_need_resched() to set TIF_NEED_RESCHED...
> The mutex optimization works closely with RCU, fantastic!

I must confess that you all lost me on this one.

There is a call to set_need_resched() in __rcu_pending(), which is
invoked when the current CPU has not yet responded to a non-preemptible
RCU grace period for some time.  However, in the common case where the
CPUs all respond in reasonable time, __rcu_pending() will never call
set_need_resched().

However, we really do not want to call set_need_resched() on every call
to __rcu_pending().  There is almost certainly a better solution to any
problem that might be solved by a per-jiffy call to set_need_resched().

So, what are you really trying to do?

							Thanx, Paul



* RE: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-09 18:45                 ` Paul E. McKenney
@ 2012-04-11  5:04                   ` Chen, Dennis (SRDC SW)
  2012-04-11 17:30                     ` Paul E. McKenney
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Dennis (SRDC SW) @ 2012-04-11  5:04 UTC (permalink / raw)
  To: paulmck; +Cc: Clemens Ladisch, Ingo Molnar, linux-kernel

On Tue, Apr 10, 2012 at 2:45 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Apr 06, 2012 at 05:47:28PM +0000, Chen, Dennis (SRDC SW) wrote:
>> On Fri, Apr 6, 2012 at 6:10 PM, Clemens Ladisch <clemens@ladisch.de> wrote:
>> > Chen, Dennis (SRDC SW) wrote:
>> >
>> > "On the internet, nobody can hear you being subtle."
>> >
>> > If some other process wants to run on the same CPU, need_resched() is set.
>> > (This might happen to make the cursor blink, for keyboard input, or when
>> > somebody starts a rogue process like ps.)
>> >
>>
>> Hmm, I forgot that __rcu_pending() is called in each timer interrupt, and under some
>> conditions it will call set_need_resched() to set TIF_NEED_RESCHED...
>> The mutex optimization works closely with RCU, fantastic!
>
> I must confess that you all lost me on this one.
>
> There is a call to set_need_resched() in __rcu_pending(), which is
> invoked when the current CPU has not yet responded to a non-preemptible
> RCU grace period for some time.  However, in the common case where the
> CPUs all respond in reasonable time, __rcu_pending() will never call
> set_need_resched().
>
> However, we really do not want to call set_need_resched() on every call
> to __rcu_pending().  There is almost certainly a better solution to any
> problem that might be solved by a per-jiffy call to set_need_resched().
>
> So, what are you really trying to do?
>
>                                                        Thanx, Paul

Paul, I must confess that maybe you're right; I've realized my misunderstanding in the previous email.
But I don't want to pretend that I fully understand your "There is almost certainly a 
better solution to any problem that might be solved by a per-jiffy call to set_need_resched()",
because that is related to your last question.

I just want to measure the performance difference between semaphore and mutex. Before that, I looked at the mutex 
optimization code, and the focus is the mutex_spin_on_owner() function. I don't know how long it will
take before some component in the kernel calls set_need_resched() to break the while loop. If it's on
the jiffies level, given that a process switch possibly takes on the order of microseconds, that means the current 
process must spin for several jiffies before it finally gets the mutex lock or goes to sleep; I can't
see the benefit here... 



* Re: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-11  5:04                   ` Chen, Dennis (SRDC SW)
@ 2012-04-11 17:30                     ` Paul E. McKenney
  2012-04-12  9:42                       ` Chen, Dennis (SRDC SW)
  0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2012-04-11 17:30 UTC (permalink / raw)
  To: Chen, Dennis (SRDC SW); +Cc: Clemens Ladisch, Ingo Molnar, linux-kernel

On Wed, Apr 11, 2012 at 05:04:03AM +0000, Chen, Dennis (SRDC SW) wrote:
> On Tue, Apr 10, 2012 at 2:45 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, Apr 06, 2012 at 05:47:28PM +0000, Chen, Dennis (SRDC SW) wrote:
> >> On Fri, Apr 6, 2012 at 6:10 PM, Clemens Ladisch <clemens@ladisch.de> wrote:
> >> > Chen, Dennis (SRDC SW) wrote:
> >> >
> >> > "On the internet, nobody can hear you being subtle."
> >> >
> >> > If some other process wants to run on the same CPU, need_resched() is set.
> >> > (This might happen to make the cursor blink, for keyboard input, or when
> >> > somebody starts a rogue process like ps.)
> >> >
> >>
> >> Hmm, I forgot that __rcu_pending() is called in each timer interrupt, and under some
> >> conditions it will call set_need_resched() to set TIF_NEED_RESCHED...
> >> The mutex optimization works closely with RCU, fantastic!
> >
> > I must confess that you all lost me on this one.
> >
> > There is a call to set_need_resched() in __rcu_pending(), which is
> > invoked when the current CPU has not yet responded to a non-preemptible
> > RCU grace period for some time.  However, in the common case where the
> > CPUs all respond in reasonable time, __rcu_pending() will never call
> > set_need_resched().
> >
> > However, we really do not want to call set_need_resched() on every call
> > to __rcu_pending().  There is almost certainly a better solution to any
> > problem that might be solved by a per-jiffy call to set_need_resched().
> >
> > So, what are you really trying to do?
> >
> >                                                        Thanx, Paul
> 
> Paul, I must confess that maybe you're right; I've realized my misunderstanding in the previous email.
> But I don't want to pretend that I fully understand your "There is almost certainly a 
> better solution to any problem that might be solved by a per-jiffy call to set_need_resched()",
> because that is related to your last question.
> 
> I just want to measure the performance difference between semaphore and mutex. Before that, I looked at the mutex 
> optimization code, and the focus is the mutex_spin_on_owner() function. I don't know how long it will
> take before some component in the kernel calls set_need_resched() to break the while loop. If it's on
> the jiffies level, given that a process switch possibly takes on the order of microseconds, that means the current 
> process must spin for several jiffies before it finally gets the mutex lock or goes to sleep; I can't
> see the benefit here... 

The loop spins only while a given task owns the lock.  This means that
for the loop to spin for several jiffies, one of three things must happen:

1.	The task holding the lock has been running continuously for
	several jiffies.  This would very likely be a bug.  (Why are
	you running CPU bound in the kernel for several jiffies,
	whether or not you are holding a lock?)

2.	The task spinning in mutex_spin_on_owner() happens to be being
	preempted at exactly the same times as the owner is, and so
	by poor luck happens to always see the owner running.

	The probability of this is quite low, and (famous last words!)
	can be safely ignored.

3.	The task spinning in mutex_spin_on_owner() happens to be being
	preempted at exactly the same times that the owner releases
	the lock, and so again by poor luck happens to always see the
	owner running.

	The probability of this is quite low, and (famous last words!)
	can be safely ignored.

Normally, the lock holder either blocks or releases the lock quickly,
so that mutex_spin_on_owner() exits its loop.

So, are you seeing a situation where mutex_spin_on_owner() really is
spinning for multiple jiffies?

							Thanx, Paul



* RE: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-11 17:30                     ` Paul E. McKenney
@ 2012-04-12  9:42                       ` Chen, Dennis (SRDC SW)
  2012-04-12 15:18                         ` Paul E. McKenney
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Dennis (SRDC SW) @ 2012-04-12  9:42 UTC (permalink / raw)
  To: paulmck; +Cc: Clemens Ladisch, Ingo Molnar, linux-kernel

On Thu, Apr 12, 2012 at 1:30 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Wed, Apr 11, 2012 at 05:04:03AM +0000, Chen, Dennis (SRDC SW) wrote:
>>
>> Paul, I must confess that maybe you're right; I've realized my misunderstanding in the previous email.
>> But I don't want to pretend that I fully understand your "There is almost certainly a
>> better solution to any problem that might be solved by a per-jiffy call to set_need_resched()",
>> because that is related to your last question.
>>
>> I just want to measure the performance difference between semaphore and mutex. Before that, I looked at the mutex
>> optimization code, and the focus is the mutex_spin_on_owner() function. I don't know how long it will
>> take before some component in the kernel calls set_need_resched() to break the while loop. If it's on
>> the jiffies level, given that a process switch possibly takes on the order of microseconds, that means the current
>> process must spin for several jiffies before it finally gets the mutex lock or goes to sleep; I can't
>> see the benefit here...
>
> The loop spins only while a given task owns the lock.  This means that
> for the loop to spin for several jiffies, one of three things must happen:
>
> 1.      The task holding the lock has been running continuously for
>        several jiffies.  This would very likely be a bug.  (Why are
>        you running CPU bound in the kernel for several jiffies,
>        whether or not you are holding a lock?)

I think mutex_lock is not the same as spin_lock, so it's not reasonable to impose a time restriction
on the lock holder... though maybe most owners of the lock release it very quickly in real life.

>
> 2.      The task spinning in mutex_spin_on_owner() happens to be being
>        preempted at exactly the same times as the owner is, and so
>        by poor luck happens to always see the owner running.
>
>        The probability of this is quite low, and (famous last words!)
>        can be safely ignored.
>
> 3.      The task spinning in mutex_spin_on_owner() happens to be being
>        preempted at exactly the same times that the owner releases
>        the lock, and so again by poor luck happens to always see the
>        owner running.
>
>        The probability of this is quite low, and (famous last words!)
>        can be safely ignored.

Preemption has been disabled at the beginning of the mutex-lock slow path, so I guess cases 2 and 3
above are impossible...

>
> Normally, the lock holder either blocks or releases the lock quickly,
> so that mutex_spin_on_owner() exits its loop...
>
> So, are you seeing a situation where mutex_spin_on_owner() really is
> spinning for multiple jiffies?

I ran a test to see how many jiffies elapse before someone calls set_need_resched() to break the loop:
    ...
    unsigned long i0, i1;

    i0 = jiffies;
    rcu_read_lock();
    while (1) {
        if (need_resched())
            break;
        cpu_relax();
    }
    rcu_read_unlock();
    i1 = jiffies;
    printk("i0 = %lu, i1 = %lu, HZ = %d\n", i0, i1, HZ);
    ...
The result: with HZ=1000, i1-i0 falls randomly in the [1,...,5] range. So when we exit the while loop
on need_resched(), the mutex optimization over the semaphore hasn't paid off: the mutex spins for 
several jiffies before it finally goes to sleep or gets the lock, right? 

If we are lucky, maybe in 99% of cases we break out of the loop via owner_running(lock, owner), meaning
the lock owner releases the lock very quickly (too quickly to measure in jiffies). 

>
>                                                        Thanx, Paul




* Re: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-12  9:42                       ` Chen, Dennis (SRDC SW)
@ 2012-04-12 15:18                         ` Paul E. McKenney
  2012-04-13 14:15                           ` Chen, Dennis (SRDC SW)
  0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2012-04-12 15:18 UTC (permalink / raw)
  To: Chen, Dennis (SRDC SW); +Cc: Clemens Ladisch, Ingo Molnar, linux-kernel

On Thu, Apr 12, 2012 at 09:42:36AM +0000, Chen, Dennis (SRDC SW) wrote:
> On Thu, Apr 12, 2012 at 1:30 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > On Wed, Apr 11, 2012 at 05:04:03AM +0000, Chen, Dennis (SRDC SW) wrote:
> >>
> >> Paul, I must confess that maybe you're right; I've realized my misunderstanding in the previous email.
> >> But I don't want to pretend that I fully understand your "There is almost certainly a
> >> better solution to any problem that might be solved by a per-jiffy call to set_need_resched()",
> >> because that is related to your last question.
> >>
> >> I just want to measure the performance difference between semaphore and mutex. Before that, I looked at the mutex
> >> optimization code, and the focus is the mutex_spin_on_owner() function. I don't know how long it will
> >> take before some component in the kernel calls set_need_resched() to break the while loop. If it's on
> >> the jiffies level, given that a process switch possibly takes on the order of microseconds, that means the current
> >> process must spin for several jiffies before it finally gets the mutex lock or goes to sleep; I can't
> >> see the benefit here...
> >
> > The loop spins only while a given task owns the lock.  This means that
> > for the loop to spin for several jiffies, one of three things must happen:
> >
> > 1.      The task holding the lock has been running continuously for
> >        several jiffies.  This would very likely be a bug.  (Why are
> >        you running CPU bound in the kernel for several jiffies,
> >        whether or not you are holding a lock?)
> 
> I think mutex_lock is not the same as spin_lock, so it's not reasonable to impose a hold-time restriction
> on the lock user... though maybe most owners release the lock very quickly in real life

Yep -- to do otherwise is usually bad for both performance and scalability.

> > 2.      The task spinning in mutex_spin_on_owner() happens to be being
> >        preempted at exactly the same times as the owner is, and so
> >        by poor luck happens to always see the owner running.
> >
> >        The probability of this is quite low, so it (famous last words!)
> >        can be safely ignored.
> >
> > 3.      The task spinning in mutex_spin_on_owner() happens to be being
> >        preempted at exactly the same times that the owner releases
> >        the lock, and so again by poor luck happens to always see the
> >        owner running.
> >
> >        The probability of this is quite low, so it (famous last words!)
> >        can be safely ignored.
> 
> Preemption has been disabled at the beginning of the mutex-lock slow path, so I guess cases 2 and 3 above
> are impossible...

Almost.  The guy in mutex_spin_on_owner() might be interrupted at just
the wrong times, which would have the same effect as being preempted.
But in both cases, the probability is really low, so it can still be
safely ignored.

> > Normally, the lock holder either blocks or releases the lock quickly,
> > so that mutex_spin_on_owner() exits its loop...
> >
> > So, are you seeing a situation where mutex_spin_on_owner() really is
> > spinning for multiple jiffies?
> 
> I have a test to measure how many jiffies elapse before someone calls set_need_resched() to break the loop:
>     ...
>     i0 = jiffies;
>     rcu_read_lock();
>     while(1){
>         if (need_resched())
>             break;
>         cpu_relax();
>     }
>     rcu_read_unlock();
>     i1 = jiffies;
>     printk("i0 = %ld, i1 = %ld, HZ=%d\n", i0, i1, HZ);
>     ...
> The result is that, with HZ=1000, i1-i0 falls randomly in the range [1, ..., 5]. So if we exit the while loop
> on need_resched(), the optimization of the mutex over the semaphore doesn't succeed: the mutex spins for
> several jiffies before it finally goes to sleep or gets the lock, right?

But this is quite different than mutex_spin_on_owner().  You are doing
"while(1)" and mutex_spin_on_owner() is doing "while (owner_running(...))".
This means that mutex_spin_on_owner() will typically stop spinning as
soon as the owner blocks or releases the mutex.  In contrast, the only
way your loop can exit is if need_resched() returns true.

Therefore, your results are irrelevant to mutex_spin_on_owner().  If you
want to show that mutex_spin_on_owner() has a problem, you should
instrument mutex_spin_on_owner() and run your instrumented version in
the kernel.

> If we are lucky, maybe in 99% of cases we break the loop via owner_running(lock, owner), meaning the lock owner
> releases the lock very quickly (too fast to be measured in jiffies).

That is in fact the intent of the design.

In theory, possible problems with the current code include:

1.	Unnecessarily blocking lower-priority tasks.

2.	Consuming energy spinning when the CPU might otherwise
	go idle and thus into a low-power state.

But you need to measure the actual code running a realistic workload
to get a fair diagnosis of any problems that might be present.


                                                        Thanx, Paul


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-12 15:18                         ` Paul E. McKenney
@ 2012-04-13 14:15                           ` Chen, Dennis (SRDC SW)
  2012-04-13 18:43                             ` Paul E. McKenney
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Dennis (SRDC SW) @ 2012-04-13 14:15 UTC (permalink / raw)
  To: paulmck; +Cc: Clemens Ladisch, Ingo Molnar, linux-kernel

On Thu, Apr 12, 2012 at 11:18 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Thu, Apr 12, 2012 at 09:42:36AM +0000, Chen, Dennis (SRDC SW) wrote:
>> I think mutex_lock is not the same as spin_lock, so it's not reasonable to impose a hold-time restriction
>> on the lock user... though maybe most owners release the lock very quickly in real life
>
> Yep -- to do otherwise is usually bad for both performance and scalability.
>
>> Preemption has been disabled at the beginning of the mutex-lock slow path, so I guess cases 2 and 3 above
>> are impossible...
>
> Almost.  The guy in mutex_spin_on_owner() might be interrupted at just
> the wrong times, which would have the same effect as being preempted.
> But in both cases, the probability is really low, so it can still be
> safely ignored.
>
>> I have a test to measure how many jiffies elapse before someone calls set_need_resched() to break the loop:
>>     ...
>>     i0 = jiffies;
>>     rcu_read_lock();
>>     while(1){
>>         if (need_resched())
>>             break;
>>         cpu_relax();
>>     }
>>     rcu_read_unlock();
>>     i1 = jiffies;
>>     printk("i0 = %ld, i1 = %ld, HZ=%d\n", i0, i1, HZ);
>>     ...
>> The result is that, with HZ=1000, i1-i0 falls randomly in the range [1, ..., 5]. So if we exit the while loop
>> on need_resched(), the optimization of the mutex over the semaphore doesn't succeed: the mutex spins for
>> several jiffies before it finally goes to sleep or gets the lock, right?
>
> But this is quite different than mutex_spin_on_owner().  You are doing
> "while(1)" and mutex_spin_on_owner() is doing "while (owner_running(...))".
> This means that mutex_spin_on_owner() will typically stop spinning as
> soon as the owner blocks or releases the mutex.  In contrast, the only
> way your loop can exit is if need_resched() returns true.
>
> Therefore, your results are irrelevant to mutex_spin_on_owner().  If you
> want to show that mutex_spin_on_owner() has a problem, you should
> instrument mutex_spin_on_owner() and run your instrumented version in
> the kernel.

I know that the test is different from mutex_spin_on_owner(); the reason I used while(1) instead of the actual code is
that I wanted to know how many jiffies elapse before the loop breaks on need_resched() when the owner runs long enough.
If you want to see the test result on mutex_spin_on_owner() itself, I have the following test case (based on my earlier message
in this thread, which you can check for some background context...):

1. <in a kernel module> in the xxx_read() function, I have the following code piece:
    /* idle loop for 5s between the 1st App getting the lock and releasing it */
    unsigned long j = jiffies + 5 * HZ;
    /* identify ourselves in the mutex_spin_on_owner() function to avoid tons of printk messages.
     * The first App will take the mutex via the fast path and never enter the slow path, so it's easy
     * to identify the 2nd App by this tricky pid bit.
     */
    current->pid |= (1 << 31);
    mutex_lock(&c_mutex);
    while (time_before(jiffies, j))
        cpu_relax();
    mutex_unlock(&c_mutex);
    current->pid &= ~(1 << 31);

2. <in the kernel source code> I made some changes to mutex_spin_on_owner() as below:
int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
        unsigned long j0, j1; /*add*/
        if (!sched_feat(OWNER_SPIN))
                return 0;

        rcu_read_lock();
        j0 = jiffies; /*add*/
        while (owner_running(lock, owner)) {
                if (need_resched())
                        break;

                arch_mutex_cpu_relax();
        }
        j1 = jiffies; /*add*/
        rcu_read_unlock();
        if (current->pid & (1 << 31))  /*add*/
                printk("j1-j0=%ld\n", j1 - j0); /*add*/
        /*
         * We break out the loop above on need_resched() and when the
         * owner changed, which is a sign for heavy contention. Return
         * success only when lock->owner is NULL.
         */
        return lock->owner == NULL;
}

3. run the Apps on 2 different CPUs (0 and 2):
   #taskset 0x00000001 ./main
   #taskset 0x00000004 ./main
   
  The App running on cpu0 will get the lock directly and idle-loop for 5s while owning the lock,
  so the App on cpu2 will enter the slow path and call mutex_spin_on_owner(); thus we can get the delta jiffies
  when the loop breaks on need_resched(). The test results are below (HZ=1000, I ran the test for 6 rounds):
  j1-j0=19    -- NO.1
  j1-j0=512   -- NO.2
  j1-j0=316   -- NO.3
  j1-j0=538   -- NO.4
  j1-j0=331   -- NO.5
  j1-j0=996   -- NO.6

Obviously, the mutex wastes a lot of CPU resources compared to the semaphore in this case.
I know the bedrock of the mutex performance optimization is "if the lock owner is running, it is likely to
release the lock soon", but I don't know whether there are statistics to support this rationale; in other words,
what makes us reach this conclusion?

>> If we are lucky, maybe in 99% of cases we break the loop via owner_running(lock, owner), meaning the lock owner
>> releases the lock very quickly (too fast to be measured in jiffies).
>
> That is in fact the intent of the design.
>
> In theory, possible problems with the current code include:
>
> 1.      Unnecessarily blocking lower-priority tasks.
>
> 2.      Consuming energy spinning when the CPU might otherwise
>        go idle and thus into a low-power state.
>
> But you need to measure the actual code running a realistic workload
> to get a fair diagnosis of any problems that might be present.

Actually, as I mentioned earlier in this thread, I have no intention of judging the code design itself; I am just curious
about the PERFORMANCE OPTIMIZATION of the mutex, nothing else. Now I am writing some simple benchmark code to measure
the real performance of the two primitives under a realistic workload; hopefully I can finish it this weekend, as I am
not familiar with user-space code :-)

>
>                                                        Thanx, Paul
>



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: semaphore and mutex in current Linux kernel (3.2.2)
  2012-04-13 14:15                           ` Chen, Dennis (SRDC SW)
@ 2012-04-13 18:43                             ` Paul E. McKenney
  2012-04-16  8:33                               ` [PATCH 0/2] tools perf: Add a new benchmark tool for semaphore/mutex Chen, Dennis (SRDC SW)
  0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2012-04-13 18:43 UTC (permalink / raw)
  To: Chen, Dennis (SRDC SW); +Cc: Clemens Ladisch, Ingo Molnar, linux-kernel

On Fri, Apr 13, 2012 at 02:15:25PM +0000, Chen, Dennis (SRDC SW) wrote:
> On Thu, Apr 12, 2012 at 11:18 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > On Thu, Apr 12, 2012 at 09:42:36AM +0000, Chen, Dennis (SRDC SW) wrote:
> >> I think mutex_lock is not the same as spin_lock, so it's not reasonable to impose a hold-time restriction
> >> on the lock user... though maybe most owners release the lock very quickly in real life
> >
> > Yep -- to do otherwise is usually bad for both performance and scalability.
> >
> >> Preemption has been disabled at the beginning of the mutex-lock slow path, so I guess cases 2 and 3 above
> >> are impossible...
> >
> > Almost.  The guy in mutex_spin_on_owner() might be interrupted at just
> > the wrong times, which would have the same effect as being preempted.
> > But in both cases, the probability is really low, so it can still be
> > safely ignored.
> >
> >> I have a test to measure how many jiffies elapse before someone calls set_need_resched() to break the loop:
> >>     ...
> >>     i0 = jiffies;
> >>     rcu_read_lock();
> >>     while(1){
> >>         if (need_resched())
> >>             break;
> >>         cpu_relax();
> >>     }
> >>     rcu_read_unlock();
> >>     i1 = jiffies;
> >>     printk("i0 = %ld, i1 = %ld, HZ=%d\n", i0, i1, HZ);
> >>     ...
> >> The result is that, with HZ=1000, i1-i0 falls randomly in the range [1, ..., 5]. So if we exit the while loop
> >> on need_resched(), the optimization of the mutex over the semaphore doesn't succeed: the mutex spins for
> >> several jiffies before it finally goes to sleep or gets the lock, right?
> >
> > But this is quite different than mutex_spin_on_owner().  You are doing
> > "while(1)" and mutex_spin_on_owner() is doing "while (owner_running(...))".
> > This means that mutex_spin_on_owner() will typically stop spinning as
> > soon as the owner blocks or releases the mutex.  In contrast, the only
> > way your loop can exit is if need_resched() returns true.
> >
> > Therefore, your results are irrelevant to mutex_spin_on_owner().  If you
> > want to show that mutex_spin_on_owner() has a problem, you should
> > instrument mutex_spin_on_owner() and run your instrumented version in
> > the kernel.
> 
> I know that the test is different from mutex_spin_on_owner(); the reason I used while(1) instead of the actual code is
> that I wanted to know how many jiffies elapse before the loop breaks on need_resched() when the owner runs long enough.
> If you want to see the test result on mutex_spin_on_owner() itself, I have the following test case (based on my earlier message
> in this thread, which you can check for some background context...):
> 
> 1. <in a kernel module> in the xxx_read() function, I have the following code piece:
>     /* idle loop for 5s between the 1st App getting the lock and releasing it */
>     unsigned long j = jiffies + 5 * HZ;
>     /* identify ourselves in the mutex_spin_on_owner() function to avoid tons of printk messages.
>      * The first App will take the mutex via the fast path and never enter the slow path, so it's easy
>      * to identify the 2nd App by this tricky pid bit.
>      */
>     current->pid |= (1 << 31);
>     mutex_lock(&c_mutex);
>     while (time_before(jiffies, j))
>         cpu_relax();
>     mutex_unlock(&c_mutex);
>     current->pid &= ~(1 << 31);
> 
> 2. <in the kernel source code> I made some changes to mutex_spin_on_owner() as below:
> int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
> {
>         unsigned long j0, j1; /*add*/
>         if (!sched_feat(OWNER_SPIN))
>                 return 0;
> 
>         rcu_read_lock();
>         j0 = jiffies; /*add*/
>         while (owner_running(lock, owner)) {
>                 if (need_resched())
>                         break;
> 
>                 arch_mutex_cpu_relax();
>         }
>         j1 = jiffies; /*add*/
>         rcu_read_unlock();
>         if (current->pid & (1 << 31))  /*add*/
>                 printk("j1-j0=%ld\n", j1 - j0); /*add*/
>         /*
>          * We break out the loop above on need_resched() and when the
>          * owner changed, which is a sign for heavy contention. Return
>          * success only when lock->owner is NULL.
>          */
>         return lock->owner == NULL;
> }
> 
> 3. run the Apps on 2 different CPUs (0 and 2):
>    #taskset 0x00000001 ./main
>    #taskset 0x00000004 ./main
>    
>   The App running on cpu0 will get the lock directly and idle-loop for 5s while owning the lock,
>   so the App on cpu2 will enter the slow path and call mutex_spin_on_owner(); thus we can get the delta jiffies
>   when the loop breaks on need_resched(). The test results are below (HZ=1000, I ran the test for 6 rounds):
>   j1-j0=19    -- NO.1
>   j1-j0=512   -- NO.2
>   j1-j0=316   -- NO.3
>   j1-j0=538   -- NO.4
>   j1-j0=331   -- NO.5
>   j1-j0=996   -- NO.6
> 
> Obviously, the mutex wastes a lot of CPU resources compared to the semaphore in this case.
> I know the bedrock of the mutex performance optimization is "if the lock owner is running, it is likely to
> release the lock soon", but I don't know whether there are statistics to support this rationale; in other words,
> what makes us reach this conclusion?
> 
> >> If we are lucky, maybe in 99% of cases we break the loop via owner_running(lock, owner), meaning the lock owner
> >> releases the lock very quickly (too fast to be measured in jiffies).
> >
> > That is in fact the intent of the design.
> >
> > In theory, possible problems with the current code include:
> >
> > 1.      Unnecessarily blocking lower-priority tasks.
> >
> > 2.      Consuming energy spinning when the CPU might otherwise
> >        go idle and thus into a low-power state.
> >
> > But you need to measure the actual code running a realistic workload
> > to get a fair diagnosis of any problems that might be present.
> 
> Actually, as I mentioned earlier in this thread, I have no intention of judging the code design itself; I am just curious
> about the PERFORMANCE OPTIMIZATION of the mutex, nothing else. Now I am writing some simple benchmark code to measure
> the real performance of the two primitives under a realistic workload; hopefully I can finish it this weekend, as I am
> not familiar with user-space code :-)

Actual measurements of the real code showing the benefits of any change
you might propose would indeed be helpful.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 0/2] tools perf: Add a new benchmark tool for semaphore/mutex
  2012-04-13 18:43                             ` Paul E. McKenney
@ 2012-04-16  8:33                               ` Chen, Dennis (SRDC SW)
  2012-04-16  9:24                                 ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Chen, Dennis (SRDC SW) @ 2012-04-16  8:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, paulmck, peterz, Paul Mackerras, Arnaldo Carvalho de Melo

<PATCH PREFACE>
-------------------
This patch series adds a new performance benchmark tool for semaphores and mutexes:
the new tool forks the number of tasks specified on the command line and distributes
them evenly across all the CPUs in the system. The command to launch the tool looks like:
'# perf bench locking mutex -p 8 -t 400 -c'

The above command will create 400 tasks on an 8-CPU system, so each CPU will run 50 tasks.
After a task is created, it reads all the files and directories under '/sys/module'.
sysfs is RAM-based, and its read operations on both directories and files are very sensitive to the mutex
lock; also, '/sys/module' has almost no dependencies on external devices.

We can use this tool with the 'perf record' command to find the hot spots in the code, or with
'perf top -g' to get live info. For example, below is a test case run on an Intel i7-2600 box
(the -c option reports CPU cycles; I don't use it in this test case):

# perf record -a perf bench locking mutex -p 8 -t 4000
# Running locking/mutex benchmark... 
 ...
 [13894 ]/6  duration        23 s   609392 us
 [13996 ]/4  duration        23 s   599418 us
 [14056 ]/0  duration        23 s   595710 us
 [13715 ]/3  duration        23 s   621719 us
 [13390 ]/6  duration        23 s   644020 us
 [13696 ]/0  duration        23 s   623101 us
 [14334 ]/6  duration        23 s   580262 us
 [14343 ]/7  duration        23 s   578702 us
 [14283 ]/3  duration        23 s   583007 us
 -----------------------------------
 Total duration     79353 s   943945 us

 real: 23.84   s
 user: 0.00   
 sys:  0.45   

# perf report
===================================================================================
...
# perf version : 3.3.2
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
# total memory : 3966460 kB
# cmdline : /usr/bin/perf record -a perf bench locking mutex -p 8 -t 4000

# Events: 131K cycles
#
# Overhead          Command                      Shared Object                                 Symbol
# ........  ...............  .................................  .....................................
#
    22.12%           perf  [kernel.kallsyms]                  [k] __mutex_lock_slowpath
     8.27%           perf  [kernel.kallsyms]                  [k] _raw_spin_lock
     6.16%           perf  [kernel.kallsyms]                  [k] mutex_unlock
     5.22%           perf  [kernel.kallsyms]                  [k] mutex_spin_on_owner
     4.94%           perf  [kernel.kallsyms]                  [k] sysfs_refresh_inode
     4.82%           perf  [kernel.kallsyms]                  [k] mutex_lock
     2.67%           perf  [kernel.kallsyms]                  [k] __mutex_unlock_slowpath
     2.61%           perf  [kernel.kallsyms]                  [k] link_path_walk
     2.42%           perf  [kernel.kallsyms]                  [k] _raw_spin_lock_irqsave
     1.61%           perf  [kernel.kallsyms]                  [k] __d_lookup
     1.18%           perf  [kernel.kallsyms]                  [k] clear_page_c
     1.16%           perf  [kernel.kallsyms]                  [k] dput
     0.97%           perf  [kernel.kallsyms]                  [k] do_lookup
     0.93%        swapper  [kernel.kallsyms]                  [k] intel_idle
     0.87%           perf  [kernel.kallsyms]                  [k] get_page_from_freelist
     0.85%           perf  [kernel.kallsyms]                  [k] __strncpy_from_user
     0.81%           perf  [kernel.kallsyms]                  [k] system_call
     0.78%           perf  libc-2.13.so                       [.] 0x84ef0         
     0.71%           perf  [kernel.kallsyms]                  [k] vfsmount_lock_local_lock
     0.68%           perf  [kernel.kallsyms]                  [k] sysfs_dentry_revalidate
     0.62%           perf  [kernel.kallsyms]                  [k] try_to_wake_up
     0.62%           perf  [kernel.kallsyms]                  [k] kfree
     0.60%           perf  [kernel.kallsyms]                  [k] kmem_cache_alloc   
............................................................................................

We can see that with 4000 tasks running on 8 CPUs simultaneously there is very heavy
contention on the mutex lock, so lots of tasks enter the slow path of the mutex lock...
I am very curious: if we switched the mutex to a semaphore in this case, how would things go?
My next plan





^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/2] tools perf: Add a new benchmark tool for semaphore/mutex
  2012-04-16  8:33                               ` [PATCH 0/2] tools perf: Add a new benchmark tool for semaphore/mutex Chen, Dennis (SRDC SW)
@ 2012-04-16  9:24                                 ` Ingo Molnar
  2012-04-16 14:10                                   ` Chen, Dennis (SRDC SW)
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2012-04-16  9:24 UTC (permalink / raw)
  To: Chen, Dennis (SRDC SW)
  Cc: linux-kernel, paulmck, peterz, Paul Mackerras, Arnaldo Carvalho de Melo


* Chen, Dennis (SRDC SW) <Dennis1.Chen@amd.com> wrote:

> <PATCH PREFACE>
> -------------------
> This patch series adds a new performance benchmark tool for semaphores and mutexes:
> the new tool forks the number of tasks specified on the command line and distributes
> them evenly across all the CPUs in the system. The command to launch the tool looks like:
> '# perf bench locking mutex -p 8 -t 400 -c'
> 
> The above command will create 400 tasks on an 8-CPU system, so each CPU will run 50 tasks.
> After a task is created, it reads all the files and directories under '/sys/module'.
> sysfs is RAM-based, and its read operations on both directories and files are very sensitive to the mutex
> lock; also, '/sys/module' has almost no dependencies on external devices.
> 
> We can use this tool with the 'perf record' command to find the hot spots in the code, or with
> 'perf top -g' to get live info. For example, below is a test case run on an Intel i7-2600 box
> (the -c option reports CPU cycles; I don't use it in this test case):
> 
> # perf record -a perf bench locking mutex -p 8 -t 4000
> # Running locking/mutex benchmark... 
>  ...
>  [13894 ]/6  duration        23 s   609392 us
>  [13996 ]/4  duration        23 s   599418 us
>  [14056 ]/0  duration        23 s   595710 us
>  [13715 ]/3  duration        23 s   621719 us
>  [13390 ]/6  duration        23 s   644020 us
>  [13696 ]/0  duration        23 s   623101 us
>  [14334 ]/6  duration        23 s   580262 us
>  [14343 ]/7  duration        23 s   578702 us
>  [14283 ]/3  duration        23 s   583007 us
>  -----------------------------------
>  Total duration     79353 s   943945 us
> 
>  real: 23.84   s
>  user: 0.00   
>  sys:  0.45   
> 
> # perf report
> ===================================================================================
> ...
> # perf version : 3.3.2
> # arch : x86_64
> # nrcpus online : 8
> # nrcpus avail : 8
> # cpudesc : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
> # total memory : 3966460 kB
> # cmdline : /usr/bin/perf record -a perf bench locking mutex -p 8 -t 4000
> 
> # Events: 131K cycles
> #
> # Overhead          Command                      Shared Object                                 Symbol
> # ........  ...............  .................................  .....................................
> #
>     22.12%           perf  [kernel.kallsyms]                  [k] __mutex_lock_slowpath
>      8.27%           perf  [kernel.kallsyms]                  [k] _raw_spin_lock
>      6.16%           perf  [kernel.kallsyms]                  [k] mutex_unlock
>      5.22%           perf  [kernel.kallsyms]                  [k] mutex_spin_on_owner
>      4.94%           perf  [kernel.kallsyms]                  [k] sysfs_refresh_inode
>      4.82%           perf  [kernel.kallsyms]                  [k] mutex_lock
>      2.67%           perf  [kernel.kallsyms]                  [k] __mutex_unlock_slowpath
>      2.61%           perf  [kernel.kallsyms]                  [k] link_path_walk
>      2.42%           perf  [kernel.kallsyms]                  [k] _raw_spin_lock_irqsave
>      1.61%           perf  [kernel.kallsyms]                  [k] __d_lookup
>      1.18%           perf  [kernel.kallsyms]                  [k] clear_page_c
>      1.16%           perf  [kernel.kallsyms]                  [k] dput
>      0.97%           perf  [kernel.kallsyms]                  [k] do_lookup
>      0.93%        swapper  [kernel.kallsyms]                  [k] intel_idle
>      0.87%           perf  [kernel.kallsyms]                  [k] get_page_from_freelist
>      0.85%           perf  [kernel.kallsyms]                  [k] __strncpy_from_user
>      0.81%           perf  [kernel.kallsyms]                  [k] system_call
>      0.78%           perf  libc-2.13.so                       [.] 0x84ef0         
>      0.71%           perf  [kernel.kallsyms]                  [k] vfsmount_lock_local_lock
>      0.68%           perf  [kernel.kallsyms]                  [k] sysfs_dentry_revalidate
>      0.62%           perf  [kernel.kallsyms]                  [k] try_to_wake_up
>      0.62%           perf  [kernel.kallsyms]                  [k] kfree
>      0.60%           perf  [kernel.kallsyms]                  [k] kmem_cache_alloc   
> ............................................................................................
> 

Nice! Would be nice to lift some of this information over into 
the changelogs, to address my complaints in the previous mail.

> We can see that with 4000 tasks running on 8 CPUs simultaneously there is very heavy
> contention on the mutex lock, so lots of tasks enter the slow path of the mutex lock...
> I am very curious: if we switched the mutex to a semaphore in this case, how would things go?
> My next plan

Seems like an unfinished sentence.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 0/2] tools perf: Add a new benchmark tool for semaphore/mutex
  2012-04-16  9:24                                 ` Ingo Molnar
@ 2012-04-16 14:10                                   ` Chen, Dennis (SRDC SW)
  0 siblings, 0 replies; 19+ messages in thread
From: Chen, Dennis (SRDC SW) @ 2012-04-16 14:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, paulmck, peterz, Paul Mackerras, Arnaldo Carvalho de Melo

On Mon, Apr 16, 2012 at 5:24 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Chen, Dennis (SRDC SW) <Dennis1.Chen@amd.com> wrote:
>
>> <PATCH PREFACE>
>> -------------------
>> This patch series adds a new performance benchmark tool for semaphores and mutexes:
>> the new tool forks the number of tasks specified on the command line and distributes
>> them evenly across all the CPUs in the system. The command to launch the tool looks like:
>> '# perf bench locking mutex -p 8 -t 400 -c'
>>
>> The above command will create 400 tasks on an 8-CPU system, so each CPU will run 50 tasks.
>> After a task is created, it reads all the files and directories under '/sys/module'.
>> sysfs is RAM-based, and its read operations on both directories and files are very sensitive to the mutex
>> lock; also, '/sys/module' has almost no dependencies on external devices.
>>
>> We can use this tool with the 'perf record' command to find the hot spots in the code, or with
>> 'perf top -g' to get live info. For example, below is a test case run on an Intel i7-2600 box
>> (the -c option reports CPU cycles; I don't use it in this test case):
>>
>> # perf record -a perf bench locking mutex -p 8 -t 4000
>> # Running locking/mutex benchmark...
>>  ...
>>  [13894 ]/6  duration        23 s   609392 us
>>  [13996 ]/4  duration        23 s   599418 us
>>  [14056 ]/0  duration        23 s   595710 us
>>  [13715 ]/3  duration        23 s   621719 us
>>  [13390 ]/6  duration        23 s   644020 us
>>  [13696 ]/0  duration        23 s   623101 us
>>  [14334 ]/6  duration        23 s   580262 us
>>  [14343 ]/7  duration        23 s   578702 us
>>  [14283 ]/3  duration        23 s   583007 us
>>  -----------------------------------
>>  Total duration     79353 s   943945 us
>>
>>  real: 23.84   s
>>  user: 0.00
>>  sys:  0.45
>>
>> # perf report
>> ===================================================================================
>> ...
>> # perf version : 3.3.2
>> # arch : x86_64
>> # nrcpus online : 8
>> # nrcpus avail : 8
>> # cpudesc : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
>> # total memory : 3966460 kB
>> # cmdline : /usr/bin/perf record -a perf bench locking mutex -p 8 -t 4000
>>
>> # Events: 131K cycles
>> #
>> # Overhead          Command                      Shared Object                                 Symbol
>> # ........  ...............  .................................  .....................................
>> #
>>     22.12%           perf  [kernel.kallsyms]                  [k] __mutex_lock_slowpath
>>      8.27%           perf  [kernel.kallsyms]                  [k] _raw_spin_lock
>>      6.16%           perf  [kernel.kallsyms]                  [k] mutex_unlock
>>      5.22%           perf  [kernel.kallsyms]                  [k] mutex_spin_on_owner
>>      4.94%           perf  [kernel.kallsyms]                  [k] sysfs_refresh_inode
>>      4.82%           perf  [kernel.kallsyms]                  [k] mutex_lock
>>      2.67%           perf  [kernel.kallsyms]                  [k] __mutex_unlock_slowpath
>>      2.61%           perf  [kernel.kallsyms]                  [k] link_path_walk
>>      2.42%           perf  [kernel.kallsyms]                  [k] _raw_spin_lock_irqsave
>>      1.61%           perf  [kernel.kallsyms]                  [k] __d_lookup
>>      1.18%           perf  [kernel.kallsyms]                  [k] clear_page_c
>>      1.16%           perf  [kernel.kallsyms]                  [k] dput
>>      0.97%           perf  [kernel.kallsyms]                  [k] do_lookup
>>      0.93%        swapper  [kernel.kallsyms]                  [k] intel_idle
>>      0.87%           perf  [kernel.kallsyms]                  [k] get_page_from_freelist
>>      0.85%           perf  [kernel.kallsyms]                  [k] __strncpy_from_user
>>      0.81%           perf  [kernel.kallsyms]                  [k] system_call
>>      0.78%           perf  libc-2.13.so                       [.] 0x84ef0
>>      0.71%           perf  [kernel.kallsyms]                  [k] vfsmount_lock_local_lock
>>      0.68%           perf  [kernel.kallsyms]                  [k] sysfs_dentry_revalidate
>>      0.62%           perf  [kernel.kallsyms]                  [k] try_to_wake_up
>>      0.62%           perf  [kernel.kallsyms]                  [k] kfree
>>      0.60%           perf  [kernel.kallsyms]                  [k] kmem_cache_alloc
>> ............................................................................................
>>
>
> Nice! Would be nice to lift some of this information over into
> the changelogs, to address my complaints in the previous mail.

Thanks for the suggestion! I will resubmit the patches as a single patch and include the above
information in the changelog to address that issue...

>>  We can see that with 4000 tasks running on 8 CPUs simultaneously, there is very heavy
>>  contention on the mutex lock, so lots of tasks enter the slow path of the mutex lock...
>>  I am very curious: if we switched the mutex to a semaphore in this case, how would things go?
>> My next plan
>
> Seems like an unfinished sentence.

Oh, I meant that my next plan is to do some performance analysis of the two primitives with this tool...

> Thanks,
>
>        Ingo




Thread overview: 19+ messages
2012-04-01  9:56 semaphore and mutex in current Linux kernel (3.2.2) Chen, Dennis (SRDC SW)
2012-04-01 12:19 ` Ingo Molnar
2012-04-02 15:28   ` Chen, Dennis (SRDC SW)
2012-04-03  7:52     ` Ingo Molnar
2012-04-05  8:37       ` Chen, Dennis (SRDC SW)
2012-04-05 14:15         ` Clemens Ladisch
2012-04-06  9:45           ` Chen, Dennis (SRDC SW)
2012-04-06 10:10             ` Clemens Ladisch
2012-04-06 17:47               ` Chen, Dennis (SRDC SW)
2012-04-09 18:45                 ` Paul E. McKenney
2012-04-11  5:04                   ` Chen, Dennis (SRDC SW)
2012-04-11 17:30                     ` Paul E. McKenney
2012-04-12  9:42                       ` Chen, Dennis (SRDC SW)
2012-04-12 15:18                         ` Paul E. McKenney
2012-04-13 14:15                           ` Chen, Dennis (SRDC SW)
2012-04-13 18:43                             ` Paul E. McKenney
2012-04-16  8:33                               ` [PATCH 0/2] tools perf: Add a new benchmark tool for semaphore/mutex Chen, Dennis (SRDC SW)
2012-04-16  9:24                                 ` Ingo Molnar
2012-04-16 14:10                                   ` Chen, Dennis (SRDC SW)
