[PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake
       [not found] <cover.1371855277.git.tim.c.chen@linux.intel.com>
@ 2013-06-21 23:51 ` Tim Chen
  2013-06-22  0:10   ` Alex Shi
  2013-06-22  7:21   ` Peter Hurley
  2013-06-21 23:51 ` [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition Tim Chen
  1 sibling, 2 replies; 25+ messages in thread
From: Tim Chen @ 2013-06-21 23:51 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
	Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
	Rik van Riel, Tim Chen, linux-kernel, linux-mm

Doing cmpxchg will cause cache bouncing when checking
sem->count. This could cause scalability issue
in a large machine (e.g. a 80 cores box).

A pre-read of sem->count can mitigate this.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/asm-generic/rwsem.h |    8 ++++----
 lib/rwsem.c                 |   21 +++++++++++++--------
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index bb1e2cd..052d973 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
+		return 0;
 
-	tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
-		      RWSEM_ACTIVE_WRITE_BIAS);
-	return tmp == RWSEM_UNLOCKED_VALUE;
+	return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+		RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
 }
 
 /*
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 19c5fa9..2072af5 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 			 * will block as they will notice the queued writer.
 			 */
 			wake_up_process(waiter->task);
-		goto out;
+		return sem;
 	}
 
 	/* Writers might steal the lock before we grant it to the next reader.
@@ -85,15 +85,21 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	adjustment = 0;
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
 		adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
-		oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
-		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-			/* A writer stole the lock. Undo our reader grant. */
+		while (1) {
+			/* A writer stole the lock. */
+			if (sem->count < RWSEM_WAITING_BIAS)
+				return sem;
+
+			oldcount = rwsem_atomic_update(adjustment, sem)
+								- adjustment;
+			if (likely(oldcount >= RWSEM_WAITING_BIAS))
+				break;
+
+			 /* A writer stole the lock.  Undo our reader grant. */
 			if (rwsem_atomic_update(-adjustment, sem) &
 						RWSEM_ACTIVE_MASK)
-				goto out;
+				return sem;
 			/* Last active locker left. Retry waking readers. */
-			goto try_reader_grant;
 		}
 	}
 
@@ -136,7 +142,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	sem->wait_list.next = next;
 	next->prev = &sem->wait_list;
 
- out:
 	return sem;
 }
 
-- 
1.7.4.4




^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake
  2013-06-21 23:51 ` [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake Tim Chen
@ 2013-06-22  0:10   ` Alex Shi
  2013-06-22  0:15     ` Davidlohr Bueso
  2013-06-24 16:34     ` Tim Chen
  2013-06-22  7:21   ` Peter Hurley
  1 sibling, 2 replies; 25+ messages in thread
From: Alex Shi @ 2013-06-22  0:10 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On 06/22/2013 07:51 AM, Tim Chen wrote:
> Doing cmpxchg will cause cache bouncing when checking
> sem->count. This could cause scalability issue
> in a large machine (e.g. a 80 cores box).
> 
> A pre-read of sem->count can mitigate this.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

Hi Tim,
there is a technical error in this patch.
the "From: " line should be 'Alex Shi', since he made the most input of
this patch.

And I still think split this patch to 4 smaller will make it more simple
to review, that I had sent you and Davidlohr.

could you like to re-send with my 4 patch version? :)

> ---
>  include/asm-generic/rwsem.h |    8 ++++----
>  lib/rwsem.c                 |   21 +++++++++++++--------
>  2 files changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
> index bb1e2cd..052d973 100644
> --- a/include/asm-generic/rwsem.h
> +++ b/include/asm-generic/rwsem.h
> @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
>  
>  static inline int __down_write_trylock(struct rw_semaphore *sem)
>  {
> -	long tmp;
> +	if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
> +		return 0;
>  
> -	tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> -		      RWSEM_ACTIVE_WRITE_BIAS);
> -	return tmp == RWSEM_UNLOCKED_VALUE;
> +	return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> +		RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
>  }
>  
>  /*
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index 19c5fa9..2072af5 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
>  			 * will block as they will notice the queued writer.
>  			 */
>  			wake_up_process(waiter->task);
> -		goto out;
> +		return sem;
>  	}
>  
>  	/* Writers might steal the lock before we grant it to the next reader.
> @@ -85,15 +85,21 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
>  	adjustment = 0;
>  	if (wake_type != RWSEM_WAKE_READ_OWNED) {
>  		adjustment = RWSEM_ACTIVE_READ_BIAS;
> - try_reader_grant:
> -		oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> -		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> -			/* A writer stole the lock. Undo our reader grant. */
> +		while (1) {
> +			/* A writer stole the lock. */
> +			if (sem->count < RWSEM_WAITING_BIAS)
> +				return sem;
> +
> +			oldcount = rwsem_atomic_update(adjustment, sem)
> +								- adjustment;
> +			if (likely(oldcount >= RWSEM_WAITING_BIAS))
> +				break;
> +
> +			 /* A writer stole the lock.  Undo our reader grant. */
>  			if (rwsem_atomic_update(-adjustment, sem) &
>  						RWSEM_ACTIVE_MASK)
> -				goto out;
> +				return sem;
>  			/* Last active locker left. Retry waking readers. */
> -			goto try_reader_grant;
>  		}
>  	}
>  
> @@ -136,7 +142,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
>  	sem->wait_list.next = next;
>  	next->prev = &sem->wait_list;
>  
> - out:
>  	return sem;
>  }
>  
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake
  2013-06-22  0:10   ` Alex Shi
@ 2013-06-22  0:15     ` Davidlohr Bueso
  2013-06-24 16:34     ` Tim Chen
  1 sibling, 0 replies; 25+ messages in thread
From: Davidlohr Bueso @ 2013-06-22  0:15 UTC (permalink / raw)
  To: Alex Shi
  Cc: Tim Chen, Ingo Molnar, Andrew Morton, Andrea Arcangeli,
	Andi Kleen, Michel Lespinasse, Matthew R Wilcox, Dave Hansen,
	Peter Zijlstra, Rik van Riel, linux-kernel, linux-mm

On Sat, 2013-06-22 at 08:10 +0800, Alex Shi wrote:
> On 06/22/2013 07:51 AM, Tim Chen wrote:
> > Doing cmpxchg will cause cache bouncing when checking
> > sem->count. This could cause scalability issue
> > in a large machine (e.g. a 80 cores box).
> > 
> > A pre-read of sem->count can mitigate this.
> > 
> > Signed-off-by: Alex Shi <alex.shi@intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> 
> Hi Tim,
> there is a technical error in this patch.
> the "From: " line should be 'Alex Shi', since he made the most input of
> this patch.
> 
> And I still think split this patch to 4 smaller will make it more simple
> to review, that I had sent you and Davidlohr.

Yep, and you had updated the changelog for 1/4: rwsem: check the lock
before cpmxchg in down_write_trylock to:

"cmpxchg will cause cache bouncing when do the value checking, that
cause scalability issue in a large machine (like a 80 cores box).

A lock status pre-read can relief this."

> 
> could you like to re-send with my 4 patch version? :)

For those 4 patches:
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake
  2013-06-22  0:10   ` Alex Shi
  2013-06-22  0:15     ` Davidlohr Bueso
@ 2013-06-24 16:34     ` Tim Chen
  1 sibling, 0 replies; 25+ messages in thread
From: Tim Chen @ 2013-06-24 16:34 UTC (permalink / raw)
  To: Alex Shi
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, Andi Kleen,
	Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On Sat, 2013-06-22 at 08:10 +0800, Alex Shi wrote:
> On 06/22/2013 07:51 AM, Tim Chen wrote:
> > Doing cmpxchg will cause cache bouncing when checking
> > sem->count. This could cause scalability issue
> > in a large machine (e.g. a 80 cores box).
> > 
> > A pre-read of sem->count can mitigate this.
> > 
> > Signed-off-by: Alex Shi <alex.shi@intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> 
> Hi Tim,
> there is a technical error in this patch.
> the "From: " line should be 'Alex Shi', since he made the most input of
> this patch.
> 

Actually in my local git repository, I have the From line as from you
that's reflected in the change log of patch[0/2].
I'll put this as your signed-off only if you preferred for v2.  The
"From" field was you in my git but changes when I send them out.


> And I still think split this patch to 4 smaller will make it more simple
> to review, that I had sent you and Davidlohr.
> 
> could you like to re-send with my 4 patch version? :)
> 
> > ---
> >  include/asm-generic/rwsem.h |    8 ++++----
> >  lib/rwsem.c                 |   21 +++++++++++++--------
> >  2 files changed, 17 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
> > index bb1e2cd..052d973 100644
> > --- a/include/asm-generic/rwsem.h
> > +++ b/include/asm-generic/rwsem.h
> > @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
> >  
> >  static inline int __down_write_trylock(struct rw_semaphore *sem)
> >  {
> > -	long tmp;
> > +	if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
> > +		return 0;
> >  
> > -	tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> > -		      RWSEM_ACTIVE_WRITE_BIAS);
> > -	return tmp == RWSEM_UNLOCKED_VALUE;
> > +	return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> > +		RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
> >  }
> >  
> >  /*
> > diff --git a/lib/rwsem.c b/lib/rwsem.c
> > index 19c5fa9..2072af5 100644
> > --- a/lib/rwsem.c
> > +++ b/lib/rwsem.c
> > @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> >  			 * will block as they will notice the queued writer.
> >  			 */
> >  			wake_up_process(waiter->task);
> > -		goto out;
> > +		return sem;
> >  	}
> >  
> >  	/* Writers might steal the lock before we grant it to the next reader.
> > @@ -85,15 +85,21 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> >  	adjustment = 0;
> >  	if (wake_type != RWSEM_WAKE_READ_OWNED) {
> >  		adjustment = RWSEM_ACTIVE_READ_BIAS;
> > - try_reader_grant:
> > -		oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> > -		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> > -			/* A writer stole the lock. Undo our reader grant. */
> > +		while (1) {
> > +			/* A writer stole the lock. */
> > +			if (sem->count < RWSEM_WAITING_BIAS)
> > +				return sem;
> > +
> > +			oldcount = rwsem_atomic_update(adjustment, sem)
> > +								- adjustment;
> > +			if (likely(oldcount >= RWSEM_WAITING_BIAS))
> > +				break;
> > +
> > +			 /* A writer stole the lock.  Undo our reader grant. */
> >  			if (rwsem_atomic_update(-adjustment, sem) &
> >  						RWSEM_ACTIVE_MASK)
> > -				goto out;
> > +				return sem;
> >  			/* Last active locker left. Retry waking readers. */
> > -			goto try_reader_grant;
> >  		}
> >  	}
> >  
> > @@ -136,7 +142,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> >  	sem->wait_list.next = next;
> >  	next->prev = &sem->wait_list;
> >  
> > - out:
> >  	return sem;
> >  }
> >  
> > 
> 
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake
  2013-06-21 23:51 ` [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake Tim Chen
  2013-06-22  0:10   ` Alex Shi
@ 2013-06-22  7:21   ` Peter Hurley
  2013-06-23  1:16     ` Alex Shi
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Hurley @ 2013-06-22  7:21 UTC (permalink / raw)
  To: Tim Chen, Alex Shi, Michel Lespinasse
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, Andi Kleen,
	Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
	Rik van Riel, linux-kernel, linux-mm

On 06/21/2013 07:51 PM, Tim Chen wrote:
> Doing cmpxchg will cause cache bouncing when checking
> sem->count. This could cause scalability issue
> in a large machine (e.g. a 80 cores box).
>
> A pre-read of sem->count can mitigate this.
>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>   include/asm-generic/rwsem.h |    8 ++++----
>   lib/rwsem.c                 |   21 +++++++++++++--------
>   2 files changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
> index bb1e2cd..052d973 100644
> --- a/include/asm-generic/rwsem.h
> +++ b/include/asm-generic/rwsem.h
> @@ -70,11 +70,11 @@ static inline void __down_write(struct rw_semaphore *sem)
>
>   static inline int __down_write_trylock(struct rw_semaphore *sem)
>   {
> -	long tmp;
> +	if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
                      ^^^^^^^^^^^

This is probably not what you want.


> +		return 0;
>
> -	tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> -		      RWSEM_ACTIVE_WRITE_BIAS);
> -	return tmp == RWSEM_UNLOCKED_VALUE;
> +	return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
> +		RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
>   }
>
>   /*
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index 19c5fa9..2072af5 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
>   			 * will block as they will notice the queued writer.
>   			 */
>   			wake_up_process(waiter->task);
> -		goto out;
> +		return sem;

Please put these flow control changes in a separate patch.


>   	}
>
>   	/* Writers might steal the lock before we grant it to the next reader.
> @@ -85,15 +85,21 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
>   	adjustment = 0;
>   	if (wake_type != RWSEM_WAKE_READ_OWNED) {
>   		adjustment = RWSEM_ACTIVE_READ_BIAS;
> - try_reader_grant:
> -		oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
> -		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
> -			/* A writer stole the lock. Undo our reader grant. */
> +		while (1) {
> +			/* A writer stole the lock. */
> +			if (sem->count < RWSEM_WAITING_BIAS)
> +				return sem;

I'm all for structured looping instead of goto labels but this optimization
is only useful on the 1st iteration. IOW, on the second iteration you already
know that you need to try for reclaiming the lock.


> +
> +			oldcount = rwsem_atomic_update(adjustment, sem)
> +								- adjustment;
> +			if (likely(oldcount >= RWSEM_WAITING_BIAS))
> +				break;
> +
> +			 /* A writer stole the lock.  Undo our reader grant. */
>   			if (rwsem_atomic_update(-adjustment, sem) &
>   						RWSEM_ACTIVE_MASK)
> -				goto out;
> +				return sem;
>   			/* Last active locker left. Retry waking readers. */
> -			goto try_reader_grant;
>   		}
>   	}
>
> @@ -136,7 +142,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
>   	sem->wait_list.next = next;
>   	next->prev = &sem->wait_list;
>
> - out:
>   	return sem;
>   }


Alex and Tim,

Was there a v1 of this series; ie., is this v2 (or higher)?

How are you validating lock correctness/behavior with this series?

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake
  2013-06-22  7:21   ` Peter Hurley
@ 2013-06-23  1:16     ` Alex Shi
  2013-06-23  5:10       ` Andi Kleen
  0 siblings, 1 reply; 25+ messages in thread
From: Alex Shi @ 2013-06-23  1:16 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Tim Chen, Michel Lespinasse, Ingo Molnar, Andrew Morton,
	Andrea Arcangeli, Andi Kleen, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On 06/22/2013 03:21 PM, Peter Hurley wrote:
> On 06/21/2013 07:51 PM, Tim Chen wrote:
>> Doing cmpxchg will cause cache bouncing when checking
>> sem->count. This could cause scalability issue
>> in a large machine (e.g. a 80 cores box).
>>
>> A pre-read of sem->count can mitigate this.
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>   include/asm-generic/rwsem.h |    8 ++++----
>>   lib/rwsem.c                 |   21 +++++++++++++--------
>>   2 files changed, 17 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
>> index bb1e2cd..052d973 100644
>> --- a/include/asm-generic/rwsem.h
>> +++ b/include/asm-generic/rwsem.h
>> @@ -70,11 +70,11 @@ static inline void __down_write(struct
>> rw_semaphore *sem)
>>
>>   static inline int __down_write_trylock(struct rw_semaphore *sem)
>>   {
>> -    long tmp;
>> +    if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
>                      ^^^^^^^^^^^
> 
> This is probably not what you want.
> 

this function logical is quite simple. check the sem->count before
cmpxchg is no harm this logical.

So could you like to tell us what should we want?

> 
>> +        return 0;
>>
>> -    tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
>> -              RWSEM_ACTIVE_WRITE_BIAS);
>> -    return tmp == RWSEM_UNLOCKED_VALUE;
>> +    return cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
>> +        RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_UNLOCKED_VALUE;
>>   }
>>
>>   /*
>> diff --git a/lib/rwsem.c b/lib/rwsem.c
>> index 19c5fa9..2072af5 100644
>> --- a/lib/rwsem.c
>> +++ b/lib/rwsem.c
>> @@ -75,7 +75,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum
>> rwsem_wake_type wake_type)
>>                * will block as they will notice the queued writer.
>>                */
>>               wake_up_process(waiter->task);
>> -        goto out;
>> +        return sem;
> 
> Please put these flow control changes in a separate patch.

I had sent the split patches to Tim&Davidlohr. They will send them out
as a single patchset.
> 
> 
>>       }
>>
>>       /* Writers might steal the lock before we grant it to the next
>> reader.
>> @@ -85,15 +85,21 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum
>> rwsem_wake_type wake_type)
>>       adjustment = 0;
>>       if (wake_type != RWSEM_WAKE_READ_OWNED) {
>>           adjustment = RWSEM_ACTIVE_READ_BIAS;
>> - try_reader_grant:
>> -        oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
>> -        if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
>> -            /* A writer stole the lock. Undo our reader grant. */
>> +        while (1) {
>> +            /* A writer stole the lock. */
>> +            if (sem->count < RWSEM_WAITING_BIAS)
>> +                return sem;
> 
> I'm all for structured looping instead of goto labels but this optimization
> is only useful on the 1st iteration. IOW, on the second iteration you
> already
> know that you need to try for reclaiming the lock.
> 

sorry. could you like to say more clear, what's the 1st or 2nd iteration
or others?
> 
>> +
>> +            oldcount = rwsem_atomic_update(adjustment, sem)
>> +                                - adjustment;
>> +            if (likely(oldcount >= RWSEM_WAITING_BIAS))
>> +                break;
>> +
>> +             /* A writer stole the lock.  Undo our reader grant. */
>>               if (rwsem_atomic_update(-adjustment, sem) &
>>                           RWSEM_ACTIVE_MASK)
>> -                goto out;
>> +                return sem;
>>               /* Last active locker left. Retry waking readers. */
>> -            goto try_reader_grant;
>>           }
>>       }
>>
>> @@ -136,7 +142,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum
>> rwsem_wake_type wake_type)
>>       sem->wait_list.next = next;
>>       next->prev = &sem->wait_list;
>>
>> - out:
>>       return sem;
>>   }
> 
> 
> Alex and Tim,
> 
> Was there a v1 of this series; ie., is this v2 (or higher)?
> 
> How are you validating lock correctness/behavior with this series?

some benchmark tested against this patch, mainly aim7. plus by eyes, we
didn't change the logical except check the lock value before  do locking
> 
> Regards,
> Peter Hurley
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake
  2013-06-23  1:16     ` Alex Shi
@ 2013-06-23  5:10       ` Andi Kleen
  2013-06-23 11:52         ` Alex Shi
  0 siblings, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2013-06-23  5:10 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Hurley, Tim Chen, Michel Lespinasse, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	linux-kernel, linux-mm

> >> diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
> >> index bb1e2cd..052d973 100644
> >> --- a/include/asm-generic/rwsem.h
> >> +++ b/include/asm-generic/rwsem.h
> >> @@ -70,11 +70,11 @@ static inline void __down_write(struct
> >> rw_semaphore *sem)
> >>
> >>   static inline int __down_write_trylock(struct rw_semaphore *sem)
> >>   {
> >> -    long tmp;
> >> +    if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
> >                      ^^^^^^^^^^^
> > 
> > This is probably not what you want.
> > 
> 
> this function logical is quite simple. check the sem->count before
> cmpxchg is no harm this logical.
> 
> So could you like to tell us what should we want?

You are comparing the address, not the value. Remove the &
This was a nop too.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake
  2013-06-23  5:10       ` Andi Kleen
@ 2013-06-23 11:52         ` Alex Shi
  0 siblings, 0 replies; 25+ messages in thread
From: Alex Shi @ 2013-06-23 11:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Hurley, Tim Chen, Michel Lespinasse, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Davidlohr Bueso,
	Matthew R Wilcox, Dave Hansen, Peter Zijlstra, Rik van Riel,
	linux-kernel, linux-mm

On 06/23/2013 01:10 PM, Andi Kleen wrote:
>>>> diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
>>>> > >> index bb1e2cd..052d973 100644
>>>> > >> --- a/include/asm-generic/rwsem.h
>>>> > >> +++ b/include/asm-generic/rwsem.h
>>>> > >> @@ -70,11 +70,11 @@ static inline void __down_write(struct
>>>> > >> rw_semaphore *sem)
>>>> > >>
>>>> > >>   static inline int __down_write_trylock(struct rw_semaphore *sem)
>>>> > >>   {
>>>> > >> -    long tmp;
>>>> > >> +    if (unlikely(&sem->count != RWSEM_UNLOCKED_VALUE))
>>> > >                      ^^^^^^^^^^^
>>> > > 
>>> > > This is probably not what you want.
>>> > > 
>> > 
>> > this function logical is quite simple. check the sem->count before
>> > cmpxchg is no harm this logical.
>> > 
>> > So could you like to tell us what should we want?
> You are comparing the address, not the value. Remove the &
> This was a nop too.

ops, So stupid I am! :(

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
       [not found] <cover.1371855277.git.tim.c.chen@linux.intel.com>
  2013-06-21 23:51 ` [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake Tim Chen
@ 2013-06-21 23:51 ` Tim Chen
  2013-06-22  0:00   ` Davidlohr Bueso
                     ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Tim Chen @ 2013-06-21 23:51 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Andrea Arcangeli, Alex Shi, Andi Kleen, Michel Lespinasse,
	Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
	Rik van Riel, Tim Chen, linux-kernel, linux-mm

Introduce in this patch optimistic spinning for writer lock
acquisition in read write semaphore.  The logic is
similar to the optimistic spinning in mutex but without
the MCS lock queueing of the spinner.  This provides a
better chance for a writer to acquire the lock before
being we block it and put it to sleep.

Disabling of pre-emption during optimistic spinning
was suggested by Davidlohr Bueso.  It
improved performance of aim7 for his test suite.

Combined with the patch to avoid unnecesary cmpxchg,
in testing by Davidlohr Bueso on aim7 workloads
on 8 socket 80 cores system, he saw improvements of
alltests (+14.5%), custom (+17%), disk (+11%), high_systime
(+5%), shared (+15%) and short (+4%), most of them after around 500
users when he implemented i_mmap as rwsem.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 Makefile              |    2 +-
 include/linux/rwsem.h |    3 +
 init/Kconfig          |    9 +++
 kernel/rwsem.c        |   29 +++++++++-
 lib/rwsem.c           |  148 +++++++++++++++++++++++++++++++++++++++++++++----
 5 files changed, 178 insertions(+), 13 deletions(-)

diff --git a/Makefile b/Makefile
index 49aa84b..7d1ef64 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 VERSION = 3
 PATCHLEVEL = 10
 SUBLEVEL = 0
-EXTRAVERSION = -rc4
+EXTRAVERSION = -rc4-optspin4
 NAME = Unicycling Gorilla
 
 # *DOCUMENTATION*
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 0616ffe..0c5933b 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -29,6 +29,9 @@ struct rw_semaphore {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map	dep_map;
 #endif
+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+	struct task_struct	*owner;
+#endif
 };
 
 extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
diff --git a/init/Kconfig b/init/Kconfig
index 9d3a788..1c582d1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1595,6 +1595,15 @@ config TRACEPOINTS
 
 source "arch/Kconfig"
 
+config RWSEM_SPIN_ON_WRITE_OWNER
+	bool "Optimistic spin write acquisition for writer owned rw-sem"
+	default n
+	depends on SMP
+	help
+	  Allows a writer to perform optimistic spinning if another writer own
+	  the read write semaphore.  This gives a greater chance for writer to
+	  acquire a semaphore before blocking it and putting it to sleep.
+
 endmenu		# General setup
 
 config HAVE_GENERIC_DMA_COHERENT
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cfff143..a32990a 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -12,6 +12,26 @@
 
 #include <linux/atomic.h>
 
+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+	sem->owner = current;
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+	sem->owner = NULL;
+}
+#else
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+}
+#endif
+
 /*
  * lock for reading
  */
@@ -48,6 +68,7 @@ void __sched down_write(struct rw_semaphore *sem)
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+	rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write);
@@ -59,8 +80,10 @@ int down_write_trylock(struct rw_semaphore *sem)
 {
 	int ret = __down_write_trylock(sem);
 
-	if (ret == 1)
+	if (ret == 1) {
 		rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
+		rwsem_set_owner(sem);
+	}
 	return ret;
 }
 
@@ -86,6 +109,7 @@ void up_write(struct rw_semaphore *sem)
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
 
 	__up_write(sem);
+	rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(up_write);
@@ -100,6 +124,7 @@ void downgrade_write(struct rw_semaphore *sem)
 	 * dependency.
 	 */
 	__downgrade_write(sem);
+	rwsem_clear_owner(sem);
 }
 
 EXPORT_SYMBOL(downgrade_write);
@@ -122,6 +147,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
 	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
 
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+	rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(_down_write_nest_lock);
@@ -141,6 +167,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
+	rwsem_set_owner(sem);
 }
 
 EXPORT_SYMBOL(down_write_nested);
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 2072af5..8e331c5 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -8,6 +8,7 @@
  */
 #include <linux/rwsem.h>
 #include <linux/sched.h>
+#include <linux/sched/rt.h>
 #include <linux/init.h>
 #include <linux/export.h>
 
@@ -27,6 +28,9 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	sem->count = RWSEM_UNLOCKED_VALUE;
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+	sem->owner = NULL;
+#endif
 }
 
 EXPORT_SYMBOL(__init_rwsem);
@@ -192,6 +196,128 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 	return sem;
 }
 
+static inline int rwsem_try_write_lock(long count, bool need_lock,
+	struct rw_semaphore *sem)
+{
+	if (!(count & RWSEM_ACTIVE_MASK)) {
+		/* Try acquiring the write lock. */
+		if (sem->count == RWSEM_WAITING_BIAS &&
+		    cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
+			    RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
+			if (need_lock)
+				raw_spin_lock_irq(&sem->wait_lock);
+			if (!list_is_singular(&sem->wait_list))
+				rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
+			return 1;
+		}
+	}
+	return 0;
+}
+
+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	int retval = true;
+
+	/* Spin only if active writer running */
+	if (!sem->owner)
+		return false;
+
+	rcu_read_lock();
+	if (sem->owner)
+		retval = sem->owner->on_cpu;
+	rcu_read_unlock();
+	/*
+	 * if lock->owner is not set, the sem owner may have just acquired
+	 * it and not set the owner yet, or the sem has been released, or
+	 * reader active.
+	 */
+	return retval;
+}
+
+static inline bool owner_running(struct rw_semaphore *lock,
+				struct task_struct *owner)
+{
+	if (lock->owner != owner)
+		return false;
+
+	/*
+	 * Ensure we emit the owner->on_cpu, dereference _after_ checking
+	 * lock->owner still matches owner, if that fails, owner might
+	 * point to free()d memory, if it still matches, the rcu_read_lock()
+	 * ensures the memory stays valid.
+	 */
+	barrier();
+
+	return owner->on_cpu;
+}
+
+static noinline
+int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
+{
+	rcu_read_lock();
+	while (owner_running(lock, owner)) {
+		if (need_resched())
+			break;
+
+		arch_mutex_cpu_relax();
+	}
+	rcu_read_unlock();
+
+	/*
+	 * We break out the loop above on need_resched() and when the
+	 * owner changed, which is a sign for heavy contention. Return
+	 * success only when lock->owner is NULL.
+	 */
+	return lock->owner == NULL;
+}
+
+int rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	struct	task_struct	*owner;
+	int	ret = 0;
+
+	/* sem->wait_lock should not be held when doing optimistic spinning */
+	if (!rwsem_can_spin_on_owner(sem))
+		return ret;
+
+	preempt_disable();
+	for (;;) {
+		owner = ACCESS_ONCE(sem->owner);
+		if (owner && !rwsem_spin_on_owner(sem, owner))
+			break;
+
+		/* wait_lock will be acquired if write_lock is obtained */
+		if (rwsem_try_write_lock(sem->count, true, sem)) {
+			ret = 1;
+			goto out;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!owner && (need_resched() || rt_task(current)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		arch_mutex_cpu_relax();
+
+	}
+
+out:
+	preempt_enable();
+	return ret;
+}
+#endif
+
 /*
  * wait until we successfully acquire the write lock
  */
@@ -200,6 +326,9 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 	long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
 	struct rwsem_waiter waiter;
 	struct task_struct *tsk = current;
+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+	bool try_optimistic_spin = true;
+#endif
 
 	/* set up my own style of waitqueue */
 	waiter.task = tsk;
@@ -223,20 +352,17 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 	/* wait until we successfully acquire the lock */
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (true) {
-		if (!(count & RWSEM_ACTIVE_MASK)) {
-			/* Try acquiring the write lock. */
-			count = RWSEM_ACTIVE_WRITE_BIAS;
-			if (!list_is_singular(&sem->wait_list))
-				count += RWSEM_WAITING_BIAS;
-
-			if (sem->count == RWSEM_WAITING_BIAS &&
-			    cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
-							RWSEM_WAITING_BIAS)
-				break;
-		}
+		if (rwsem_try_write_lock(count, false, sem))
+			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
 
+#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
+		/* do optimistic spinning */
+		if (try_optimistic_spin && rwsem_optimistic_spin(sem))
+			break;
+		try_optimistic_spin = false;
+#endif
 		/* Block until there are no active lockers. */
 		do {
 			schedule();
-- 
1.7.4.4



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-21 23:51 ` [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition Tim Chen
@ 2013-06-22  0:00   ` Davidlohr Bueso
  2013-06-22  7:57   ` Peter Hurley
  2013-06-24  8:46   ` Peter Zijlstra
  2 siblings, 0 replies; 25+ messages in thread
From: Davidlohr Bueso @ 2013-06-22  0:00 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, Alex Shi,
	Andi Kleen, Michel Lespinasse, Matthew R Wilcox, Dave Hansen,
	Peter Zijlstra, Rik van Riel, linux-kernel, linux-mm

On Fri, 2013-06-21 at 16:51 -0700, Tim Chen wrote:
> Introduce in this patch optimistic spinning for writer lock
> acquisition in read write semaphore.  The logic is
> similar to the optimistic spinning in mutex but without
> the MCS lock queueing of the spinner.  This provides a
> better chance for a writer to acquire the lock before
> being we block it and put it to sleep.
> 
> Disabling of pre-emption during optimistic spinning
> was suggested by Davidlohr Bueso.  It
> improved performance of aim7 for his test suite.
> 
> Combined with the patch to avoid unnecesary cmpxchg,
> in testing by Davidlohr Bueso on aim7 workloads
> on 8 socket 80 cores system, he saw improvements of
> alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> (+5%), shared (+15%) and short (+4%), most of them after around 500
> users when he implemented i_mmap as rwsem.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>

> ---
>  Makefile              |    2 +-
>  include/linux/rwsem.h |    3 +
>  init/Kconfig          |    9 +++
>  kernel/rwsem.c        |   29 +++++++++-
>  lib/rwsem.c           |  148 +++++++++++++++++++++++++++++++++++++++++++++----
>  5 files changed, 178 insertions(+), 13 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index 49aa84b..7d1ef64 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1,7 +1,7 @@
>  VERSION = 3
>  PATCHLEVEL = 10
>  SUBLEVEL = 0
> -EXTRAVERSION = -rc4
> +EXTRAVERSION = -rc4-optspin4
>  NAME = Unicycling Gorilla
>  
>  # *DOCUMENTATION*
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index 0616ffe..0c5933b 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -29,6 +29,9 @@ struct rw_semaphore {
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  	struct lockdep_map	dep_map;
>  #endif
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +	struct task_struct	*owner;
> +#endif
>  };
>  
>  extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d3a788..1c582d1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1595,6 +1595,15 @@ config TRACEPOINTS
>  
>  source "arch/Kconfig"
>  
> +config RWSEM_SPIN_ON_WRITE_OWNER
> +	bool "Optimistic spin write acquisition for writer owned rw-sem"
> +	default n
> +	depends on SMP
> +	help
> +	  Allows a writer to perform optimistic spinning if another writer own
> +	  the read write semaphore.  This gives a greater chance for writer to
> +	  acquire a semaphore before blocking it and putting it to sleep.
> +
>  endmenu		# General setup
>  
>  config HAVE_GENERIC_DMA_COHERENT
> diff --git a/kernel/rwsem.c b/kernel/rwsem.c
> index cfff143..a32990a 100644
> --- a/kernel/rwsem.c
> +++ b/kernel/rwsem.c
> @@ -12,6 +12,26 @@
>  
>  #include <linux/atomic.h>
>  
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> +{
> +	sem->owner = current;
> +}
> +
> +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> +{
> +	sem->owner = NULL;
> +}
> +#else
> +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> +{
> +}
> +
> +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> +{
> +}
> +#endif
> +
>  /*
>   * lock for reading
>   */
> @@ -48,6 +68,7 @@ void __sched down_write(struct rw_semaphore *sem)
>  	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
>  
>  	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> +	rwsem_set_owner(sem);
>  }
>  
>  EXPORT_SYMBOL(down_write);
> @@ -59,8 +80,10 @@ int down_write_trylock(struct rw_semaphore *sem)
>  {
>  	int ret = __down_write_trylock(sem);
>  
> -	if (ret == 1)
> +	if (ret == 1) {
>  		rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
> +		rwsem_set_owner(sem);
> +	}
>  	return ret;
>  }
>  
> @@ -86,6 +109,7 @@ void up_write(struct rw_semaphore *sem)
>  	rwsem_release(&sem->dep_map, 1, _RET_IP_);
>  
>  	__up_write(sem);
> +	rwsem_clear_owner(sem);
>  }
>  
>  EXPORT_SYMBOL(up_write);
> @@ -100,6 +124,7 @@ void downgrade_write(struct rw_semaphore *sem)
>  	 * dependency.
>  	 */
>  	__downgrade_write(sem);
> +	rwsem_clear_owner(sem);
>  }
>  
>  EXPORT_SYMBOL(downgrade_write);
> @@ -122,6 +147,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
>  	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
>  
>  	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> +	rwsem_set_owner(sem);
>  }
>  
>  EXPORT_SYMBOL(_down_write_nest_lock);
> @@ -141,6 +167,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
>  	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
>  
>  	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> +	rwsem_set_owner(sem);
>  }
>  
>  EXPORT_SYMBOL(down_write_nested);
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index 2072af5..8e331c5 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -8,6 +8,7 @@
>   */
>  #include <linux/rwsem.h>
>  #include <linux/sched.h>
> +#include <linux/sched/rt.h>
>  #include <linux/init.h>
>  #include <linux/export.h>
>  
> @@ -27,6 +28,9 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
>  	sem->count = RWSEM_UNLOCKED_VALUE;
>  	raw_spin_lock_init(&sem->wait_lock);
>  	INIT_LIST_HEAD(&sem->wait_list);
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +	sem->owner = NULL;
> +#endif
>  }
>  
>  EXPORT_SYMBOL(__init_rwsem);
> @@ -192,6 +196,128 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
>  	return sem;
>  }
>  
> +static inline int rwsem_try_write_lock(long count, bool need_lock,
> +	struct rw_semaphore *sem)
> +{
> +	if (!(count & RWSEM_ACTIVE_MASK)) {
> +		/* Try acquiring the write lock. */
> +		if (sem->count == RWSEM_WAITING_BIAS &&
> +		    cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
> +			    RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
> +			if (need_lock)
> +				raw_spin_lock_irq(&sem->wait_lock);
> +			if (!list_is_singular(&sem->wait_list))
> +				rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
> +			return 1;
> +		}
> +	}
> +	return 0;
> +}
> +
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> +{
> +	int retval = true;
> +
> +	/* Spin only if active writer running */
> +	if (!sem->owner)
> +		return false;
> +
> +	rcu_read_lock();
> +	if (sem->owner)
> +		retval = sem->owner->on_cpu;
> +	rcu_read_unlock();
> +	/*
> +	 * if lock->owner is not set, the sem owner may have just acquired
> +	 * it and not set the owner yet, or the sem has been released, or
> +	 * reader active.
> +	 */
> +	return retval;
> +}
> +
> +static inline bool owner_running(struct rw_semaphore *lock,
> +				struct task_struct *owner)
> +{
> +	if (lock->owner != owner)
> +		return false;
> +
> +	/*
> +	 * Ensure we emit the owner->on_cpu, dereference _after_ checking
> +	 * lock->owner still matches owner, if that fails, owner might
> +	 * point to free()d memory, if it still matches, the rcu_read_lock()
> +	 * ensures the memory stays valid.
> +	 */
> +	barrier();
> +
> +	return owner->on_cpu;
> +}
> +
> +static noinline
> +int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
> +{
> +	rcu_read_lock();
> +	while (owner_running(lock, owner)) {
> +		if (need_resched())
> +			break;
> +
> +		arch_mutex_cpu_relax();
> +	}
> +	rcu_read_unlock();
> +
> +	/*
> +	 * We break out the loop above on need_resched() and when the
> +	 * owner changed, which is a sign for heavy contention. Return
> +	 * success only when lock->owner is NULL.
> +	 */
> +	return lock->owner == NULL;
> +}
> +
> +int rwsem_optimistic_spin(struct rw_semaphore *sem)
> +{
> +	struct	task_struct	*owner;
> +	int	ret = 0;
> +
> +	/* sem->wait_lock should not be held when doing optimistic spinning */
> +	if (!rwsem_can_spin_on_owner(sem))
> +		return ret;
> +
> +	preempt_disable();
> +	for (;;) {
> +		owner = ACCESS_ONCE(sem->owner);
> +		if (owner && !rwsem_spin_on_owner(sem, owner))
> +			break;
> +
> +		/* wait_lock will be acquired if write_lock is obtained */
> +		if (rwsem_try_write_lock(sem->count, true, sem)) {
> +			ret = 1;
> +			goto out;
> +		}
> +
> +		/*
> +		 * When there's no owner, we might have preempted between the
> +		 * owner acquiring the lock and setting the owner field. If
> +		 * we're an RT task that will live-lock because we won't let
> +		 * the owner complete.
> +		 */
> +		if (!owner && (need_resched() || rt_task(current)))
> +			break;
> +
> +		/*
> +		 * The cpu_relax() call is a compiler barrier which forces
> +		 * everything in this loop to be re-loaded. We don't need
> +		 * memory barriers as we'll eventually observe the right
> +		 * values at the cost of a few extra spins.
> +		 */
> +		arch_mutex_cpu_relax();
> +
> +	}
> +
> +out:
> +	preempt_enable();
> +	return ret;
> +}
> +#endif
> +
>  /*
>   * wait until we successfully acquire the write lock
>   */
> @@ -200,6 +326,9 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
>  	long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
>  	struct rwsem_waiter waiter;
>  	struct task_struct *tsk = current;
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +	bool try_optimistic_spin = true;
> +#endif
>  
>  	/* set up my own style of waitqueue */
>  	waiter.task = tsk;
> @@ -223,20 +352,17 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
>  	/* wait until we successfully acquire the lock */
>  	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>  	while (true) {
> -		if (!(count & RWSEM_ACTIVE_MASK)) {
> -			/* Try acquiring the write lock. */
> -			count = RWSEM_ACTIVE_WRITE_BIAS;
> -			if (!list_is_singular(&sem->wait_list))
> -				count += RWSEM_WAITING_BIAS;
> -
> -			if (sem->count == RWSEM_WAITING_BIAS &&
> -			    cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
> -							RWSEM_WAITING_BIAS)
> -				break;
> -		}
> +		if (rwsem_try_write_lock(count, false, sem))
> +			break;
>  
>  		raw_spin_unlock_irq(&sem->wait_lock);
>  
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +		/* do optimistic spinning */
> +		if (try_optimistic_spin && rwsem_optimistic_spin(sem))
> +			break;
> +		try_optimistic_spin = false;
> +#endif
>  		/* Block until there are no active lockers. */
>  		do {
>  			schedule();



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-21 23:51 ` [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition Tim Chen
  2013-06-22  0:00   ` Davidlohr Bueso
@ 2013-06-22  7:57   ` Peter Hurley
  2013-06-23 20:03     ` Davidlohr Bueso
  2013-06-24 21:58     ` Tim Chen
  2013-06-24  8:46   ` Peter Zijlstra
  2 siblings, 2 replies; 25+ messages in thread
From: Peter Hurley @ 2013-06-22  7:57 UTC (permalink / raw)
  To: Tim Chen, Alex Shi, Michel Lespinasse
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, Andi Kleen,
	Davidlohr Bueso, Matthew R Wilcox, Dave Hansen, Peter Zijlstra,
	Rik van Riel, linux-kernel, linux-mm

On 06/21/2013 07:51 PM, Tim Chen wrote:
> Introduce in this patch optimistic spinning for writer lock
> acquisition in read write semaphore.  The logic is
> similar to the optimistic spinning in mutex but without
> the MCS lock queueing of the spinner.  This provides a
> better chance for a writer to acquire the lock before
> being we block it and put it to sleep.

This is just my opinion but I'd rather read the justification
here instead of referencing mutex logic that may or may not
exist in 2 years.


> Disabling of pre-emption during optimistic spinning
> was suggested by Davidlohr Bueso.  It
> improved performance of aim7 for his test suite.
>
> Combined with the patch to avoid unnecesary cmpxchg,
> in testing by Davidlohr Bueso on aim7 workloads
> on 8 socket 80 cores system, he saw improvements of
> alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> (+5%), shared (+15%) and short (+4%), most of them after around 500
> users when he implemented i_mmap as rwsem.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>   Makefile              |    2 +-
>   include/linux/rwsem.h |    3 +
>   init/Kconfig          |    9 +++
>   kernel/rwsem.c        |   29 +++++++++-
>   lib/rwsem.c           |  148 +++++++++++++++++++++++++++++++++++++++++++++----
>   5 files changed, 178 insertions(+), 13 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index 49aa84b..7d1ef64 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1,7 +1,7 @@
>   VERSION = 3
>   PATCHLEVEL = 10
>   SUBLEVEL = 0
> -EXTRAVERSION = -rc4
> +EXTRAVERSION = -rc4-optspin4
>   NAME = Unicycling Gorilla
>
>   # *DOCUMENTATION*
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index 0616ffe..0c5933b 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -29,6 +29,9 @@ struct rw_semaphore {
>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   	struct lockdep_map	dep_map;
>   #endif
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +	struct task_struct	*owner;
> +#endif
>   };
>
>   extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d3a788..1c582d1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1595,6 +1595,15 @@ config TRACEPOINTS
>
>   source "arch/Kconfig"
>
> +config RWSEM_SPIN_ON_WRITE_OWNER
> +	bool "Optimistic spin write acquisition for writer owned rw-sem"
> +	default n
> +	depends on SMP
> +	help
> +	  Allows a writer to perform optimistic spinning if another writer own
> +	  the read write semaphore.  This gives a greater chance for writer to
> +	  acquire a semaphore before blocking it and putting it to sleep.
> +
>   endmenu		# General setup
>
>   config HAVE_GENERIC_DMA_COHERENT
> diff --git a/kernel/rwsem.c b/kernel/rwsem.c
> index cfff143..a32990a 100644
> --- a/kernel/rwsem.c
> +++ b/kernel/rwsem.c
> @@ -12,6 +12,26 @@
>
>   #include <linux/atomic.h>
>
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> +{
> +	sem->owner = current;
> +}
> +
> +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> +{
> +	sem->owner = NULL;
> +}
> +#else
> +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> +{
> +}
> +
> +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> +{
> +}
> +#endif
> +
>   /*
>    * lock for reading
>    */
> @@ -48,6 +68,7 @@ void __sched down_write(struct rw_semaphore *sem)
>   	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
>
>   	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> +	rwsem_set_owner(sem);
>   }
>
>   EXPORT_SYMBOL(down_write);
> @@ -59,8 +80,10 @@ int down_write_trylock(struct rw_semaphore *sem)
>   {
>   	int ret = __down_write_trylock(sem);
>
> -	if (ret == 1)
> +	if (ret == 1) {
>   		rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
> +		rwsem_set_owner(sem);
> +	}
>   	return ret;
>   }
>
> @@ -86,6 +109,7 @@ void up_write(struct rw_semaphore *sem)
>   	rwsem_release(&sem->dep_map, 1, _RET_IP_);
>
>   	__up_write(sem);
> +	rwsem_clear_owner(sem);
>   }
>
>   EXPORT_SYMBOL(up_write);
> @@ -100,6 +124,7 @@ void downgrade_write(struct rw_semaphore *sem)
>   	 * dependency.
>   	 */
>   	__downgrade_write(sem);
> +	rwsem_clear_owner(sem);
>   }
>
>   EXPORT_SYMBOL(downgrade_write);
> @@ -122,6 +147,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
>   	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
>
>   	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> +	rwsem_set_owner(sem);
>   }
>
>   EXPORT_SYMBOL(_down_write_nest_lock);
> @@ -141,6 +167,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
>   	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
>
>   	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> +	rwsem_set_owner(sem);
>   }
>
>   EXPORT_SYMBOL(down_write_nested);
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index 2072af5..8e331c5 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -8,6 +8,7 @@
>    */
>   #include <linux/rwsem.h>
>   #include <linux/sched.h>
> +#include <linux/sched/rt.h>
>   #include <linux/init.h>
>   #include <linux/export.h>
>
> @@ -27,6 +28,9 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
>   	sem->count = RWSEM_UNLOCKED_VALUE;
>   	raw_spin_lock_init(&sem->wait_lock);
>   	INIT_LIST_HEAD(&sem->wait_list);
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +	sem->owner = NULL;
> +#endif
>   }
>
>   EXPORT_SYMBOL(__init_rwsem);
> @@ -192,6 +196,128 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
>   	return sem;
>   }
>
> +static inline int rwsem_try_write_lock(long count, bool need_lock,
> +	struct rw_semaphore *sem)
> +{
> +	if (!(count & RWSEM_ACTIVE_MASK)) {
> +		/* Try acquiring the write lock. */
> +		if (sem->count == RWSEM_WAITING_BIAS &&
> +		    cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
> +			    RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
> +			if (need_lock)
> +				raw_spin_lock_irq(&sem->wait_lock);
> +			if (!list_is_singular(&sem->wait_list))
> +				rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
> +			return 1;
> +		}
> +	}
> +	return 0;
> +}
> +
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> +{
> +	int retval = true;
> +
> +	/* Spin only if active writer running */
> +	if (!sem->owner)
> +		return false;
> +
> +	rcu_read_lock();
> +	if (sem->owner)
> +		retval = sem->owner->on_cpu;
                          ^^^^^^^^^^^^^^^^^^

Why is this a safe dereference? Could not another cpu have just
dropped the sem (and thus set sem->owner to NULL and oops)?


> +	rcu_read_unlock();
> +	/*
> +	 * if lock->owner is not set, the sem owner may have just acquired
> +	 * it and not set the owner yet, or the sem has been released, or
> +	 * reader active.
> +	 */
> +	return retval;
> +}
> +
> +static inline bool owner_running(struct rw_semaphore *lock,
> +				struct task_struct *owner)
> +{
> +	if (lock->owner != owner)
> +		return false;
> +
> +	/*
> +	 * Ensure we emit the owner->on_cpu, dereference _after_ checking
> +	 * lock->owner still matches owner, if that fails, owner might
> +	 * point to free()d memory, if it still matches, the rcu_read_lock()
> +	 * ensures the memory stays valid.
> +	 */

Again just my opinion, but kernel style is to prefer multi-line comments
in a function comment block.

> +	barrier();
> +
> +	return owner->on_cpu;
> +}
> +
> +static noinline
> +int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
> +{
> +	rcu_read_lock();
> +	while (owner_running(lock, owner)) {
> +		if (need_resched())
> +			break;
> +
> +		arch_mutex_cpu_relax();
> +	}
> +	rcu_read_unlock();
> +
> +	/*
> +	 * We break out the loop above on need_resched() and when the
> +	 * owner changed, which is a sign for heavy contention. Return
> +	 * success only when lock->owner is NULL.
> +	 */
> +	return lock->owner == NULL;
> +}
> +
> +int rwsem_optimistic_spin(struct rw_semaphore *sem)
> +{
> +	struct	task_struct	*owner;
> +	int	ret = 0;
> +
> +	/* sem->wait_lock should not be held when doing optimistic spinning */
> +	if (!rwsem_can_spin_on_owner(sem))
> +		return ret;
> +
> +	preempt_disable();
> +	for (;;) {
> +		owner = ACCESS_ONCE(sem->owner);
> +		if (owner && !rwsem_spin_on_owner(sem, owner))
> +			break;

Will this spin for full scheduler value on a reader-owned lock?

> +		/* wait_lock will be acquired if write_lock is obtained */
> +		if (rwsem_try_write_lock(sem->count, true, sem)) {
> +			ret = 1;
> +			goto out;
> +		}
> +
> +		/*
> +		 * When there's no owner, we might have preempted between the
                                                         ^^^^^^^^

Isn't pre-emption disabled?


> +		 * owner acquiring the lock and setting the owner field. If
> +		 * we're an RT task that will live-lock because we won't let
> +		 * the owner complete.
> +		 */
> +		if (!owner && (need_resched() || rt_task(current)))
> +			break;
> +
> +		/*
> +		 * The cpu_relax() call is a compiler barrier which forces
> +		 * everything in this loop to be re-loaded. We don't need
> +		 * memory barriers as we'll eventually observe the right
> +		 * values at the cost of a few extra spins.
> +		 */
> +		arch_mutex_cpu_relax();
> +
> +	}
> +
> +out:
> +	preempt_enable();
> +	return ret;
> +}
> +#endif
> +
>   /*
>    * wait until we successfully acquire the write lock
>    */
> @@ -200,6 +326,9 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
>   	long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
>   	struct rwsem_waiter waiter;
>   	struct task_struct *tsk = current;
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +	bool try_optimistic_spin = true;
> +#endif
>
>   	/* set up my own style of waitqueue */
>   	waiter.task = tsk;
> @@ -223,20 +352,17 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
>   	/* wait until we successfully acquire the lock */
>   	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>   	while (true) {
> -		if (!(count & RWSEM_ACTIVE_MASK)) {
> -			/* Try acquiring the write lock. */
> -			count = RWSEM_ACTIVE_WRITE_BIAS;
> -			if (!list_is_singular(&sem->wait_list))
> -				count += RWSEM_WAITING_BIAS;
> -
> -			if (sem->count == RWSEM_WAITING_BIAS &&
> -			    cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
> -							RWSEM_WAITING_BIAS)
> -				break;
> -		}
> +		if (rwsem_try_write_lock(count, false, sem))
> +			break;
>
>   		raw_spin_unlock_irq(&sem->wait_lock);
>
> +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> +		/* do optimistic spinning */
> +		if (try_optimistic_spin && rwsem_optimistic_spin(sem))
> +			break;
> +		try_optimistic_spin = false;
> +#endif
>   		/* Block until there are no active lockers. */
>   		do {
>   			schedule();

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-22  7:57   ` Peter Hurley
@ 2013-06-23 20:03     ` Davidlohr Bueso
  2013-06-24 17:11       ` Tim Chen
  2013-06-24 21:58     ` Tim Chen
  1 sibling, 1 reply; 25+ messages in thread
From: Davidlohr Bueso @ 2013-06-23 20:03 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Tim Chen, Alex Shi, Michel Lespinasse, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On Sat, 2013-06-22 at 03:57 -0400, Peter Hurley wrote:
> On 06/21/2013 07:51 PM, Tim Chen wrote:
> > Introduce in this patch optimistic spinning for writer lock
> > acquisition in read write semaphore.  The logic is
> > similar to the optimistic spinning in mutex but without
> > the MCS lock queueing of the spinner.  This provides a
> > better chance for a writer to acquire the lock before
> > being we block it and put it to sleep.
> 
> This is just my opinion but I'd rather read the justification
> here instead of referencing mutex logic that may or may not
> exist in 2 years.

We want to add optimistic spinning to rwsems because we've noticed that
the writer rwsem does not perform as well as mutexes. Tim noticed that
for exim (mail server) workloads, when reverting commit 4fc3f1d6 and I
noticed it when converting the i_mmap_mutex to a rwsem in some aim7
workloads. We've noticed that the biggest difference, in a nutshell, is
when we fail to acquire a mutex in the fastpath, optimistic spinning
comes in to play and we can avoid a large amount of unnecessary sleeping
and wait queue overhead.

For rwsems on the other hand, upon entering the writer slowpath in
rwsem_down_write_failed(), we just acquire the ->wait_lock, add
ourselves to the wait_queue and blocking until we get the lock.

Makes sense?

More information from the original thread:
https://lkml.org/lkml/2013/6/21/482

> 
> 
> > Disabling of pre-emption during optimistic spinning
> > was suggested by Davidlohr Bueso.  It
> > improved performance of aim7 for his test suite.
> >
> > Combined with the patch to avoid unnecesary cmpxchg,
> > in testing by Davidlohr Bueso on aim7 workloads
> > on 8 socket 80 cores system, he saw improvements of
> > alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > users when he implemented i_mmap as rwsem.
> >
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >   Makefile              |    2 +-
> >   include/linux/rwsem.h |    3 +
> >   init/Kconfig          |    9 +++
> >   kernel/rwsem.c        |   29 +++++++++-
> >   lib/rwsem.c           |  148 +++++++++++++++++++++++++++++++++++++++++++++----
> >   5 files changed, 178 insertions(+), 13 deletions(-)
> >
> > diff --git a/Makefile b/Makefile
> > index 49aa84b..7d1ef64 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -1,7 +1,7 @@
> >   VERSION = 3
> >   PATCHLEVEL = 10
> >   SUBLEVEL = 0
> > -EXTRAVERSION = -rc4
> > +EXTRAVERSION = -rc4-optspin4
> >   NAME = Unicycling Gorilla

This must obviously go.

> >
> >   # *DOCUMENTATION*
> > diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> > index 0616ffe..0c5933b 100644
> > --- a/include/linux/rwsem.h
> > +++ b/include/linux/rwsem.h
> > @@ -29,6 +29,9 @@ struct rw_semaphore {
> >   #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >   	struct lockdep_map	dep_map;
> >   #endif
> > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > +	struct task_struct	*owner;
> > +#endif
> >   };
> >
> >   extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 9d3a788..1c582d1 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1595,6 +1595,15 @@ config TRACEPOINTS
> >
> >   source "arch/Kconfig"
> >
> > +config RWSEM_SPIN_ON_WRITE_OWNER
> > +	bool "Optimistic spin write acquisition for writer owned rw-sem"
> > +	default n
> > +	depends on SMP
> > +	help
> > +	  Allows a writer to perform optimistic spinning if another writer own
> > +	  the read write semaphore.  This gives a greater chance for writer to
> > +	  acquire a semaphore before blocking it and putting it to sleep.
> > +
> >   endmenu		# General setup

Can't we just use CONFIG_SMP insted of adding a new Kconfig variable?

> >
> >   config HAVE_GENERIC_DMA_COHERENT
> > diff --git a/kernel/rwsem.c b/kernel/rwsem.c
> > index cfff143..a32990a 100644
> > --- a/kernel/rwsem.c
> > +++ b/kernel/rwsem.c
> > @@ -12,6 +12,26 @@
> >
> >   #include <linux/atomic.h>
> >
> > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> > +{
> > +	sem->owner = current;
> > +}
> > +
> > +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> > +{
> > +	sem->owner = NULL;
> > +}
> > +#else
> > +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> > +{
> > +}
> > +
> > +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> > +{
> > +}
> > +#endif
> > +
> >   /*
> >    * lock for reading
> >    */
> > @@ -48,6 +68,7 @@ void __sched down_write(struct rw_semaphore *sem)
> >   	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
> >
> >   	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> > +	rwsem_set_owner(sem);
> >   }
> >
> >   EXPORT_SYMBOL(down_write);
> > @@ -59,8 +80,10 @@ int down_write_trylock(struct rw_semaphore *sem)
> >   {
> >   	int ret = __down_write_trylock(sem);
> >
> > -	if (ret == 1)
> > +	if (ret == 1) {
> >   		rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
> > +		rwsem_set_owner(sem);
> > +	}
> >   	return ret;
> >   }
> >
> > @@ -86,6 +109,7 @@ void up_write(struct rw_semaphore *sem)
> >   	rwsem_release(&sem->dep_map, 1, _RET_IP_);
> >
> >   	__up_write(sem);
> > +	rwsem_clear_owner(sem);
> >   }
> >
> >   EXPORT_SYMBOL(up_write);
> > @@ -100,6 +124,7 @@ void downgrade_write(struct rw_semaphore *sem)
> >   	 * dependency.
> >   	 */
> >   	__downgrade_write(sem);
> > +	rwsem_clear_owner(sem);
> >   }
> >
> >   EXPORT_SYMBOL(downgrade_write);
> > @@ -122,6 +147,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
> >   	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
> >
> >   	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> > +	rwsem_set_owner(sem);
> >   }
> >
> >   EXPORT_SYMBOL(_down_write_nest_lock);
> > @@ -141,6 +167,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
> >   	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
> >
> >   	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> > +	rwsem_set_owner(sem);
> >   }
> >
> >   EXPORT_SYMBOL(down_write_nested);
> > diff --git a/lib/rwsem.c b/lib/rwsem.c
> > index 2072af5..8e331c5 100644
> > --- a/lib/rwsem.c
> > +++ b/lib/rwsem.c
> > @@ -8,6 +8,7 @@
> >    */
> >   #include <linux/rwsem.h>
> >   #include <linux/sched.h>
> > +#include <linux/sched/rt.h>
> >   #include <linux/init.h>
> >   #include <linux/export.h>
> >
> > @@ -27,6 +28,9 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
> >   	sem->count = RWSEM_UNLOCKED_VALUE;
> >   	raw_spin_lock_init(&sem->wait_lock);
> >   	INIT_LIST_HEAD(&sem->wait_list);
> > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > +	sem->owner = NULL;
> > +#endif
> >   }
> >
> >   EXPORT_SYMBOL(__init_rwsem);
> > @@ -192,6 +196,128 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
> >   	return sem;
> >   }
> >
> > +static inline int rwsem_try_write_lock(long count, bool need_lock,
> > +	struct rw_semaphore *sem)
> > +{
> > +	if (!(count & RWSEM_ACTIVE_MASK)) {
> > +		/* Try acquiring the write lock. */
> > +		if (sem->count == RWSEM_WAITING_BIAS &&
> > +		    cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
> > +			    RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
> > +			if (need_lock)
> > +				raw_spin_lock_irq(&sem->wait_lock);
> > +			if (!list_is_singular(&sem->wait_list))
> > +				rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
> > +			return 1;
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> > +{
> > +	int retval = true;
> > +
> > +	/* Spin only if active writer running */
> > +	if (!sem->owner)
> > +		return false;
> > +
> > +	rcu_read_lock();
> > +	if (sem->owner)
> > +		retval = sem->owner->on_cpu;
>                           ^^^^^^^^^^^^^^^^^^
> 
> Why is this a safe dereference? Could not another cpu have just
> dropped the sem (and thus set sem->owner to NULL and oops)?
> 
> 
> > +	rcu_read_unlock();
> > +	/*
> > +	 * if lock->owner is not set, the sem owner may have just acquired
> > +	 * it and not set the owner yet, or the sem has been released, or
> > +	 * reader active.
> > +	 */
> > +	return retval;
> > +}
> > +
> > +static inline bool owner_running(struct rw_semaphore *lock,
> > +				struct task_struct *owner)
> > +{
> > +	if (lock->owner != owner)
> > +		return false;
> > +
> > +	/*
> > +	 * Ensure we emit the owner->on_cpu, dereference _after_ checking
> > +	 * lock->owner still matches owner, if that fails, owner might
> > +	 * point to free()d memory, if it still matches, the rcu_read_lock()
> > +	 * ensures the memory stays valid.
> > +	 */
> 
> Again just my opinion, but kernel style is to prefer multi-line comments
> in a function comment block.
> 
> > +	barrier();
> > +
> > +	return owner->on_cpu;
> > +}

A lot of these functions are exact duplicates of kernel/mutex.c - we
should probably think of adding generic interfaces for mutexes and
rwsems...

> > +
> > +static noinline
> > +int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
> > +{
> > +	rcu_read_lock();
> > +	while (owner_running(lock, owner)) {
> > +		if (need_resched())
> > +			break;
> > +
> > +		arch_mutex_cpu_relax();
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	/*
> > +	 * We break out the loop above on need_resched() and when the
> > +	 * owner changed, which is a sign for heavy contention. Return
> > +	 * success only when lock->owner is NULL.
> > +	 */
> > +	return lock->owner == NULL;
> > +}

ditto

> > +
> > +int rwsem_optimistic_spin(struct rw_semaphore *sem)
> > +{
> > +	struct	task_struct	*owner;
> > +	int	ret = 0;
> > +
> > +	/* sem->wait_lock should not be held when doing optimistic spinning */
> > +	if (!rwsem_can_spin_on_owner(sem))
> > +		return ret;
> > +
> > +	preempt_disable();
> > +	for (;;) {
> > +		owner = ACCESS_ONCE(sem->owner);
> > +		if (owner && !rwsem_spin_on_owner(sem, owner))
> > +			break;
> 
> Will this spin for full scheduler value on a reader-owned lock?

Yep, it should.

> 
> > +		/* wait_lock will be acquired if write_lock is obtained */
> > +		if (rwsem_try_write_lock(sem->count, true, sem)) {
> > +			ret = 1;
> > +			goto out;
> > +		}
> > +
> > +		/*
> > +		 * When there's no owner, we might have preempted between the
>                                                          ^^^^^^^^
> 
> Isn't pre-emption disabled?

Hmm yeah, that might be a bogus comment.

> 
> 
> > +		 * owner acquiring the lock and setting the owner field. If
> > +		 * we're an RT task that will live-lock because we won't let
> > +		 * the owner complete.
> > +		 */
> > +		if (!owner && (need_resched() || rt_task(current)))
> > +			break;
> > +
> > +		/*
> > +		 * The cpu_relax() call is a compiler barrier which forces
> > +		 * everything in this loop to be re-loaded. We don't need
> > +		 * memory barriers as we'll eventually observe the right
> > +		 * values at the cost of a few extra spins.
> > +		 */
> > +		arch_mutex_cpu_relax();
> > +
> > +	}
> > +
> > +out:
> > +	preempt_enable();
> > +	return ret;
> > +}
> > +#endif
> > +
> >   /*
> >    * wait until we successfully acquire the write lock
> >    */
> > @@ -200,6 +326,9 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
> >   	long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
> >   	struct rwsem_waiter waiter;
> >   	struct task_struct *tsk = current;
> > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > +	bool try_optimistic_spin = true;
> > +#endif
> >
> >   	/* set up my own style of waitqueue */
> >   	waiter.task = tsk;
> > @@ -223,20 +352,17 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
> >   	/* wait until we successfully acquire the lock */
> >   	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> >   	while (true) {
> > -		if (!(count & RWSEM_ACTIVE_MASK)) {
> > -			/* Try acquiring the write lock. */
> > -			count = RWSEM_ACTIVE_WRITE_BIAS;
> > -			if (!list_is_singular(&sem->wait_list))
> > -				count += RWSEM_WAITING_BIAS;
> > -
> > -			if (sem->count == RWSEM_WAITING_BIAS &&
> > -			    cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
> > -							RWSEM_WAITING_BIAS)
> > -				break;
> > -		}
> > +		if (rwsem_try_write_lock(count, false, sem))
> > +			break;
> >
> >   		raw_spin_unlock_irq(&sem->wait_lock);
> >
> > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > +		/* do optimistic spinning */
> > +		if (try_optimistic_spin && rwsem_optimistic_spin(sem))
> > +			break;
> > +		try_optimistic_spin = false;
> > +#endif
> >   		/* Block until there are no active lockers. */
> >   		do {
> >   			schedule();
> 
> Regards,
> Peter Hurley
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-23 20:03     ` Davidlohr Bueso
@ 2013-06-24 17:11       ` Tim Chen
  2013-06-24 18:49         ` Peter Hurley
  0 siblings, 1 reply; 25+ messages in thread
From: Tim Chen @ 2013-06-24 17:11 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Hurley, Alex Shi, Michel Lespinasse, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On Sun, 2013-06-23 at 13:03 -0700, Davidlohr Bueso wrote:
> On Sat, 2013-06-22 at 03:57 -0400, Peter Hurley wrote:
> > On 06/21/2013 07:51 PM, Tim Chen wrote:
> > > Introduce in this patch optimistic spinning for writer lock
> > > acquisition in read write semaphore.  The logic is
> > > similar to the optimistic spinning in mutex but without
> > > the MCS lock queueing of the spinner.  This provides a
> > > better chance for a writer to acquire the lock before
> > > being we block it and put it to sleep.
> > 
> > This is just my opinion but I'd rather read the justification
> > here instead of referencing mutex logic that may or may not
> > exist in 2 years.
> 
> We want to add optimistic spinning to rwsems because we've noticed that
> the writer rwsem does not perform as well as mutexes. Tim noticed that
> for exim (mail server) workloads, when reverting commit 4fc3f1d6 and I
> noticed it when converting the i_mmap_mutex to a rwsem in some aim7
> workloads. We've noticed that the biggest difference, in a nutshell, is
> when we fail to acquire a mutex in the fastpath, optimistic spinning
> comes in to play and we can avoid a large amount of unnecessary sleeping
> and wait queue overhead.
> 
> For rwsems on the other hand, upon entering the writer slowpath in
> rwsem_down_write_failed(), we just acquire the ->wait_lock, add
> ourselves to the wait_queue and blocking until we get the lock.
> 
> Makes sense?
> 
> More information from the original thread:
> https://lkml.org/lkml/2013/6/21/482

Sounds good.  


> 
> > 
> > 
> > > Disabling of pre-emption during optimistic spinning
> > > was suggested by Davidlohr Bueso.  It
> > > improved performance of aim7 for his test suite.
> > >
> > > Combined with the patch to avoid unnecesary cmpxchg,
> > > in testing by Davidlohr Bueso on aim7 workloads
> > > on 8 socket 80 cores system, he saw improvements of
> > > alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> > > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > > users when he implemented i_mmap as rwsem.
> > >
> > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > ---
> > >   Makefile              |    2 +-
> > >   include/linux/rwsem.h |    3 +
> > >   init/Kconfig          |    9 +++
> > >   kernel/rwsem.c        |   29 +++++++++-
> > >   lib/rwsem.c           |  148 +++++++++++++++++++++++++++++++++++++++++++++----
> > >   5 files changed, 178 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/Makefile b/Makefile
> > > index 49aa84b..7d1ef64 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -1,7 +1,7 @@
> > >   VERSION = 3
> > >   PATCHLEVEL = 10
> > >   SUBLEVEL = 0
> > > -EXTRAVERSION = -rc4
> > > +EXTRAVERSION = -rc4-optspin4
> > >   NAME = Unicycling Gorilla
> 
> This must obviously go.

Yes.  Should not be there.

> 
> > >
> > >   # *DOCUMENTATION*
> > > diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> > > index 0616ffe..0c5933b 100644
> > > --- a/include/linux/rwsem.h
> > > +++ b/include/linux/rwsem.h
> > > @@ -29,6 +29,9 @@ struct rw_semaphore {
> > >   #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > >   	struct lockdep_map	dep_map;
> > >   #endif
> > > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > > +	struct task_struct	*owner;
> > > +#endif
> > >   };
> > >
> > >   extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
> > > diff --git a/init/Kconfig b/init/Kconfig
> > > index 9d3a788..1c582d1 100644
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -1595,6 +1595,15 @@ config TRACEPOINTS
> > >
> > >   source "arch/Kconfig"
> > >
> > > +config RWSEM_SPIN_ON_WRITE_OWNER
> > > +	bool "Optimistic spin write acquisition for writer owned rw-sem"
> > > +	default n
> > > +	depends on SMP
> > > +	help
> > > +	  Allows a writer to perform optimistic spinning if another writer own
> > > +	  the read write semaphore.  This gives a greater chance for writer to
> > > +	  acquire a semaphore before blocking it and putting it to sleep.
> > > +
> > >   endmenu		# General setup
> 
> Can't we just use CONFIG_SMP insted of adding a new Kconfig variable?

I am not comfortable to make the optimistic spinning of rw-sem a default
SMP config option yet.  I will like it to see more performance testing
in the tree.  I want the ability to turn it off easily.

> 
> > >
> > >   config HAVE_GENERIC_DMA_COHERENT
> > > diff --git a/kernel/rwsem.c b/kernel/rwsem.c
> > > index cfff143..a32990a 100644
> > > --- a/kernel/rwsem.c
> > > +++ b/kernel/rwsem.c
> > > @@ -12,6 +12,26 @@
> > >
> > >   #include <linux/atomic.h>
> > >
> > > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > > +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> > > +{
> > > +	sem->owner = current;
> > > +}
> > > +
> > > +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> > > +{
> > > +	sem->owner = NULL;
> > > +}
> > > +#else
> > > +static inline void rwsem_set_owner(struct rw_semaphore *sem)
> > > +{
> > > +}
> > > +
> > > +static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> > > +{
> > > +}
> > > +#endif
> > > +
> > >   /*
> > >    * lock for reading
> > >    */
> > > @@ -48,6 +68,7 @@ void __sched down_write(struct rw_semaphore *sem)
> > >   	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
> > >
> > >   	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> > > +	rwsem_set_owner(sem);
> > >   }
> > >
> > >   EXPORT_SYMBOL(down_write);
> > > @@ -59,8 +80,10 @@ int down_write_trylock(struct rw_semaphore *sem)
> > >   {
> > >   	int ret = __down_write_trylock(sem);
> > >
> > > -	if (ret == 1)
> > > +	if (ret == 1) {
> > >   		rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
> > > +		rwsem_set_owner(sem);
> > > +	}
> > >   	return ret;
> > >   }
> > >
> > > @@ -86,6 +109,7 @@ void up_write(struct rw_semaphore *sem)
> > >   	rwsem_release(&sem->dep_map, 1, _RET_IP_);
> > >
> > >   	__up_write(sem);
> > > +	rwsem_clear_owner(sem);
> > >   }
> > >
> > >   EXPORT_SYMBOL(up_write);
> > > @@ -100,6 +124,7 @@ void downgrade_write(struct rw_semaphore *sem)
> > >   	 * dependency.
> > >   	 */
> > >   	__downgrade_write(sem);
> > > +	rwsem_clear_owner(sem);
> > >   }
> > >
> > >   EXPORT_SYMBOL(downgrade_write);
> > > @@ -122,6 +147,7 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
> > >   	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
> > >
> > >   	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> > > +	rwsem_set_owner(sem);
> > >   }
> > >
> > >   EXPORT_SYMBOL(_down_write_nest_lock);
> > > @@ -141,6 +167,7 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
> > >   	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
> > >
> > >   	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
> > > +	rwsem_set_owner(sem);
> > >   }
> > >
> > >   EXPORT_SYMBOL(down_write_nested);
> > > diff --git a/lib/rwsem.c b/lib/rwsem.c
> > > index 2072af5..8e331c5 100644
> > > --- a/lib/rwsem.c
> > > +++ b/lib/rwsem.c
> > > @@ -8,6 +8,7 @@
> > >    */
> > >   #include <linux/rwsem.h>
> > >   #include <linux/sched.h>
> > > +#include <linux/sched/rt.h>
> > >   #include <linux/init.h>
> > >   #include <linux/export.h>
> > >
> > > @@ -27,6 +28,9 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
> > >   	sem->count = RWSEM_UNLOCKED_VALUE;
> > >   	raw_spin_lock_init(&sem->wait_lock);
> > >   	INIT_LIST_HEAD(&sem->wait_list);
> > > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > > +	sem->owner = NULL;
> > > +#endif
> > >   }
> > >
> > >   EXPORT_SYMBOL(__init_rwsem);
> > > @@ -192,6 +196,128 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
> > >   	return sem;
> > >   }
> > >
> > > +static inline int rwsem_try_write_lock(long count, bool need_lock,
> > > +	struct rw_semaphore *sem)
> > > +{
> > > +	if (!(count & RWSEM_ACTIVE_MASK)) {
> > > +		/* Try acquiring the write lock. */
> > > +		if (sem->count == RWSEM_WAITING_BIAS &&
> > > +		    cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
> > > +			    RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) {
> > > +			if (need_lock)
> > > +				raw_spin_lock_irq(&sem->wait_lock);
> > > +			if (!list_is_singular(&sem->wait_list))
> > > +				rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
> > > +			return 1;
> > > +		}
> > > +	}
> > > +	return 0;
> > > +}
> > > +
> > > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > > +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> > > +{
> > > +	int retval = true;
> > > +
> > > +	/* Spin only if active writer running */
> > > +	if (!sem->owner)
> > > +		return false;
> > > +
> > > +	rcu_read_lock();
> > > +	if (sem->owner)
> > > +		retval = sem->owner->on_cpu;
> >                           ^^^^^^^^^^^^^^^^^^
> > 
> > Why is this a safe dereference? Could not another cpu have just
> > dropped the sem (and thus set sem->owner to NULL and oops)?
> > 

The rcu read lock should protect against sem->owner being NULL.

> > 
> > > +	rcu_read_unlock();
> > > +	/*
> > > +	 * if lock->owner is not set, the sem owner may have just acquired
> > > +	 * it and not set the owner yet, or the sem has been released, or
> > > +	 * reader active.
> > > +	 */
> > > +	return retval;
> > > +}
> > > +
> > > +static inline bool owner_running(struct rw_semaphore *lock,
> > > +				struct task_struct *owner)
> > > +{
> > > +	if (lock->owner != owner)
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * Ensure we emit the owner->on_cpu, dereference _after_ checking
> > > +	 * lock->owner still matches owner, if that fails, owner might
> > > +	 * point to free()d memory, if it still matches, the rcu_read_lock()
> > > +	 * ensures the memory stays valid.
> > > +	 */
> > 
> > Again just my opinion, but kernel style is to prefer multi-line comments
> > in a function comment block.
> > 
> > > +	barrier();
> > > +
> > > +	return owner->on_cpu;
> > > +}
> 
> A lot of these functions are exact duplicates of kernel/mutex.c - we
> should probably think of adding generic interfaces for mutexes and
> rwsems...

Probably there are pros and cons.  The cons is the logic are 
similar but may not be exact duplicate if later on we are 
adding rw-sem specific tweaks.  May be cleaner to keep the two 
separated.

> 
> > > +
> > > +static noinline
> > > +int rwsem_spin_on_owner(struct rw_semaphore *lock, struct task_struct *owner)
> > > +{
> > > +	rcu_read_lock();
> > > +	while (owner_running(lock, owner)) {
> > > +		if (need_resched())
> > > +			break;
> > > +
> > > +		arch_mutex_cpu_relax();
> > > +	}
> > > +	rcu_read_unlock();
> > > +
> > > +	/*
> > > +	 * We break out the loop above on need_resched() and when the
> > > +	 * owner changed, which is a sign for heavy contention. Return
> > > +	 * success only when lock->owner is NULL.
> > > +	 */
> > > +	return lock->owner == NULL;
> > > +}
> 
> ditto
> 
> > > +
> > > +int rwsem_optimistic_spin(struct rw_semaphore *sem)
> > > +{
> > > +	struct	task_struct	*owner;
> > > +	int	ret = 0;
> > > +
> > > +	/* sem->wait_lock should not be held when doing optimistic spinning */
> > > +	if (!rwsem_can_spin_on_owner(sem))
> > > +		return ret;
> > > +
> > > +	preempt_disable();
> > > +	for (;;) {
> > > +		owner = ACCESS_ONCE(sem->owner);
> > > +		if (owner && !rwsem_spin_on_owner(sem, owner))
> > > +			break;
> > 
> > Will this spin for full scheduler value on a reader-owned lock?
> 
> Yep, it should.

No. We will spin only on writer-owned lock in the current version.  
The owner field is only set when a writer own it.  Spinning on reader 
is tricky as there could be multiple readers.  Earlier Davidlohr and
I have privately tested a different version where we allowed the 
owner field to be set by reader but only get cleared if 
sem->owner == current (suggested by Matthew Wilcox).  However we 
didn't get a performance boost so I did not include this.

> 
> > 
> > > +		/* wait_lock will be acquired if write_lock is obtained */
> > > +		if (rwsem_try_write_lock(sem->count, true, sem)) {
> > > +			ret = 1;
> > > +			goto out;
> > > +		}
> > > +
> > > +		/*
> > > +		 * When there's no owner, we might have preempted between the
> >                                                          ^^^^^^^^
> > 
> > Isn't pre-emption disabled?
> 
> Hmm yeah, that might be a bogus comment.

That's true. My original change didn't have pre-emption disabled so this
slipped through.

> 
> > 
> > 
> > > +		 * owner acquiring the lock and setting the owner field. If
> > > +		 * we're an RT task that will live-lock because we won't let
> > > +		 * the owner complete.
> > > +		 */
> > > +		if (!owner && (need_resched() || rt_task(current)))
> > > +			break;
> > > +
> > > +		/*
> > > +		 * The cpu_relax() call is a compiler barrier which forces
> > > +		 * everything in this loop to be re-loaded. We don't need
> > > +		 * memory barriers as we'll eventually observe the right
> > > +		 * values at the cost of a few extra spins.
> > > +		 */
> > > +		arch_mutex_cpu_relax();
> > > +
> > > +	}
> > > +
> > > +out:
> > > +	preempt_enable();
> > > +	return ret;
> > > +}
> > > +#endif
> > > +
> > >   /*
> > >    * wait until we successfully acquire the write lock
> > >    */
> > > @@ -200,6 +326,9 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
> > >   	long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
> > >   	struct rwsem_waiter waiter;
> > >   	struct task_struct *tsk = current;
> > > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > > +	bool try_optimistic_spin = true;
> > > +#endif
> > >
> > >   	/* set up my own style of waitqueue */
> > >   	waiter.task = tsk;
> > > @@ -223,20 +352,17 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
> > >   	/* wait until we successfully acquire the lock */
> > >   	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> > >   	while (true) {
> > > -		if (!(count & RWSEM_ACTIVE_MASK)) {
> > > -			/* Try acquiring the write lock. */
> > > -			count = RWSEM_ACTIVE_WRITE_BIAS;
> > > -			if (!list_is_singular(&sem->wait_list))
> > > -				count += RWSEM_WAITING_BIAS;
> > > -
> > > -			if (sem->count == RWSEM_WAITING_BIAS &&
> > > -			    cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
> > > -							RWSEM_WAITING_BIAS)
> > > -				break;
> > > -		}
> > > +		if (rwsem_try_write_lock(count, false, sem))
> > > +			break;
> > >
> > >   		raw_spin_unlock_irq(&sem->wait_lock);
> > >
> > > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > > +		/* do optimistic spinning */
> > > +		if (try_optimistic_spin && rwsem_optimistic_spin(sem))
> > > +			break;
> > > +		try_optimistic_spin = false;
> > > +#endif
> > >   		/* Block until there are no active lockers. */
> > >   		do {
> > >   			schedule();
> > 
> > Regards,
> > Peter Hurley
> > 
> 
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-24 17:11       ` Tim Chen
@ 2013-06-24 18:49         ` Peter Hurley
  2013-06-24 19:13           ` Tim Chen
  2013-06-24 20:17           ` Tim Chen
  0 siblings, 2 replies; 25+ messages in thread
From: Peter Hurley @ 2013-06-24 18:49 UTC (permalink / raw)
  To: Tim Chen
  Cc: Davidlohr Bueso, Alex Shi, Michel Lespinasse, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On 06/24/2013 01:11 PM, Tim Chen wrote:
> On Sun, 2013-06-23 at 13:03 -0700, Davidlohr Bueso wrote:
>> On Sat, 2013-06-22 at 03:57 -0400, Peter Hurley wrote:
>>> On 06/21/2013 07:51 PM, Tim Chen wrote:
>>>>
>>>> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>>>> +{
>>>> +	int retval = true;
>>>> +
>>>> +	/* Spin only if active writer running */
>>>> +	if (!sem->owner)
>>>> +		return false;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	if (sem->owner)
>>>> +		retval = sem->owner->on_cpu;
>>>                            ^^^^^^^^^^^^^^^^^^
>>>
>>> Why is this a safe dereference? Could not another cpu have just
>>> dropped the sem (and thus set sem->owner to NULL and oops)?
>>>
>
> The rcu read lock should protect against sem->owner being NULL.

It doesn't.

Here's the comment from mutex_spin_on_owner():

   /*
    * Look out! "owner" is an entirely speculative pointer
    * access and not reliable.
    */

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-24 18:49         ` Peter Hurley
@ 2013-06-24 19:13           ` Tim Chen
  2013-06-24 20:32             ` Peter Hurley
  2013-06-24 20:17           ` Tim Chen
  1 sibling, 1 reply; 25+ messages in thread
From: Tim Chen @ 2013-06-24 19:13 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Davidlohr Bueso, Alex Shi, Michel Lespinasse, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On Mon, 2013-06-24 at 14:49 -0400, Peter Hurley wrote:
> On 06/24/2013 01:11 PM, Tim Chen wrote:
> > On Sun, 2013-06-23 at 13:03 -0700, Davidlohr Bueso wrote:
> >> On Sat, 2013-06-22 at 03:57 -0400, Peter Hurley wrote:
> >>> On 06/21/2013 07:51 PM, Tim Chen wrote:
> >>>>
> >>>> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> >>>> +{
> >>>> +	int retval = true;
> >>>> +
> >>>> +	/* Spin only if active writer running */
> >>>> +	if (!sem->owner)
> >>>> +		return false;
> >>>> +
> >>>> +	rcu_read_lock();
> >>>> +	if (sem->owner)
> >>>> +		retval = sem->owner->on_cpu;
> >>>                            ^^^^^^^^^^^^^^^^^^
> >>>
> >>> Why is this a safe dereference? Could not another cpu have just
> >>> dropped the sem (and thus set sem->owner to NULL and oops)?
> >>>
> >
> > The rcu read lock should protect against sem->owner being NULL.
> 
> It doesn't.
> 
> Here's the comment from mutex_spin_on_owner():
> 
>    /*
>     * Look out! "owner" is an entirely speculative pointer
>     * access and not reliable.
>     */
> 

In mutex_spin_on_owner, after rcu_read_lock, the owner_running()
function de-references the owner pointer.  The rcu_read_lock prevents
owner from getting freed. The comment's intention is to warn that
owner->on_cpu may not be reliable.

I'm using similar logic in rw-sem.

Tim




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-24 19:13           ` Tim Chen
@ 2013-06-24 20:32             ` Peter Hurley
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Hurley @ 2013-06-24 20:32 UTC (permalink / raw)
  To: Tim Chen
  Cc: Davidlohr Bueso, Alex Shi, Michel Lespinasse, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On 06/24/2013 03:13 PM, Tim Chen wrote:
> On Mon, 2013-06-24 at 14:49 -0400, Peter Hurley wrote:
>> On 06/24/2013 01:11 PM, Tim Chen wrote:
>>> On Sun, 2013-06-23 at 13:03 -0700, Davidlohr Bueso wrote:
>>>> On Sat, 2013-06-22 at 03:57 -0400, Peter Hurley wrote:
>>>>> On 06/21/2013 07:51 PM, Tim Chen wrote:
>>>>>>
>>>>>> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>>>>>> +{
>>>>>> +	int retval = true;
>>>>>> +
>>>>>> +	/* Spin only if active writer running */
>>>>>> +	if (!sem->owner)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	rcu_read_lock();
>>>>>> +	if (sem->owner)
>>>>>> +		retval = sem->owner->on_cpu;
>>>>>                             ^^^^^^^^^^^^^^^^^^
>>>>>
>>>>> Why is this a safe dereference? Could not another cpu have just
>>>>> dropped the sem (and thus set sem->owner to NULL and oops)?
>>>>>
>>>
>>> The rcu read lock should protect against sem->owner being NULL.
>>
>> It doesn't.
>>
>> Here's the comment from mutex_spin_on_owner():
>>
>>     /*
>>      * Look out! "owner" is an entirely speculative pointer
>>      * access and not reliable.
>>      */
>>
>
> In mutex_spin_on_owner, after rcu_read_lock, the owner_running()
> function de-references the owner pointer.

Only after establishing the following preconditions:
1. snapshot of owner is non-NULL
2. mutex->owner == snapshot owner
3. memory holding mutex has not been freed (that's what the
    rcu_read_lock() is for)

Only then is the owner dereferenced and only through the snapshot
(not the now-possibly-rewritten sem->owner).

> I'm using similar logic in rw-sem.

With crucial details absent.

Regards,
Peter Hurley



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-24 18:49         ` Peter Hurley
  2013-06-24 19:13           ` Tim Chen
@ 2013-06-24 20:17           ` Tim Chen
  2013-06-24 20:48             ` Peter Hurley
  2013-06-25  7:37             ` Peter Zijlstra
  1 sibling, 2 replies; 25+ messages in thread
From: Tim Chen @ 2013-06-24 20:17 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Davidlohr Bueso, Alex Shi, Michel Lespinasse, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On Mon, 2013-06-24 at 14:49 -0400, Peter Hurley wrote:
> On 06/24/2013 01:11 PM, Tim Chen wrote:
> > On Sun, 2013-06-23 at 13:03 -0700, Davidlohr Bueso wrote:
> >> On Sat, 2013-06-22 at 03:57 -0400, Peter Hurley wrote:
> >>> On 06/21/2013 07:51 PM, Tim Chen wrote:
> >>>>
> >>>> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> >>>> +{
> >>>> +	int retval = true;
> >>>> +
> >>>> +	/* Spin only if active writer running */
> >>>> +	if (!sem->owner)
> >>>> +		return false;
> >>>> +
> >>>> +	rcu_read_lock();
> >>>> +	if (sem->owner)
> >>>> +		retval = sem->owner->on_cpu;
> >>>                            ^^^^^^^^^^^^^^^^^^
> >>>
> >>> Why is this a safe dereference? Could not another cpu have just
> >>> dropped the sem (and thus set sem->owner to NULL and oops)?
> >>>
> >
> > The rcu read lock should protect against sem->owner being NULL.
> 
> It doesn't.
> 
> Here's the comment from mutex_spin_on_owner():
> 
>    /*
>     * Look out! "owner" is an entirely speculative pointer
>     * access and not reliable.
>     */

On second thought, I agree with you.  I should change this to
something like

	int retval = true;
	task_struct *sem_owner;

	/* Spin only if active writer running */
	if (!sem->owner)
		return false;

	rcu_read_lock();
	sem_owner = sem->owner;
	if (sem_owner)
		retval = sem_owner->on_cpu;

Thanks.

Tim



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-24 20:17           ` Tim Chen
@ 2013-06-24 20:48             ` Peter Hurley
  2013-06-24 21:30               ` Tim Chen
  2013-06-25  7:37             ` Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Hurley @ 2013-06-24 20:48 UTC (permalink / raw)
  To: Tim Chen
  Cc: Davidlohr Bueso, Alex Shi, Michel Lespinasse, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On 06/24/2013 04:17 PM, Tim Chen wrote:
> On Mon, 2013-06-24 at 14:49 -0400, Peter Hurley wrote:
>> On 06/24/2013 01:11 PM, Tim Chen wrote:
>>> On Sun, 2013-06-23 at 13:03 -0700, Davidlohr Bueso wrote:
>>>> On Sat, 2013-06-22 at 03:57 -0400, Peter Hurley wrote:
>>>>> On 06/21/2013 07:51 PM, Tim Chen wrote:
>>>>>>
>>>>>> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>>>>>> +{
>>>>>> +	int retval = true;
>>>>>> +
>>>>>> +	/* Spin only if active writer running */
>>>>>> +	if (!sem->owner)
>>>>>> +		return false;
>>>>>> +
>>>>>> +	rcu_read_lock();
>>>>>> +	if (sem->owner)
>>>>>> +		retval = sem->owner->on_cpu;
>>>>>                             ^^^^^^^^^^^^^^^^^^
>>>>>
>>>>> Why is this a safe dereference? Could not another cpu have just
>>>>> dropped the sem (and thus set sem->owner to NULL and oops)?
>>>>>
>>>
>>> The rcu read lock should protect against sem->owner being NULL.
>>
>> It doesn't.
>>
>> Here's the comment from mutex_spin_on_owner():
>>
>>     /*
>>      * Look out! "owner" is an entirely speculative pointer
>>      * access and not reliable.
>>      */
>
> On second thought, I agree with you.  I should change this to
> something like
>
> 	int retval = true;
> 	task_struct *sem_owner;
>
> 	/* Spin only if active writer running */
> 	if (!sem->owner)
> 		return false;
>
> 	rcu_read_lock();
> 	sem_owner = sem->owner;
> 	if (sem_owner)
> 		retval = sem_owner->on_cpu;
>

Our emails passed each other.

Also, I haven't given a lot of thought to if preemption must be disabled
before calling rwsem_can_spin_on_owner(). If so, wouldn't you just drop
rwsem_can_spin_on_owner() (because the conditions tested in the loop are
equivalent)?

Regards,
Peter Hurley



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-24 20:48             ` Peter Hurley
@ 2013-06-24 21:30               ` Tim Chen
  0 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2013-06-24 21:30 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Davidlohr Bueso, Alex Shi, Michel Lespinasse, Ingo Molnar,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On Mon, 2013-06-24 at 16:48 -0400, Peter Hurley wrote:

> 
> Also, I haven't given a lot of thought to if preemption must be disabled
> before calling rwsem_can_spin_on_owner(). If so, wouldn't you just drop
> rwsem_can_spin_on_owner() (because the conditions tested in the loop are
> equivalent)?
> 

Not totally equivalent.  If we drop the call to rwsem_can_spin_on_owner,
we will spin when readers are holding the lock (owner is null).  
Right now we only allow writers to spin when other writers are 
holding the lock by adding the rwsem_can_spin_on_owner check.  
Letting spinning on readers held lock is tricky as
we could have a reader that sleeps and if we don't detect the case.  We
could spin for too long.

Thanks.

Tim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-24 20:17           ` Tim Chen
  2013-06-24 20:48             ` Peter Hurley
@ 2013-06-25  7:37             ` Peter Zijlstra
  2013-06-25 16:00               ` Tim Chen
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2013-06-25  7:37 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Hurley, Davidlohr Bueso, Alex Shi, Michel Lespinasse,
	Ingo Molnar, Andrew Morton, Andrea Arcangeli, Andi Kleen,
	Matthew R Wilcox, Dave Hansen, Rik van Riel, linux-kernel,
	linux-mm

On Mon, Jun 24, 2013 at 01:17:45PM -0700, Tim Chen wrote:
> On second thought, I agree with you.  I should change this to
> something like
> 
> 	int retval = true;
> 	task_struct *sem_owner;
> 
> 	/* Spin only if active writer running */
> 	if (!sem->owner)
> 		return false;
> 
> 	rcu_read_lock();
> 	sem_owner = sem->owner;

That should be: sem_owner = ACCESS_ONCE(sem->owner); to make sure the
compiler doesn't try and be clever and rereads.

> 	if (sem_owner)
> 		retval = sem_owner->on_cpu;
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-25  7:37             ` Peter Zijlstra
@ 2013-06-25 16:00               ` Tim Chen
  0 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2013-06-25 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Peter Hurley, Davidlohr Bueso, Alex Shi, Michel Lespinasse,
	Ingo Molnar, Andrew Morton, Andrea Arcangeli, Andi Kleen,
	Matthew R Wilcox, Dave Hansen, Rik van Riel, linux-kernel,
	linux-mm

On Tue, 2013-06-25 at 09:37 +0200, Peter Zijlstra wrote:
> On Mon, Jun 24, 2013 at 01:17:45PM -0700, Tim Chen wrote:
> > On second thought, I agree with you.  I should change this to
> > something like
> > 
> > 	int retval = true;
> > 	task_struct *sem_owner;
> > 
> > 	/* Spin only if active writer running */
> > 	if (!sem->owner)
> > 		return false;
> > 
> > 	rcu_read_lock();
> > 	sem_owner = sem->owner;
> 
> That should be: sem_owner = ACCESS_ONCE(sem->owner); to make sure the
> compiler doesn't try and be clever and rereads.

Thanks.  Will incorporate this in next version of the patch.

Tim



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-22  7:57   ` Peter Hurley
  2013-06-23 20:03     ` Davidlohr Bueso
@ 2013-06-24 21:58     ` Tim Chen
  2013-06-24 22:08       ` Peter Hurley
  1 sibling, 1 reply; 25+ messages in thread
From: Tim Chen @ 2013-06-24 21:58 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Alex Shi, Michel Lespinasse, Ingo Molnar, Andrew Morton,
	Andrea Arcangeli, Andi Kleen, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On Sat, 2013-06-22 at 03:57 -0400, Peter Hurley wrote:
> Will this spin for full scheduler value on a reader-owned lock?
> 
> > +		/* wait_lock will be acquired if write_lock is obtained */
> > +		if (rwsem_try_write_lock(sem->count, true, sem)) {
> > +			ret = 1;
> > +			goto out;
> > +		}
> > +
> > +		/*
> > +		 * When there's no owner, we might have preempted between the
>                                                          ^^^^^^^^
> 
> Isn't pre-emption disabled?
> 

Peter, on further review, this code is needed.  This code guard against 
the case of this thread preempting another thread in the middle
of setting the  owner field.  Disabling preemption does not prevent this
thread from preempting others, even though others cannot preempt 
this thread.


> 
> > +		 * owner acquiring the lock and setting the owner field. If
> > +		 * we're an RT task that will live-lock because we won't let
> > +		 * the owner complete.
> > +		 */
> > +		if (!owner && (need_resched() || rt_task(current)))
> > +			break;
> > +
> > +		/*
> > +		 * The cpu_relax() call is a compiler barrier which forces
> > +		 * everything in this loop to be re-loaded. We don't need
> > +		 * memory barriers as we'll eventually observe the right
> > +		 * values at the cost of a few extra spins.
> > +		 */
> > +		arch_mutex_cpu_relax();
> > +
> > +	}
> > +
> > +out:
> > +	preempt_enable();
> > +	return ret;
> > +}
> > +#endif
> > +
> >   /*
> >    * wait until we successfully acquire the write lock
> >    */
> > @@ -200,6 +326,9 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
> >   	long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
> >   	struct rwsem_waiter waiter;
> >   	struct task_struct *tsk = current;
> > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > +	bool try_optimistic_spin = true;
> > +#endif
> >
> >   	/* set up my own style of waitqueue */
> >   	waiter.task = tsk;
> > @@ -223,20 +352,17 @@ struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
> >   	/* wait until we successfully acquire the lock */
> >   	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> >   	while (true) {
> > -		if (!(count & RWSEM_ACTIVE_MASK)) {
> > -			/* Try acquiring the write lock. */
> > -			count = RWSEM_ACTIVE_WRITE_BIAS;
> > -			if (!list_is_singular(&sem->wait_list))
> > -				count += RWSEM_WAITING_BIAS;
> > -
> > -			if (sem->count == RWSEM_WAITING_BIAS &&
> > -			    cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
> > -							RWSEM_WAITING_BIAS)
> > -				break;
> > -		}
> > +		if (rwsem_try_write_lock(count, false, sem))
> > +			break;
> >
> >   		raw_spin_unlock_irq(&sem->wait_lock);
> >
> > +#ifdef CONFIG_RWSEM_SPIN_ON_WRITE_OWNER
> > +		/* do optimistic spinning */
> > +		if (try_optimistic_spin && rwsem_optimistic_spin(sem))
> > +			break;
> > +		try_optimistic_spin = false;
> > +#endif
> >   		/* Block until there are no active lockers. */
> >   		do {
> >   			schedule();
> 

Thanks.

Tim


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-24 21:58     ` Tim Chen
@ 2013-06-24 22:08       ` Peter Hurley
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Hurley @ 2013-06-24 22:08 UTC (permalink / raw)
  To: Tim Chen
  Cc: Alex Shi, Michel Lespinasse, Ingo Molnar, Andrew Morton,
	Andrea Arcangeli, Andi Kleen, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Peter Zijlstra, Rik van Riel, linux-kernel,
	linux-mm

On 06/24/2013 05:58 PM, Tim Chen wrote:
> On Sat, 2013-06-22 at 03:57 -0400, Peter Hurley wrote:
>> Will this spin for full scheduler value on a reader-owned lock?
>>
>>> +		/* wait_lock will be acquired if write_lock is obtained */
>>> +		if (rwsem_try_write_lock(sem->count, true, sem)) {
>>> +			ret = 1;
>>> +			goto out;
>>> +		}
>>> +
>>> +		/*
>>> +		 * When there's no owner, we might have preempted between the
>>                                                           ^^^^^^^^
>>
>> Isn't pre-emption disabled?
>>
>
> Peter, on further review, this code is needed.  This code guard against
> the case of this thread preempting another thread in the middle
> of setting the  owner field.  Disabling preemption does not prevent this
> thread from preempting others, even though others cannot preempt
> this thread.

Yep; so the "we" in the quoted comment really refers to another thread
executing down_write_xxxx().

Thanks for the clarification.

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-21 23:51 ` [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition Tim Chen
  2013-06-22  0:00   ` Davidlohr Bueso
  2013-06-22  7:57   ` Peter Hurley
@ 2013-06-24  8:46   ` Peter Zijlstra
  2013-06-24 16:36     ` Tim Chen
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2013-06-24  8:46 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, Alex Shi,
	Andi Kleen, Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Rik van Riel, linux-kernel, linux-mm

On Fri, Jun 21, 2013 at 04:51:40PM -0700, Tim Chen wrote:
> Introduce in this patch optimistic spinning for writer lock
> acquisition in read write semaphore.  The logic is
> similar to the optimistic spinning in mutex but without
> the MCS lock queueing of the spinner.  This provides a
> better chance for a writer to acquire the lock before
> being we block it and put it to sleep.
> 
> Disabling of pre-emption during optimistic spinning
> was suggested by Davidlohr Bueso.  It
> improved performance of aim7 for his test suite.
> 
> Combined with the patch to avoid unnecesary cmpxchg,
> in testing by Davidlohr Bueso on aim7 workloads
> on 8 socket 80 cores system, he saw improvements of
> alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> (+5%), shared (+15%) and short (+4%), most of them after around 500
> users when he implemented i_mmap as rwsem.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  Makefile              |    2 +-
>  include/linux/rwsem.h |    3 +
>  init/Kconfig          |    9 +++
>  kernel/rwsem.c        |   29 +++++++++-
>  lib/rwsem.c           |  148 +++++++++++++++++++++++++++++++++++++++++++++----
>  5 files changed, 178 insertions(+), 13 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index 49aa84b..7d1ef64 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1,7 +1,7 @@
>  VERSION = 3
>  PATCHLEVEL = 10
>  SUBLEVEL = 0
> -EXTRAVERSION = -rc4
> +EXTRAVERSION = -rc4-optspin4
>  NAME = Unicycling Gorilla
>  
>  # *DOCUMENTATION*

I'm fairly sure we don't want to commit this hunk ;-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition
  2013-06-24  8:46   ` Peter Zijlstra
@ 2013-06-24 16:36     ` Tim Chen
  0 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2013-06-24 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Andrea Arcangeli, Alex Shi,
	Andi Kleen, Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
	Dave Hansen, Rik van Riel, linux-kernel, linux-mm

On Mon, 2013-06-24 at 10:46 +0200, Peter Zijlstra wrote:
> On Fri, Jun 21, 2013 at 04:51:40PM -0700, Tim Chen wrote:
> > Introduce in this patch optimistic spinning for writer lock
> > acquisition in read write semaphore.  The logic is
> > similar to the optimistic spinning in mutex but without
> > the MCS lock queueing of the spinner.  This provides a
> > better chance for a writer to acquire the lock before
> > being we block it and put it to sleep.
> > 
> > Disabling of pre-emption during optimistic spinning
> > was suggested by Davidlohr Bueso.  It
> > improved performance of aim7 for his test suite.
> > 
> > Combined with the patch to avoid unnecesary cmpxchg,
> > in testing by Davidlohr Bueso on aim7 workloads
> > on 8 socket 80 cores system, he saw improvements of
> > alltests (+14.5%), custom (+17%), disk (+11%), high_systime
> > (+5%), shared (+15%) and short (+4%), most of them after around 500
> > users when he implemented i_mmap as rwsem.
> > 
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >  Makefile              |    2 +-
> >  include/linux/rwsem.h |    3 +
> >  init/Kconfig          |    9 +++
> >  kernel/rwsem.c        |   29 +++++++++-
> >  lib/rwsem.c           |  148 +++++++++++++++++++++++++++++++++++++++++++++----
> >  5 files changed, 178 insertions(+), 13 deletions(-)
> > 
> > diff --git a/Makefile b/Makefile
> > index 49aa84b..7d1ef64 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -1,7 +1,7 @@
> >  VERSION = 3
> >  PATCHLEVEL = 10
> >  SUBLEVEL = 0
> > -EXTRAVERSION = -rc4
> > +EXTRAVERSION = -rc4-optspin4
> >  NAME = Unicycling Gorilla
> >  
> >  # *DOCUMENTATION*
> 
> I'm fairly sure we don't want to commit this hunk ;-)

Fat fingers.  Thanks for catching.

Tim


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2013-06-25 16:00 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1371855277.git.tim.c.chen@linux.intel.com>
2013-06-21 23:51 ` [PATCH 1/2] rwsem: check the lock before cpmxchg in down_write_trylock and rwsem_do_wake Tim Chen
2013-06-22  0:10   ` Alex Shi
2013-06-22  0:15     ` Davidlohr Bueso
2013-06-24 16:34     ` Tim Chen
2013-06-22  7:21   ` Peter Hurley
2013-06-23  1:16     ` Alex Shi
2013-06-23  5:10       ` Andi Kleen
2013-06-23 11:52         ` Alex Shi
2013-06-21 23:51 ` [PATCH 2/2] rwsem: do optimistic spinning for writer lock acquisition Tim Chen
2013-06-22  0:00   ` Davidlohr Bueso
2013-06-22  7:57   ` Peter Hurley
2013-06-23 20:03     ` Davidlohr Bueso
2013-06-24 17:11       ` Tim Chen
2013-06-24 18:49         ` Peter Hurley
2013-06-24 19:13           ` Tim Chen
2013-06-24 20:32             ` Peter Hurley
2013-06-24 20:17           ` Tim Chen
2013-06-24 20:48             ` Peter Hurley
2013-06-24 21:30               ` Tim Chen
2013-06-25  7:37             ` Peter Zijlstra
2013-06-25 16:00               ` Tim Chen
2013-06-24 21:58     ` Tim Chen
2013-06-24 22:08       ` Peter Hurley
2013-06-24  8:46   ` Peter Zijlstra
2013-06-24 16:36     ` Tim Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).