* [PATCH v3 0/2] locking/qrwlock: More optimizations in qrwlock
@ 2015-06-15 22:24 Waiman Long
  2015-06-15 22:24 ` [PATCH v3 1/2] locking/qrwlock: Better optimization for interrupt context readers Waiman Long
  2015-06-15 22:24 ` [PATCH v3 2/2] locking/qrwlock: Don't contend with readers when setting _QW_WAITING Waiman Long
  0 siblings, 2 replies; 9+ messages in thread
From: Waiman Long @ 2015-06-15 22:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnd Bergmann
  Cc: linux-arch, linux-kernel, Will Deacon, Scott J Norton,
	Douglas Hatch, Waiman Long

v2->v3:
 - Fix incorrect commit log message in patch 1.

v1->v2:
 - Add microbenchmark data for the second patch

This patch set contains 2 patches for qrwlock. The first one optimizes
the interrupt-context reader slowpath. The second one optimizes the
writer slowpath.

Waiman Long (2):
  locking/qrwlock: Better optimization for interrupt context readers
  locking/qrwlock: Don't contend with readers when setting _QW_WAITING

 include/asm-generic/qrwlock.h |    4 +-
 kernel/locking/qrwlock.c      |   42 +++++++++++++++++++++++++++++++---------
 2 files changed, 34 insertions(+), 12 deletions(-)


* [PATCH v3 1/2] locking/qrwlock: Better optimization for interrupt context readers
  2015-06-15 22:24 [PATCH v3 0/2] locking/qrwlock: More optimizations in qrwlock Waiman Long
@ 2015-06-15 22:24 ` Waiman Long
  2015-06-16 12:17   ` Will Deacon
  2015-06-15 22:24 ` [PATCH v3 2/2] locking/qrwlock: Don't contend with readers when setting _QW_WAITING Waiman Long
  1 sibling, 1 reply; 9+ messages in thread
From: Waiman Long @ 2015-06-15 22:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnd Bergmann
  Cc: linux-arch, linux-kernel, Will Deacon, Scott J Norton,
	Douglas Hatch, Waiman Long

The qrwlock is fair in process context, but becomes unfair in interrupt
context in order to support use cases like the tasklist_lock.

The current code isn't well documented about what happens in interrupt
context. rspin_until_writer_unlock() will only spin if the writer has
already gotten the lock. If the writer is still in the waiting state,
the increment of the reader count will cause the writer to remain in
the waiting state and the new interrupt-context reader will get the
lock and return immediately. The current code, however, does an
additional read of the lock value which is not necessary, as the
information is already available from the fast path. This may sometimes
cause an additional cacheline load when the lock is highly contended.
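
For reference, rspin_until_writer_unlock() currently looks roughly like
this in kernel/locking/qrwlock.c (paraphrased here for context; not
part of this patch):

	static __always_inline void
	rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
	{
		/* spin only while a writer actually holds the lock */
		while ((cnts & _QW_WMASK) == _QW_LOCKED) {
			cpu_relax_lowlatency();
			cnts = smp_load_acquire((u32 *)&lock->cnts);
		}
	}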

This patch passes the lock value obtained in the fast path to the
slow path to eliminate the additional read. It also documents the
behavior of the interrupt-context readers more explicitly.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/asm-generic/qrwlock.h |    4 ++--
 kernel/locking/qrwlock.c      |   14 ++++++++------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 6383d54..865d021 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -36,7 +36,7 @@
 /*
  * External function declarations
  */
-extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts);
 extern void queue_write_lock_slowpath(struct qrwlock *lock);
 
 /**
@@ -105,7 +105,7 @@ static inline void queue_read_lock(struct qrwlock *lock)
 		return;
 
 	/* The slowpath will decrement the reader count, if necessary. */
-	queue_read_lock_slowpath(lock);
+	queue_read_lock_slowpath(lock, cnts);
 }
 
 /**
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 00c12bb..d7d7557 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -43,22 +43,24 @@ rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
  * queue_read_lock_slowpath - acquire read lock of a queue rwlock
  * @lock: Pointer to queue rwlock structure
  */
-void queue_read_lock_slowpath(struct qrwlock *lock)
+void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 {
-	u32 cnts;
-
 	/*
 	 * Readers come here when they cannot get the lock without waiting
 	 */
 	if (unlikely(in_interrupt())) {
 		/*
-		 * Readers in interrupt context will spin until the lock is
-		 * available without waiting in the queue.
+		 * Readers in interrupt context will get the lock immediately
+		 * if the writer is just waiting (not holding the lock yet)
+		 * or they will spin until the lock is available without
+		 * waiting in the queue.
 		 */
-		cnts = smp_load_acquire((u32 *)&lock->cnts);
+		if ((cnts & _QW_WMASK) != _QW_LOCKED)
+			return;
 		rspin_until_writer_unlock(lock, cnts);
 		return;
 	}
+
 	atomic_sub(_QR_BIAS, &lock->cnts);
 
 	/*
-- 
1.7.1


* [PATCH v3 2/2] locking/qrwlock: Don't contend with readers when setting _QW_WAITING
  2015-06-15 22:24 [PATCH v3 0/2] locking/qrwlock: More optimizations in qrwlock Waiman Long
  2015-06-15 22:24 ` [PATCH v3 1/2] locking/qrwlock: Better optimization for interrupt context readers Waiman Long
@ 2015-06-15 22:24 ` Waiman Long
  2015-06-16 18:02   ` Will Deacon
  1 sibling, 1 reply; 9+ messages in thread
From: Waiman Long @ 2015-06-15 22:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnd Bergmann
  Cc: linux-arch, linux-kernel, Will Deacon, Scott J Norton,
	Douglas Hatch, Waiman Long

The current cmpxchg() loop that sets the _QW_WAITING flag for writers
in queue_write_lock_slowpath() will contend with incoming readers,
possibly causing extra, wasteful cmpxchg() operations. This patch
changes the code to do a byte cmpxchg() to eliminate contention with
new readers.
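
For context, the writer state lives in the low 8 bits of the lock word,
so a byte-sized cmpxchg() on that byte does not touch the reader count.
The relevant definitions in include/asm-generic/qrwlock.h are roughly
as follows (reproduced here for illustration; not part of this patch):

	#define	_QW_WAITING	1	/* A writer is waiting	   */
	#define	_QW_LOCKED	0xff	/* A writer holds the lock */
	#define	_QW_WMASK	0xff	/* Writer mask		   */
	#define	_QR_SHIFT	8	/* Reader count shift	   */
	#define	_QR_BIAS	(1U << _QR_SHIFT)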

A multithreaded microbenchmark running a 5M read_lock/write_lock loop
on an 8-socket, 80-core Westmere-EX machine running a 4.0-based kernel
with the qspinlock patch has the following execution times (in ms)
with and without the patch:

With R:W ratio = 5:1

	Threads	   w/o patch	with patch	% change
	-------	   ---------	----------	--------
	   2	     990 	    895		  -9.6%
	   3	    2136 	   1912		 -10.5%
	   4	    3166	   2830		 -10.6%
	   5	    3953	   3629		  -8.2%
	   6	    4628	   4405		  -4.8%
	   7	    5344	   5197		  -2.8%
	   8	    6065	   6004		  -1.0%
	   9	    6826	   6811		  -0.2%
	  10	    7599	   7599		   0.0%
	  15	    9757	   9766		  +0.1%
	  20	   13767	  13817		  +0.4%

With a small number of contending threads, this patch can improve
locking performance by up to 10%. With more contending threads,
however, the gain diminishes.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 kernel/locking/qrwlock.c |   28 ++++++++++++++++++++++++----
 1 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index d7d7557..559198a 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -22,6 +22,26 @@
 #include <linux/hardirq.h>
 #include <asm/qrwlock.h>
 
+/*
+ * This internal data structure is used for optimizing access to some of
+ * the subfields within the atomic_t cnts.
+ */
+struct __qrwlock {
+	union {
+		atomic_t cnts;
+		struct {
+#ifdef __LITTLE_ENDIAN
+			u8 wmode;	/* Writer mode   */
+			u8 rcnts[3];	/* Reader counts */
+#else
+			u8 rcnts[3];	/* Reader counts */
+			u8 wmode;	/* Writer mode   */
+#endif
+		};
+	};
+	arch_spinlock_t	lock;
+};
+
 /**
  * rspin_until_writer_unlock - inc reader count & spin until writer is gone
  * @lock  : Pointer to queue rwlock structure
@@ -109,10 +129,10 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
 	 * or wait for a previous writer to go away.
 	 */
 	for (;;) {
-		cnts = atomic_read(&lock->cnts);
-		if (!(cnts & _QW_WMASK) &&
-		    (atomic_cmpxchg(&lock->cnts, cnts,
-				    cnts | _QW_WAITING) == cnts))
+		struct __qrwlock *l = (struct __qrwlock *)lock;
+
+		if (!READ_ONCE(l->wmode) &&
+		   (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
 			break;
 
 		cpu_relax_lowlatency();
-- 
1.7.1


* Re: [PATCH v3 1/2] locking/qrwlock: Better optimization for interrupt context readers
  2015-06-15 22:24 ` [PATCH v3 1/2] locking/qrwlock: Better optimization for interrupt context readers Waiman Long
@ 2015-06-16 12:17   ` Will Deacon
  2015-06-18  1:30     ` Waiman Long
  0 siblings, 1 reply; 9+ messages in thread
From: Will Deacon @ 2015-06-16 12:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, linux-arch,
	linux-kernel, Scott J Norton, Douglas Hatch

Hi Waiman,

On Mon, Jun 15, 2015 at 11:24:02PM +0100, Waiman Long wrote:
> The qrwlock is fair in the process context, but becoming unfair when
> in the interrupt context to support use cases like the tasklist_lock.
> 
> The current code isn't that well-documented on what happens when
> in the interrupt context. The rspin_until_writer_unlock() will only
> spin if the writer has gotten the lock. If the writer is still in the
> waiting state, the increment in the reader count will cause the writer
> to remain in the waiting state and the new interrupt context reader
> will get the lock and return immediately. The current code, however,
> do an additional read of the lock value which is not necessary as the
> information have already been there in the fast path. This may sometime
> cause an additional cacheline load when the lock is highly contended.
> 
> This patch passes the lock value information gotten in the fast path
> to the slow path to eliminate the additional read. It also clarify the
> action for the interrupt context readers more explicitly.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  include/asm-generic/qrwlock.h |    4 ++--
>  kernel/locking/qrwlock.c      |   14 ++++++++------
>  2 files changed, 10 insertions(+), 8 deletions(-)

[...]

> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index 00c12bb..d7d7557 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -43,22 +43,24 @@ rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
>   * queue_read_lock_slowpath - acquire read lock of a queue rwlock
>   * @lock: Pointer to queue rwlock structure
>   */
> -void queue_read_lock_slowpath(struct qrwlock *lock)
> +void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>  {
> -	u32 cnts;
> -
>  	/*
>  	 * Readers come here when they cannot get the lock without waiting
>  	 */
>  	if (unlikely(in_interrupt())) {
>  		/*
> -		 * Readers in interrupt context will spin until the lock is
> -		 * available without waiting in the queue.
> +		 * Readers in interrupt context will get the lock immediately
> +		 * if the writer is just waiting (not holding the lock yet)
> +		 * or they will spin until the lock is available without
> +		 * waiting in the queue.
>  		 */
> -		cnts = smp_load_acquire((u32 *)&lock->cnts);
> +		if ((cnts & _QW_WMASK) != _QW_LOCKED)
> +			return;

I really doubt the check here is gaining you any performance, given
rspin_until_writer_unlock() does the same check immediately and should
be inlined. Just dropping the acquire and passing cnts through should
be sufficient.

Will

* Re: [PATCH v3 2/2] locking/qrwlock: Don't contend with readers when setting _QW_WAITING
  2015-06-15 22:24 ` [PATCH v3 2/2] locking/qrwlock: Don't contend with readers when setting _QW_WAITING Waiman Long
@ 2015-06-16 18:02   ` Will Deacon
  2015-06-18  1:33     ` Waiman Long
  0 siblings, 1 reply; 9+ messages in thread
From: Will Deacon @ 2015-06-16 18:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, linux-arch,
	linux-kernel, Scott J Norton, Douglas Hatch

On Mon, Jun 15, 2015 at 11:24:03PM +0100, Waiman Long wrote:
> The current cmpxchg() loop in setting the _QW_WAITING flag for writers
> in queue_write_lock_slowpath() will contend with incoming readers
> causing possibly extra cmpxchg() operations that are wasteful. This
> patch changes the code to do a byte cmpxchg() to eliminate contention
> with new readers.
> 
> A multithreaded microbenchmark running 5M read_lock/write_lock loop
> on a 8-socket 80-core Westmere-EX machine running 4.0 based kernel
> with the qspinlock patch have the following execution times (in ms)
> with and without the patch:
> 
> With R:W ratio = 5:1
> 
> 	Threads	   w/o patch	with patch	% change
> 	-------	   ---------	----------	--------
> 	   2	     990 	    895		  -9.6%
> 	   3	    2136 	   1912		 -10.5%
> 	   4	    3166	   2830		 -10.6%
> 	   5	    3953	   3629		  -8.2%
> 	   6	    4628	   4405		  -4.8%
> 	   7	    5344	   5197		  -2.8%
> 	   8	    6065	   6004		  -1.0%
> 	   9	    6826	   6811		  -0.2%
> 	  10	    7599	   7599		   0.0%
> 	  15	    9757	   9766		  +0.1%
> 	  20	   13767	  13817		  +0.4%
> 
> With small number of contending threads, this patch can improve
> locking performance by up to 10%. With more contending threads,
> however, the gain diminishes.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  kernel/locking/qrwlock.c |   28 ++++++++++++++++++++++++----
>  1 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index d7d7557..559198a 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -22,6 +22,26 @@
>  #include <linux/hardirq.h>
>  #include <asm/qrwlock.h>
>  
> +/*
> + * This internal data structure is used for optimizing access to some of
> + * the subfields within the atomic_t cnts.
> + */
> +struct __qrwlock {
> +	union {
> +		atomic_t cnts;
> +		struct {
> +#ifdef __LITTLE_ENDIAN
> +			u8 wmode;	/* Writer mode   */
> +			u8 rcnts[3];	/* Reader counts */
> +#else
> +			u8 rcnts[3];	/* Reader counts */
> +			u8 wmode;	/* Writer mode   */
> +#endif
> +		};
> +	};
> +	arch_spinlock_t	lock;
> +};
> +
>  /**
>   * rspin_until_writer_unlock - inc reader count & spin until writer is gone
>   * @lock  : Pointer to queue rwlock structure
> @@ -109,10 +129,10 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
>  	 * or wait for a previous writer to go away.
>  	 */
>  	for (;;) {
> -		cnts = atomic_read(&lock->cnts);
> -		if (!(cnts & _QW_WMASK) &&
> -		    (atomic_cmpxchg(&lock->cnts, cnts,
> -				    cnts | _QW_WAITING) == cnts))
> +		struct __qrwlock *l = (struct __qrwlock *)lock;
> +
> +		if (!READ_ONCE(l->wmode) &&
> +		   (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
>  			break;

Maybe you could also update the x86 implementation of queue_write_unlock
to write the wmode field instead of casting to u8 *?
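
(For reference, the x86 override being referred to is along the lines
of the sketch below -- paraphrased from that era's
arch/x86/include/asm/qrwlock.h, details may differ:

	static inline void queue_write_unlock(struct qrwlock *lock)
	{
		barrier();
		ACCESS_ONCE(*(u8 *)&lock->cnts) = 0;	/* clear writer byte */
	}

The (u8 *) cast only clears the writer byte because x86 is
little-endian, which is why writing the wmode field would be the more
general form.)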

Will

* Re: [PATCH v3 1/2] locking/qrwlock: Better optimization for interrupt context readers
  2015-06-16 12:17   ` Will Deacon
@ 2015-06-18  1:30     ` Waiman Long
  0 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2015-06-18  1:30 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, linux-arch,
	linux-kernel, Scott J Norton, Douglas Hatch

On 06/16/2015 08:17 AM, Will Deacon wrote:
> Hi Waiman,
>
> On Mon, Jun 15, 2015 at 11:24:02PM +0100, Waiman Long wrote:
>> The qrwlock is fair in the process context, but becoming unfair when
>> in the interrupt context to support use cases like the tasklist_lock.
>>
>> The current code isn't that well-documented on what happens when
>> in the interrupt context. The rspin_until_writer_unlock() will only
>> spin if the writer has gotten the lock. If the writer is still in the
>> waiting state, the increment in the reader count will cause the writer
>> to remain in the waiting state and the new interrupt context reader
>> will get the lock and return immediately. The current code, however,
>> do an additional read of the lock value which is not necessary as the
>> information have already been there in the fast path. This may sometime
>> cause an additional cacheline load when the lock is highly contended.
>>
>> This patch passes the lock value information gotten in the fast path
>> to the slow path to eliminate the additional read. It also clarify the
>> action for the interrupt context readers more explicitly.
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
>> ---
>>   include/asm-generic/qrwlock.h |    4 ++--
>>   kernel/locking/qrwlock.c      |   14 ++++++++------
>>   2 files changed, 10 insertions(+), 8 deletions(-)
> [...]
>
>> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
>> index 00c12bb..d7d7557 100644
>> --- a/kernel/locking/qrwlock.c
>> +++ b/kernel/locking/qrwlock.c
>> @@ -43,22 +43,24 @@ rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
>>    * queue_read_lock_slowpath - acquire read lock of a queue rwlock
>>    * @lock: Pointer to queue rwlock structure
>>    */
>> -void queue_read_lock_slowpath(struct qrwlock *lock)
>> +void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>>   {
>> -	u32 cnts;
>> -
>>   	/*
>>   	 * Readers come here when they cannot get the lock without waiting
>>   	 */
>>   	if (unlikely(in_interrupt())) {
>>   		/*
>> -		 * Readers in interrupt context will spin until the lock is
>> -		 * available without waiting in the queue.
>> +		 * Readers in interrupt context will get the lock immediately
>> +		 * if the writer is just waiting (not holding the lock yet)
>> +		 * or they will spin until the lock is available without
>> +		 * waiting in the queue.
>>   		 */
>> -		cnts = smp_load_acquire((u32 *)&lock->cnts);
>> +		if ((cnts&  _QW_WMASK) != _QW_LOCKED)
>> +			return;
> I really doubt the check here is gaining you any performance, given
> rspin_until_write_unlock does the same check immediately and should be
> inlined. Just dropping the acquire and passing cnts through should be
> sufficient.

Yes, you are right. I can just pass cnts to
rspin_until_writer_unlock() and be done with it.
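
A minimal sketch of what the interrupt-context path would then reduce
to, assuming rspin_until_writer_unlock() keeps its current signature:

	if (unlikely(in_interrupt())) {
		/*
		 * Readers in interrupt context will get the lock
		 * immediately if the writer is just waiting (not
		 * holding the lock yet), or they will spin until
		 * the lock is available without waiting in the queue.
		 */
		rspin_until_writer_unlock(lock, cnts);
		return;
	}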

Cheers,
Longman

* Re: [PATCH v3 2/2] locking/qrwlock: Don't contend with readers when setting _QW_WAITING
  2015-06-16 18:02   ` Will Deacon
@ 2015-06-18  1:33     ` Waiman Long
  2015-06-18 12:40       ` Will Deacon
  0 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2015-06-18  1:33 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, linux-arch,
	linux-kernel, Scott J Norton, Douglas Hatch

On 06/16/2015 02:02 PM, Will Deacon wrote:
> On Mon, Jun 15, 2015 at 11:24:03PM +0100, Waiman Long wrote:
>> The current cmpxchg() loop in setting the _QW_WAITING flag for writers
>> in queue_write_lock_slowpath() will contend with incoming readers
>> causing possibly extra cmpxchg() operations that are wasteful. This
>> patch changes the code to do a byte cmpxchg() to eliminate contention
>> with new readers.
>>
>> A multithreaded microbenchmark running 5M read_lock/write_lock loop
>> on a 8-socket 80-core Westmere-EX machine running 4.0 based kernel
>> with the qspinlock patch have the following execution times (in ms)
>> with and without the patch:
>>
>> With R:W ratio = 5:1
>>
>> 	Threads	   w/o patch	with patch	% change
>> 	-------	   ---------	----------	--------
>> 	   2	     990 	    895		  -9.6%
>> 	   3	    2136 	   1912		 -10.5%
>> 	   4	    3166	   2830		 -10.6%
>> 	   5	    3953	   3629		  -8.2%
>> 	   6	    4628	   4405		  -4.8%
>> 	   7	    5344	   5197		  -2.8%
>> 	   8	    6065	   6004		  -1.0%
>> 	   9	    6826	   6811		  -0.2%
>> 	  10	    7599	   7599		   0.0%
>> 	  15	    9757	   9766		  +0.1%
>> 	  20	   13767	  13817		  +0.4%
>>
>> With small number of contending threads, this patch can improve
>> locking performance by up to 10%. With more contending threads,
>> however, the gain diminishes.
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
>> ---
>>   kernel/locking/qrwlock.c |   28 ++++++++++++++++++++++++----
>>   1 files changed, 24 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
>> index d7d7557..559198a 100644
>> --- a/kernel/locking/qrwlock.c
>> +++ b/kernel/locking/qrwlock.c
>> @@ -22,6 +22,26 @@
>>   #include<linux/hardirq.h>
>>   #include<asm/qrwlock.h>
>>
>> +/*
>> + * This internal data structure is used for optimizing access to some of
>> + * the subfields within the atomic_t cnts.
>> + */
>> +struct __qrwlock {
>> +	union {
>> +		atomic_t cnts;
>> +		struct {
>> +#ifdef __LITTLE_ENDIAN
>> +			u8 wmode;	/* Writer mode   */
>> +			u8 rcnts[3];	/* Reader counts */
>> +#else
>> +			u8 rcnts[3];	/* Reader counts */
>> +			u8 wmode;	/* Writer mode   */
>> +#endif
>> +		};
>> +	};
>> +	arch_spinlock_t	lock;
>> +};
>> +
>>   /**
>>    * rspin_until_writer_unlock - inc reader count&  spin until writer is gone
>>    * @lock  : Pointer to queue rwlock structure
>> @@ -109,10 +129,10 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
>>   	 * or wait for a previous writer to go away.
>>   	 */
>>   	for (;;) {
>> -		cnts = atomic_read(&lock->cnts);
>> -		if (!(cnts&  _QW_WMASK)&&
>> -		    (atomic_cmpxchg(&lock->cnts, cnts,
>> -				    cnts | _QW_WAITING) == cnts))
>> +		struct __qrwlock *l = (struct __qrwlock *)lock;
>> +
>> +		if (!READ_ONCE(l->wmode)&&
>> +		   (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
>>   			break;
> Maybe you could also update the x86 implementation of queue_write_unlock
> to write the wmode field instead of casting to u8 *?
>
> Will

The queue_write_unlock() function is in the header file. I don't want to 
expose the internal structure to other files.

Cheers,
Longman



* Re: [PATCH v3 2/2] locking/qrwlock: Don't contend with readers when setting _QW_WAITING
  2015-06-18  1:33     ` Waiman Long
@ 2015-06-18 12:40       ` Will Deacon
  2015-06-18 22:14         ` Waiman Long
  0 siblings, 1 reply; 9+ messages in thread
From: Will Deacon @ 2015-06-18 12:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, linux-arch,
	linux-kernel, Scott J Norton, Douglas Hatch

On Thu, Jun 18, 2015 at 02:33:56AM +0100, Waiman Long wrote:
> On 06/16/2015 02:02 PM, Will Deacon wrote:
> > On Mon, Jun 15, 2015 at 11:24:03PM +0100, Waiman Long wrote:
> >> The current cmpxchg() loop in setting the _QW_WAITING flag for writers
> >> in queue_write_lock_slowpath() will contend with incoming readers
> >> causing possibly extra cmpxchg() operations that are wasteful. This
> >> patch changes the code to do a byte cmpxchg() to eliminate contention
> >> with new readers.
> >>
> >> A multithreaded microbenchmark running 5M read_lock/write_lock loop
> >> on a 8-socket 80-core Westmere-EX machine running 4.0 based kernel
> >> with the qspinlock patch have the following execution times (in ms)
> >> with and without the patch:
> >>
> >> With R:W ratio = 5:1
> >>
> >> 	Threads	   w/o patch	with patch	% change
> >> 	-------	   ---------	----------	--------
> >> 	   2	     990 	    895		  -9.6%
> >> 	   3	    2136 	   1912		 -10.5%
> >> 	   4	    3166	   2830		 -10.6%
> >> 	   5	    3953	   3629		  -8.2%
> >> 	   6	    4628	   4405		  -4.8%
> >> 	   7	    5344	   5197		  -2.8%
> >> 	   8	    6065	   6004		  -1.0%
> >> 	   9	    6826	   6811		  -0.2%
> >> 	  10	    7599	   7599		   0.0%
> >> 	  15	    9757	   9766		  +0.1%
> >> 	  20	   13767	  13817		  +0.4%
> >>
> >> With small number of contending threads, this patch can improve
> >> locking performance by up to 10%. With more contending threads,
> >> however, the gain diminishes.
> >>
> >> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
> >> ---
> >>   kernel/locking/qrwlock.c |   28 ++++++++++++++++++++++++----
> >>   1 files changed, 24 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> >> index d7d7557..559198a 100644
> >> --- a/kernel/locking/qrwlock.c
> >> +++ b/kernel/locking/qrwlock.c
> >> @@ -22,6 +22,26 @@
> >>   #include<linux/hardirq.h>
> >>   #include<asm/qrwlock.h>
> >>
> >> +/*
> >> + * This internal data structure is used for optimizing access to some of
> >> + * the subfields within the atomic_t cnts.
> >> + */
> >> +struct __qrwlock {
> >> +	union {
> >> +		atomic_t cnts;
> >> +		struct {
> >> +#ifdef __LITTLE_ENDIAN
> >> +			u8 wmode;	/* Writer mode   */
> >> +			u8 rcnts[3];	/* Reader counts */
> >> +#else
> >> +			u8 rcnts[3];	/* Reader counts */
> >> +			u8 wmode;	/* Writer mode   */
> >> +#endif
> >> +		};
> >> +	};
> >> +	arch_spinlock_t	lock;
> >> +};
> >> +
> >>   /**
> >>    * rspin_until_writer_unlock - inc reader count&  spin until writer is gone
> >>    * @lock  : Pointer to queue rwlock structure
> >> @@ -109,10 +129,10 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
> >>   	 * or wait for a previous writer to go away.
> >>   	 */
> >>   	for (;;) {
> >> -		cnts = atomic_read(&lock->cnts);
> >> -		if (!(cnts&  _QW_WMASK)&&
> >> -		    (atomic_cmpxchg(&lock->cnts, cnts,
> >> -				    cnts | _QW_WAITING) == cnts))
> >> +		struct __qrwlock *l = (struct __qrwlock *)lock;
> >> +
> >> +		if (!READ_ONCE(l->wmode)&&
> >> +		   (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
> >>   			break;
> > Maybe you could also update the x86 implementation of queue_write_unlock
> > to write the wmode field instead of casting to u8 *?
> >
> The queue_write_unlock() function is in the header file. I don't want to 
> expose the internal structure to other files.

Then I don't see the value in the new data structure -- why not just cast
to u8 * instead? In my mind, the structure has the advantage of supporting
both big and little endian systems, but to be useful it would need to be
available in the header file for architectures that choose to override
queue_write_unlock.

As an aside, I have some patches to get this up and running on arm64
which would need something like this structure for the big-endian case.

Will

* Re: [PATCH v3 2/2] locking/qrwlock: Don't contend with readers when setting _QW_WAITING
  2015-06-18 12:40       ` Will Deacon
@ 2015-06-18 22:14         ` Waiman Long
  0 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2015-06-18 22:14 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Ingo Molnar, Arnd Bergmann, linux-arch,
	linux-kernel, Scott J Norton, Douglas Hatch

On 06/18/2015 08:40 AM, Will Deacon wrote:
> On Thu, Jun 18, 2015 at 02:33:56AM +0100, Waiman Long wrote:
>> On 06/16/2015 02:02 PM, Will Deacon wrote:
>>> On Mon, Jun 15, 2015 at 11:24:03PM +0100, Waiman Long wrote:
>>>> The current cmpxchg() loop in setting the _QW_WAITING flag for writers
>>>> in queue_write_lock_slowpath() will contend with incoming readers
>>>> causing possibly extra cmpxchg() operations that are wasteful. This
>>>> patch changes the code to do a byte cmpxchg() to eliminate contention
>>>> with new readers.
>>>>
>>>> A multithreaded microbenchmark running 5M read_lock/write_lock loop
>>>> on a 8-socket 80-core Westmere-EX machine running 4.0 based kernel
>>>> with the qspinlock patch have the following execution times (in ms)
>>>> with and without the patch:
>>>>
>>>> With R:W ratio = 5:1
>>>>
>>>> 	Threads	   w/o patch	with patch	% change
>>>> 	-------	   ---------	----------	--------
>>>> 	   2	     990 	    895		  -9.6%
>>>> 	   3	    2136 	   1912		 -10.5%
>>>> 	   4	    3166	   2830		 -10.6%
>>>> 	   5	    3953	   3629		  -8.2%
>>>> 	   6	    4628	   4405		  -4.8%
>>>> 	   7	    5344	   5197		  -2.8%
>>>> 	   8	    6065	   6004		  -1.0%
>>>> 	   9	    6826	   6811		  -0.2%
>>>> 	  10	    7599	   7599		   0.0%
>>>> 	  15	    9757	   9766		  +0.1%
>>>> 	  20	   13767	  13817		  +0.4%
>>>>
>>>> With small number of contending threads, this patch can improve
>>>> locking performance by up to 10%. With more contending threads,
>>>> however, the gain diminishes.
>>>>
>>>> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
>>>> ---
>>>>    kernel/locking/qrwlock.c |   28 ++++++++++++++++++++++++----
>>>>    1 files changed, 24 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
>>>> index d7d7557..559198a 100644
>>>> --- a/kernel/locking/qrwlock.c
>>>> +++ b/kernel/locking/qrwlock.c
>>>> @@ -22,6 +22,26 @@
>>>>    #include<linux/hardirq.h>
>>>>    #include<asm/qrwlock.h>
>>>>
>>>> +/*
>>>> + * This internal data structure is used for optimizing access to some of
>>>> + * the subfields within the atomic_t cnts.
>>>> + */
>>>> +struct __qrwlock {
>>>> +	union {
>>>> +		atomic_t cnts;
>>>> +		struct {
>>>> +#ifdef __LITTLE_ENDIAN
>>>> +			u8 wmode;	/* Writer mode   */
>>>> +			u8 rcnts[3];	/* Reader counts */
>>>> +#else
>>>> +			u8 rcnts[3];	/* Reader counts */
>>>> +			u8 wmode;	/* Writer mode   */
>>>> +#endif
>>>> +		};
>>>> +	};
>>>> +	arch_spinlock_t	lock;
>>>> +};
>>>> +
>>>>    /**
>>>>     * rspin_until_writer_unlock - inc reader count&   spin until writer is gone
>>>>     * @lock  : Pointer to queue rwlock structure
>>>> @@ -109,10 +129,10 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
>>>>    	 * or wait for a previous writer to go away.
>>>>    	 */
>>>>    	for (;;) {
>>>> -		cnts = atomic_read(&lock->cnts);
>>>> -		if (!(cnts&   _QW_WMASK)&&
>>>> -		    (atomic_cmpxchg(&lock->cnts, cnts,
>>>> -				    cnts | _QW_WAITING) == cnts))
>>>> +		struct __qrwlock *l = (struct __qrwlock *)lock;
>>>> +
>>>> +		if (!READ_ONCE(l->wmode)&&
>>>> +		   (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
>>>>    			break;
>>> Maybe you could also update the x86 implementation of queue_write_unlock
>>> to write the wmode field instead of casting to u8 *?
>>>
>> The queue_write_unlock() function is in the header file. I don't want to
>> expose the internal structure to other files.
> Then I don't see the value in the new data structure -- why not just cast
> to u8 * instead? In my mind, the structure has the advantage of supporting
> both big and little endian systems, but to be useful it would need to be
> available in the header file for architectures that chose to override
> queue_write_unlock.

Casting to (u8 *) directly will require ugly endian-conditional
compilation code in the function. It is much easier to read and
understand when that is handled in the data structure instead.
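
For illustration only (a hypothetical sketch, not proposed code), doing
the cast directly in the header would end up looking something like:

	static inline void queue_write_unlock(struct qrwlock *lock)
	{
	#ifdef __LITTLE_ENDIAN
		u8 *wmode = (u8 *)&lock->cnts;		/* low byte  */
	#else
		u8 *wmode = (u8 *)&lock->cnts + 3;	/* high byte */
	#endif
		smp_store_release(wmode, 0);
	}

whereas with struct __qrwlock visible, the writer byte is simply
l->wmode on either endianness.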

> As an aside, I have some patches to get this up and running on arm64
> which would need something like this structure for the big-endian case.

If there are going to be other consumers of the internal structure, I
think it will be worthwhile to put it into the header file directly. I
will update the patch to make that change.

Cheers,
Longman

Thread overview: 9+ messages
2015-06-15 22:24 [PATCH v3 0/2] locking/qrwlock: More optimizations in qrwlock Waiman Long
2015-06-15 22:24 ` [PATCH v3 1/2] locking/qrwlock: Better optimization for interrupt context readers Waiman Long
2015-06-16 12:17   ` Will Deacon
2015-06-18  1:30     ` Waiman Long
2015-06-15 22:24 ` [PATCH v3 2/2] locking/qrwlock: Don't contend with readers when setting _QW_WAITING Waiman Long
2015-06-16 18:02   ` Will Deacon
2015-06-18  1:33     ` Waiman Long
2015-06-18 12:40       ` Will Deacon
2015-06-18 22:14         ` Waiman Long
