From: Waiman Long <Waiman.Long@hp.com> To: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>, Peter Zijlstra <peterz@infradead.org> Cc: linux-arch@vger.kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, xen-devel@lists.xenproject.org, kvm@vger.kernel.org, Paolo Bonzini <paolo.bonzini@gmail.com>, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, Boris Ostrovsky <boris.ostrovsky@oracle.com>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, Rik van Riel <riel@redhat.com>, Linus Torvalds <torvalds@linux-foundation.org>, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>, David Vrabel <david.vrabel@citrix.com>, Oleg Nesterov <oleg@redhat.com>, Gleb Natapov <gleb@redhat.com>, Scott J Norton <scott.norton@hp.com>, Chegu Vinod <chegu_vinod@hp.com>, Waiman Long <Waiman.Long@hp.com> Subject: [PATCH v11 07/16] qspinlock: Use a simple write to grab the lock, if applicable Date: Fri, 30 May 2014 11:43:53 -0400 [thread overview] Message-ID: <1401464642-33890-8-git-send-email-Waiman.Long@hp.com> (raw) In-Reply-To: <1401464642-33890-1-git-send-email-Waiman.Long@hp.com> Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head don't need to reset the queue code word. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the queue code word (tail) in the lock is set. With that change, the are some slight improvement in the performance of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket Westere-EX machine as shown in the tables below. [Standalone/Embedded - same node] # of tasks Before patch After patch %Change ---------- ----------- ---------- ------- 3 2324/2321 2248/2265 -3%/-2% 4 2890/2896 2819/2831 -2%/-2% 5 3611/3595 3522/3512 -2%/-2% 6 4281/4276 4173/4160 -3%/-3% 7 5018/5001 4875/4861 -3%/-3% 8 5759/5750 5563/5568 -3%/-3% [Standalone/Embedded - different nodes] # of tasks Before patch After patch %Change ---------- ----------- ---------- ------- 3 12242/12237 12087/12093 -1%/-1% 4 10688/10696 10507/10521 -2%/-2% It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and was essentially to close the performance gap between the ticket spinlock and queue spinlock. The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. The results of the test runs were: AIM7 XFS Disk Test kernel JPM Real Time Sys Time Usr Time ----- --- --------- -------- -------- ticketlock 5678233 3.17 96.61 5.81 qspinlock 5750799 3.13 94.83 5.97 AIM7 EXT4 Disk Test kernel JPM Real Time Sys Time Usr Time ----- --- --------- -------- -------- ticketlock 1114551 16.15 509.72 7.11 qspinlock 2184466 8.24 232.99 6.01 The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run. The "ebizzy -m" test was also run with the following results: kernel records/s Real Time Sys Time Usr Time ----- --------- --------- -------- -------- ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 Signed-off-by: Waiman Long <Waiman.Long@hp.com> --- kernel/locking/qspinlock.c | 62 ++++++++++++++++++++++++++++++++----------- 1 files changed, 46 insertions(+), 16 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 7f10758..2c7abe7 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -93,24 +93,33 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) * By using the whole 2nd least significant byte for the pending bit, we * can allow better optimization of the lock acquisition for the pending * bit holder. + * + * This internal structure is also used by the set_locked function which + * is not restricted to _Q_PENDING_BITS == 8. */ -#if _Q_PENDING_BITS == 8 - struct __qspinlock { union { atomic_t val; - struct { #ifdef __LITTLE_ENDIAN + u8 locked; + struct { u16 locked_pending; u16 tail; + }; #else + struct { u16 tail; u16 locked_pending; -#endif }; + struct { + u8 reserved[3]; + u8 locked; + }; +#endif }; }; +#if _Q_PENDING_BITS == 8 /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -197,6 +206,22 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) #endif /* _Q_PENDING_BITS == 8 */ /** + * set_locked - Set the lock bit and own the lock + * @lock: Pointer to queue spinlock structure + * + * This routine should only be called when the caller is the only one + * entitled to acquire the lock. + */ +static __always_inline void set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + barrier(); + ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL; + barrier(); +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word @@ -332,10 +357,13 @@ queue: /* * we're at the head of the waitqueue, wait for the owner & pending to * go away. + * Load-acquired is used here because the set_locked() + * function below may not be a full memory barrier. * * *,x,y -> *,0,0 */ - while ((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK) + while ((val = smp_load_acquire(&lock->val.counter)) + & _Q_LOCKED_PENDING_MASK) arch_mutex_cpu_relax(); /* @@ -343,15 +371,19 @@ queue: * * n,0,0 -> 0,0,1 : lock, uncontended * *,0,0 -> *,0,1 : lock, contended + * + * If the queue head is the only one in the queue (lock value == tail), + * clear the tail code and grab the lock. Otherwise, we only need + * to grab the lock. */ for (;;) { - new = _Q_LOCKED_VAL; - if (val != tail) - new |= val; - - old = atomic_cmpxchg(&lock->val, val, new); - if (old == val) + if (val != tail) { + set_locked(lock); break; + } + old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL); + if (old == val) + goto release; /* No contention */ val = old; } @@ -359,12 +391,10 @@ queue: /* * contended path; wait for next, release. */ - if (new != _Q_LOCKED_VAL) { - while (!(next = ACCESS_ONCE(node->next))) - arch_mutex_cpu_relax(); + while (!(next = ACCESS_ONCE(node->next))) + arch_mutex_cpu_relax(); - arch_mcs_spin_unlock_contended(&next->locked); - } + arch_mcs_spin_unlock_contended(&next->locked); release: /* -- 1.7.1
WARNING: multiple messages have this Message-ID (diff)
From: Waiman Long <Waiman.Long@hp.com> To: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>, Peter Zijlstra <peterz@infradead.org> Cc: linux-arch@vger.kernel.org, Waiman Long <Waiman.Long@hp.com>, Rik van Riel <riel@redhat.com>, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>, Gleb Natapov <gleb@redhat.com>, kvm@vger.kernel.org, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, Scott J Norton <scott.norton@hp.com>, x86@kernel.org, Paolo Bonzini <paolo.bonzini@gmail.com>, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, Chegu Vinod <chegu_vinod@hp.com>, David Vrabel <david.vrabel@citrix.com>, Oleg Nesterov <oleg@redhat.com>, xen-devel@lists.xenproject.org, Boris Ostrovsky <boris.ostrovsky@oracle.com>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, Linus Torvalds <torvalds@linux-foundation.org> Subject: [PATCH v11 07/16] qspinlock: Use a simple write to grab the lock, if applicable Date: Fri, 30 May 2014 11:43:53 -0400 [thread overview] Message-ID: <1401464642-33890-8-git-send-email-Waiman.Long@hp.com> (raw) In-Reply-To: <1401464642-33890-1-git-send-email-Waiman.Long@hp.com> Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head don't need to reset the queue code word. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the queue code word (tail) in the lock is set. With that change, the are some slight improvement in the performance of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket Westere-EX machine as shown in the tables below. [Standalone/Embedded - same node] # of tasks Before patch After patch %Change ---------- ----------- ---------- ------- 3 2324/2321 2248/2265 -3%/-2% 4 2890/2896 2819/2831 -2%/-2% 5 3611/3595 3522/3512 -2%/-2% 6 4281/4276 4173/4160 -3%/-3% 7 5018/5001 4875/4861 -3%/-3% 8 5759/5750 5563/5568 -3%/-3% [Standalone/Embedded - different nodes] # of tasks Before patch After patch %Change ---------- ----------- ---------- ------- 3 12242/12237 12087/12093 -1%/-1% 4 10688/10696 10507/10521 -2%/-2% It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and was essentially to close the performance gap between the ticket spinlock and queue spinlock. The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. The results of the test runs were: AIM7 XFS Disk Test kernel JPM Real Time Sys Time Usr Time ----- --- --------- -------- -------- ticketlock 5678233 3.17 96.61 5.81 qspinlock 5750799 3.13 94.83 5.97 AIM7 EXT4 Disk Test kernel JPM Real Time Sys Time Usr Time ----- --- --------- -------- -------- ticketlock 1114551 16.15 509.72 7.11 qspinlock 2184466 8.24 232.99 6.01 The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run. The "ebizzy -m" test was also run with the following results: kernel records/s Real Time Sys Time Usr Time ----- --------- --------- -------- -------- ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 Signed-off-by: Waiman Long <Waiman.Long@hp.com> --- kernel/locking/qspinlock.c | 62 ++++++++++++++++++++++++++++++++----------- 1 files changed, 46 insertions(+), 16 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 7f10758..2c7abe7 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -93,24 +93,33 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) * By using the whole 2nd least significant byte for the pending bit, we * can allow better optimization of the lock acquisition for the pending * bit holder. + * + * This internal structure is also used by the set_locked function which + * is not restricted to _Q_PENDING_BITS == 8. */ -#if _Q_PENDING_BITS == 8 - struct __qspinlock { union { atomic_t val; - struct { #ifdef __LITTLE_ENDIAN + u8 locked; + struct { u16 locked_pending; u16 tail; + }; #else + struct { u16 tail; u16 locked_pending; -#endif }; + struct { + u8 reserved[3]; + u8 locked; + }; +#endif }; }; +#if _Q_PENDING_BITS == 8 /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -197,6 +206,22 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) #endif /* _Q_PENDING_BITS == 8 */ /** + * set_locked - Set the lock bit and own the lock + * @lock: Pointer to queue spinlock structure + * + * This routine should only be called when the caller is the only one + * entitled to acquire the lock. + */ +static __always_inline void set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + barrier(); + ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL; + barrier(); +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word @@ -332,10 +357,13 @@ queue: /* * we're at the head of the waitqueue, wait for the owner & pending to * go away. + * Load-acquired is used here because the set_locked() + * function below may not be a full memory barrier. * * *,x,y -> *,0,0 */ - while ((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK) + while ((val = smp_load_acquire(&lock->val.counter)) + & _Q_LOCKED_PENDING_MASK) arch_mutex_cpu_relax(); /* @@ -343,15 +371,19 @@ queue: * * n,0,0 -> 0,0,1 : lock, uncontended * *,0,0 -> *,0,1 : lock, contended + * + * If the queue head is the only one in the queue (lock value == tail), + * clear the tail code and grab the lock. Otherwise, we only need + * to grab the lock. */ for (;;) { - new = _Q_LOCKED_VAL; - if (val != tail) - new |= val; - - old = atomic_cmpxchg(&lock->val, val, new); - if (old == val) + if (val != tail) { + set_locked(lock); break; + } + old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL); + if (old == val) + goto release; /* No contention */ val = old; } @@ -359,12 +391,10 @@ queue: /* * contended path; wait for next, release. */ - if (new != _Q_LOCKED_VAL) { - while (!(next = ACCESS_ONCE(node->next))) - arch_mutex_cpu_relax(); + while (!(next = ACCESS_ONCE(node->next))) + arch_mutex_cpu_relax(); - arch_mcs_spin_unlock_contended(&next->locked); - } + arch_mcs_spin_unlock_contended(&next->locked); release: /* -- 1.7.1
next prev parent reply other threads:[~2014-05-30 15:45 UTC|newest] Thread overview: 102+ messages / expand[flat|nested] mbox.gz Atom feed top 2014-05-30 15:43 [PATCH v11 00/16] qspinlock: a 4-byte queue spinlock with PV support Waiman Long 2014-05-30 15:43 ` [PATCH v11 01/16] qspinlock: A simple generic 4-byte queue spinlock Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 02/16] qspinlock, x86: Enable x86-64 to use " Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 03/16] qspinlock: Add pending bit Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 04/16] qspinlock: Extract out the exchange of tail code word Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 05/16] qspinlock: Optimize for smaller NR_CPUS Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 06/16] qspinlock: prolong the stay in the pending bit path Waiman Long 2014-06-11 10:26 ` Peter Zijlstra 2014-06-11 10:26 ` Peter Zijlstra 2014-06-11 21:22 ` Long, Wai Man 2014-06-11 21:22 ` Long, Wai Man 2014-06-11 21:22 ` Long, Wai Man 2014-06-12 6:00 ` Peter Zijlstra 2014-06-12 20:54 ` Waiman Long 2014-06-12 20:54 ` Waiman Long 2014-06-15 13:12 ` Peter Zijlstra 2014-06-15 13:12 ` Peter Zijlstra 2014-06-15 13:12 ` Peter Zijlstra 2014-06-12 20:54 ` Waiman Long 2014-06-12 6:00 ` Peter Zijlstra 2014-06-12 6:00 ` Peter Zijlstra 2014-06-11 10:26 ` Peter Zijlstra 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long [this message] 2014-05-30 15:43 ` [PATCH v11 07/16] qspinlock: Use a simple write to grab the lock, if applicable Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 08/16] qspinlock: Prepare for unfair lock support Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 09/16] qspinlock, x86: Allow unfair spinlock in a virtual guest Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-06-11 10:54 ` Peter Zijlstra 2014-06-11 10:54 ` Peter Zijlstra 2014-06-11 11:38 ` Peter Zijlstra 2014-06-11 11:38 ` Peter Zijlstra 2014-06-11 11:38 ` Peter Zijlstra 2014-06-12 1:37 ` Long, Wai Man 2014-06-12 1:37 ` Long, Wai Man 2014-06-12 1:37 ` Long, Wai Man 2014-06-12 5:50 ` Peter Zijlstra 2014-06-12 5:50 ` Peter Zijlstra 2014-06-12 21:08 ` Waiman Long 2014-06-12 21:08 ` Waiman Long 2014-06-15 13:14 ` Peter Zijlstra 2014-06-15 13:14 ` Peter Zijlstra 2014-06-15 13:14 ` Peter Zijlstra 2014-06-12 21:08 ` Waiman Long 2014-06-12 5:50 ` Peter Zijlstra 2014-06-11 10:54 ` Peter Zijlstra 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 10/16] qspinlock: Split the MCS queuing code into a separate slowerpath Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 11/16] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 12/16] pvqspinlock, x86: Add PV data structure & methods Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` [PATCH v11 13/16] pvqspinlock: Enable coexistence with the unfair lock Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:43 ` Waiman Long 2014-05-30 15:44 ` [PATCH v11 14/16] pvqspinlock: Add qspinlock para-virtualization support Waiman Long 2014-06-12 8:17 ` Peter Zijlstra 2014-06-12 8:17 ` Peter Zijlstra 2014-06-12 20:48 ` Waiman Long 2014-06-12 20:48 ` Waiman Long 2014-06-15 13:16 ` Peter Zijlstra 2014-06-15 13:16 ` Peter Zijlstra 2014-06-15 13:16 ` Peter Zijlstra 2014-06-17 20:59 ` Konrad Rzeszutek Wilk 2014-06-17 20:59 ` Konrad Rzeszutek Wilk 2014-06-17 20:59 ` Konrad Rzeszutek Wilk 2014-06-12 20:48 ` Waiman Long 2014-06-12 8:17 ` Peter Zijlstra 2014-05-30 15:44 ` Waiman Long 2014-05-30 15:44 ` Waiman Long 2014-05-30 15:44 ` [PATCH v11 15/16] pvqspinlock, x86: Enable PV qspinlock PV for KVM Waiman Long 2014-05-30 15:44 ` Waiman Long 2014-05-30 15:44 ` Waiman Long 2014-05-30 15:44 ` [PATCH v11 16/16] pvqspinlock, x86: Enable PV qspinlock for XEN Waiman Long 2014-05-30 15:44 ` Waiman Long 2014-05-30 15:44 ` Waiman Long
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=1401464642-33890-8-git-send-email-Waiman.Long@hp.com \ --to=waiman.long@hp.com \ --cc=boris.ostrovsky@oracle.com \ --cc=chegu_vinod@hp.com \ --cc=david.vrabel@citrix.com \ --cc=gleb@redhat.com \ --cc=hpa@zytor.com \ --cc=konrad.wilk@oracle.com \ --cc=kvm@vger.kernel.org \ --cc=linux-arch@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=mingo@redhat.com \ --cc=oleg@redhat.com \ --cc=paolo.bonzini@gmail.com \ --cc=paulmck@linux.vnet.ibm.com \ --cc=peterz@infradead.org \ --cc=raghavendra.kt@linux.vnet.ibm.com \ --cc=riel@redhat.com \ --cc=scott.norton@hp.com \ --cc=tglx@linutronix.de \ --cc=torvalds@linux-foundation.org \ --cc=virtualization@lists.linux-foundation.org \ --cc=x86@kernel.org \ --cc=xen-devel@lists.xenproject.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.