From: Waiman Long <Waiman.Long@hp.com>
To: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
"H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
Scott J Norton <scott.norton@hp.com>,
Douglas Hatch <doug.hatch@hp.com>,
Davidlohr Bueso <dave@stgolabs.net>,
Waiman Long <Waiman.Long@hp.com>
Subject: [PATCH v3 7/7] locking/pvqspinlock, x86: Optimize PV unlock code path
Date: Wed, 22 Jul 2015 16:12:42 -0400
Message-ID: <1437595962-21472-8-git-send-email-Waiman.Long@hp.com>
In-Reply-To: <1437595962-21472-1-git-send-email-Waiman.Long@hp.com>
The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.
For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK(), which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also end up in a separate cacheline
from __pv_queued_spin_unlock().
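For reference, the thunk generated by PV_CALLEE_SAVE_REGS_THUNK() on
x86-64 has roughly the following shape (a sketch only; the exact
register list comes from PV_SAVE_ALL_CALLER_REGS in
arch/x86/include/asm/paravirt.h):

  __raw_callee_save___pv_queued_spin_unlock:
	push %rcx; push %rdx; push %rsi; push %rdi
	push %r8 ; push %r9 ; push %r10; push %r11
	call __pv_queued_spin_unlock
	pop  %r11; pop  %r10; pop  %r9 ; pop  %r8
	pop  %rdi; pop  %rsi; pop  %rdx; pop  %rcx
	ret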
This patch optimizes the PV unlock code path by:
1) Moving the unlock slowpath code from the fastpath into a separate
   pv_queued_spin_unlock_slowpath() function to make the fastpath
   as simple as possible.

2) Hand-coding an x86-64 assembly function that combines the
   register-saving thunk code with the fastpath code. Only the
   registers that are actually used in the fastpath are saved and
   restored. If the fastpath fails, the slowpath function is called
   via another PV_CALLEE_SAVE_REGS_THUNK(). On 32-bit, the code falls
   back to the C __pv_queued_spin_unlock(), as the thunk there saves
   and restores only one 32-bit register.
With a microbenchmark of 5M lock-unlock loops, the table below shows
the execution times before and after the patch with different numbers
of threads in a VM running on a 32-core Westmere-EX box (a sketch of
the per-thread test loop follows the table):
  Threads    Before patch    After patch    % Change
  -------    ------------    -----------    --------
     1          154.0 ms       119.2 ms       -23%
     2           1272 ms        1174 ms       -7.7%
     3           3705 ms        3349 ms       -9.6%
     4           3767 ms        3597 ms       -4.5%
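The per-thread test loop is essentially of the following form (a
minimal sketch of the assumed benchmark; the actual test harness is
not part of this patch):

	for (i = 0; i < 5000000; i++) {
		spin_lock(&lock);
		spin_unlock(&lock);
	}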
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
arch/x86/include/asm/qspinlock_paravirt.h | 60 +++++++++++++++++++++++++++++
kernel/locking/qspinlock_paravirt.h | 31 ++++++++++-----
2 files changed, 81 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index b002e71..46f0f82 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -1,6 +1,66 @@
#ifndef __ASM_QSPINLOCK_PARAVIRT_H
#define __ASM_QSPINLOCK_PARAVIRT_H
+/*
+ * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
+ * registers. For i386, however, only 1 32-bit register needs to be saved
+ * and restored. So an optimized version of __pv_queued_spin_unlock() is
+ * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
+ */
+#ifdef CONFIG_64BIT
+
+PV_CALLEE_SAVE_REGS_THUNK(pv_queued_spin_unlock_slowpath);
+#define __pv_queued_spin_unlock __pv_queued_spin_unlock
+#define PV_UNLOCK "__raw_callee_save___pv_queued_spin_unlock"
+#define PV_UNLOCK_SLOWPATH "__raw_callee_save_pv_queued_spin_unlock_slowpath"
+
+/*
+ * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
+ * which combines the register saving thunk and the body of the following
+ * C code:
+ *
+ * void __pv_queued_spin_unlock(struct qspinlock *lock)
+ * {
+ * struct __qspinlock *l = (void *)lock;
+ * u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *
+ * if (likely(lockval == _Q_LOCKED_VAL))
+ * return;
+ * pv_queued_spin_unlock_slowpath(lock, lockval);
+ * }
+ *
+ * For x86-64,
+ * rdi = lock (first argument)
+ * rsi = lockval (second argument)
+ * rdx = internal variable (set to 0)
+ */
+asm(".pushsection .text;"
+ ".globl " PV_UNLOCK ";"
+ ".align 4,0x90;"
+ PV_UNLOCK ": "
+ "push %rdx;"
+ "mov $0x1,%eax;"
+ "xor %edx,%edx;"
+ "lock cmpxchg %dl,(%rdi);"
+ "cmp $0x1,%al;"
+ "jne .slowpath;"
+ "pop %rdx;"
+ "ret;"
+ "nop;"
+ ".slowpath: "
+ "push %rsi;"
+ "movzbl %al,%esi;"
+ "call " PV_UNLOCK_SLOWPATH ";"
+ "pop %rsi;"
+ "pop %rdx;"
+ "ret;"
+ ".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
+ ".popsection");
+
+#else /* CONFIG_64BIT */
+
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);
+#endif /* CONFIG_64BIT */
#endif
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 1861287..56c3717 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -719,24 +719,21 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
}
/*
- * PV version of the unlock function to be used in stead of
- * queued_spin_unlock().
+ * PV version of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
*/
-__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+__visible void
+pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 lockval)
{
struct __qspinlock *l = (void *)lock;
struct pv_node *node, *next;
int i, nr_kick, cpus[PV_KICK_AHEAD_MAX];
- u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
/*
* We must not unlock if SLOW, because in that case we must first
* unhash. Otherwise it would be possible to have multiple @lock
* entries, which would be BAD.
*/
- if (likely(lockval == _Q_LOCKED_VAL))
- return;
-
if (unlikely(lockval != _Q_SLOW_VAL)) {
if (debug_locks_silent)
return;
@@ -786,12 +783,26 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
pv_kick(cpus[i]);
}
}
+
/*
* Include the architecture specific callee-save thunk of the
* __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() near the top of the file to make sure
- * that the callee-save thunk and the real unlock function are close
- * to each other sharing consecutive instruction cachelines.
+ * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
+ * function close to each other sharing consecutive instruction cachelines.
+ * Alternatively, an architecture specific version of __pv_queued_spin_unlock()
+ * can be defined.
*/
#include <asm/qspinlock_paravirt.h>
+#ifndef __pv_queued_spin_unlock
+__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+ u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+
+ if (likely(lockval == _Q_LOCKED_VAL))
+ return;
+
+ pv_queued_spin_unlock_slowpath(lock, lockval);
+}
+#endif /* __pv_queued_spin_unlock */
--
1.7.1