linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
@ 2015-04-30 10:53 Juergen Gross
  2015-04-30 10:53 ` [PATCH 1/6] x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG Juergen Gross
                   ` (9 more replies)
  0 siblings, 10 replies; 17+ messages in thread
From: Juergen Gross @ 2015-04-30 10:53 UTC (permalink / raw)
  To: linux-kernel, x86, hpa, tglx, mingo, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, jeremy, chrisw, akataria, rusty,
	virtualization, gleb, pbonzini, kvm
  Cc: Juergen Gross

Paravirtualized spinlocks produce some overhead even if the kernel is
running on bare metal. The main reason is the more complex locking
and unlocking functions. Especially unlocking is no longer just one
instruction but so complex that it is no longer inlined.

This patch series addresses this issue by adding two more pvops
functions to reduce the size of the inlined spinlock functions. When
running on bare metal unlocking is again basically one instruction.

Compile tested with CONFIG_PARAVIRT_SPINLOCKS on and off, 32 and 64
bits.

Functional testing on bare metal and as Xen dom0.

Correct patching verified by disassembly of active kernel.
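
To illustrate the effect, the inlined unlock site is expected to change
roughly as follows on a 64-bit bare metal kernel (an illustrative sketch
based on the patches below, not an actual disassembly; the pvops slot is
written in pseudo notation and the operand size depends on NR_CPUS):

  call   *pv_lock_ops.unlock    # before patching: indirect pvops call
  addb   $0x2,(%rdi)            # after patching: head += TICKET_LOCK_INC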

Juergen Gross (6):
  x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG
  x86: move decision about clearing slowpath flag into arch_spin_lock()
  x86: introduce new pvops function clear_slowpath
  x86: introduce new pvops function spin_unlock
  x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK
  x86: remove no longer needed paravirt_ticketlocks_enabled

 arch/x86/Kconfig                      |  1 -
 arch/x86/include/asm/paravirt.h       | 13 +++++++++
 arch/x86/include/asm/paravirt_types.h | 12 ++++++++
 arch/x86/include/asm/spinlock.h       | 53 ++++++++++++-----------------------
 arch/x86/include/asm/spinlock_types.h |  3 +-
 arch/x86/kernel/kvm.c                 | 14 +--------
 arch/x86/kernel/paravirt-spinlocks.c  | 42 +++++++++++++++++++++++++--
 arch/x86/kernel/paravirt.c            | 12 ++++++++
 arch/x86/kernel/paravirt_patch_32.c   | 25 +++++++++++++++++
 arch/x86/kernel/paravirt_patch_64.c   | 24 ++++++++++++++++
 arch/x86/xen/spinlock.c               | 23 +--------------
 include/linux/spinlock_api_smp.h      |  2 +-
 kernel/Kconfig.locks                  |  7 +++--
 kernel/Kconfig.preempt                |  3 +-
 kernel/locking/spinlock.c             |  2 +-
 lib/Kconfig.debug                     |  1 -
 16 files changed, 154 insertions(+), 83 deletions(-)

-- 
2.1.4


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/6] x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG
  2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
@ 2015-04-30 10:53 ` Juergen Gross
  2015-04-30 10:53 ` [PATCH 2/6] x86: move decision about clearing slowpath flag into arch_spin_lock() Juergen Gross
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Juergen Gross @ 2015-04-30 10:53 UTC (permalink / raw)
  To: linux-kernel, x86, hpa, tglx, mingo, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, jeremy, chrisw, akataria, rusty,
	virtualization, gleb, pbonzini, kvm
  Cc: Juergen Gross

For paravirtualized spinlocks, setting the "slowpath" flag in
__ticket_enter_slowpath() is done by setting bit "0" in
lock->tickets.head directly instead of using a macro.

Change this by defining an appropriate macro.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/include/asm/spinlock.h       | 3 ++-
 arch/x86/include/asm/spinlock_types.h | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index cf87de3..8ceec8d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -46,7 +46,8 @@ static __always_inline bool static_key_false(struct static_key *key);
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
 {
-	set_bit(0, (volatile unsigned long *)&lock->tickets.head);
+	set_bit(TICKET_SLOWPATH_BIT,
+		(volatile unsigned long *)&lock->tickets.head);
 }
 
 #else  /* !CONFIG_PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 5f9d757..579409a 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -4,8 +4,9 @@
 #include <linux/types.h>
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define TICKET_SLOWPATH_BIT	0
 #define __TICKET_LOCK_INC	2
-#define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
+#define TICKET_SLOWPATH_FLAG	((__ticket_t)(1 << TICKET_SLOWPATH_BIT))
 #else
 #define __TICKET_LOCK_INC	1
 #define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/6] x86: move decision about clearing slowpath flag into arch_spin_lock()
  2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
  2015-04-30 10:53 ` [PATCH 1/6] x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG Juergen Gross
@ 2015-04-30 10:53 ` Juergen Gross
  2015-04-30 10:54 ` [PATCH 3/6] x86: introduce new pvops function clear_slowpath Juergen Gross
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Juergen Gross @ 2015-04-30 10:53 UTC (permalink / raw)
  To: linux-kernel, x86, hpa, tglx, mingo, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, jeremy, chrisw, akataria, rusty,
	virtualization, gleb, pbonzini, kvm
  Cc: Juergen Gross

The decision whether the slowpath flag is to be cleared for
paravirtualized spinlocks is located in
__ticket_check_and_clear_slowpath() today.

Move that decision into arch_spin_lock() and annotate it with
unlikely() to avoid calling a function in case the compiler chooses
not to inline __ticket_check_and_clear_slowpath() and the slowpath
flag isn't set.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/include/asm/spinlock.h | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 8ceec8d..268b9da 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -66,20 +66,18 @@ static inline int  __tickets_equal(__ticket_t one, __ticket_t two)
 	return !((one ^ two) & ~TICKET_SLOWPATH_FLAG);
 }
 
-static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock,
-							__ticket_t head)
+static inline void __ticket_clear_slowpath(arch_spinlock_t *lock,
+					   __ticket_t head)
 {
-	if (head & TICKET_SLOWPATH_FLAG) {
-		arch_spinlock_t old, new;
+	arch_spinlock_t old, new;
 
-		old.tickets.head = head;
-		new.tickets.head = head & ~TICKET_SLOWPATH_FLAG;
-		old.tickets.tail = new.tickets.head + TICKET_LOCK_INC;
-		new.tickets.tail = old.tickets.tail;
+	old.tickets.head = head;
+	new.tickets.head = head & ~TICKET_SLOWPATH_FLAG;
+	old.tickets.tail = new.tickets.head + TICKET_LOCK_INC;
+	new.tickets.tail = old.tickets.tail;
 
-		/* try to clear slowpath flag when there are no contenders */
-		cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
-	}
+	/* try to clear slowpath flag when there are no contenders */
+	cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
 }
 
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
@@ -120,7 +118,8 @@ static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 		__ticket_lock_spinning(lock, inc.tail);
 	}
 clear_slowpath:
-	__ticket_check_and_clear_slowpath(lock, inc.head);
+	if (unlikely(inc.head & TICKET_SLOWPATH_FLAG))
+		__ticket_clear_slowpath(lock, inc.head);
 out:
 	barrier();	/* make sure nothing creeps before the lock is taken */
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/6] x86: introduce new pvops function clear_slowpath
  2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
  2015-04-30 10:53 ` [PATCH 1/6] x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG Juergen Gross
  2015-04-30 10:53 ` [PATCH 2/6] x86: move decision about clearing slowpath flag into arch_spin_lock() Juergen Gross
@ 2015-04-30 10:54 ` Juergen Gross
  2015-04-30 10:54 ` [PATCH 4/6] x86: introduce new pvops function spin_unlock Juergen Gross
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Juergen Gross @ 2015-04-30 10:54 UTC (permalink / raw)
  To: linux-kernel, x86, hpa, tglx, mingo, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, jeremy, chrisw, akataria, rusty,
	virtualization, gleb, pbonzini, kvm
  Cc: Juergen Gross

To speed up paravirtualized spinlock handling when running on bare
metal, introduce a new pvops function "clear_slowpath". It is a nop
when the kernel is running on bare metal.

As the clear_slowpath function is common to all users, add a new
initialization function to set the pvops function pointer, avoiding
the need to spread knowledge of which function to use.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/include/asm/paravirt.h       |  7 +++++++
 arch/x86/include/asm/paravirt_types.h |  1 +
 arch/x86/include/asm/spinlock.h       | 18 ++++--------------
 arch/x86/kernel/kvm.c                 |  2 ++
 arch/x86/kernel/paravirt-spinlocks.c  | 22 ++++++++++++++++++++++
 arch/x86/xen/spinlock.c               |  2 ++
 6 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 8957810..318f077 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -724,6 +724,13 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
 	PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
 
+static __always_inline void __ticket_clear_slowpath(arch_spinlock_t *lock,
+						    __ticket_t head)
+{
+	PVOP_VCALL2(pv_lock_ops.clear_slowpath, lock, head);
+}
+
+void pv_lock_activate(void);
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index f7b0b5c..3432713 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -336,6 +336,7 @@ typedef u16 __ticket_t;
 struct pv_lock_ops {
 	struct paravirt_callee_save lock_spinning;
 	void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+	void (*clear_slowpath)(arch_spinlock_t *lock, __ticket_t head);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 268b9da..ab76c3e 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -60,26 +60,16 @@ static inline void __ticket_unlock_kick(arch_spinlock_t *lock,
 {
 }
 
+static inline void __ticket_clear_slowpath(arch_spinlock_t *lock,
+					   __ticket_t head)
+{
+}
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
 static inline int  __tickets_equal(__ticket_t one, __ticket_t two)
 {
 	return !((one ^ two) & ~TICKET_SLOWPATH_FLAG);
 }
 
-static inline void __ticket_clear_slowpath(arch_spinlock_t *lock,
-					   __ticket_t head)
-{
-	arch_spinlock_t old, new;
-
-	old.tickets.head = head;
-	new.tickets.head = head & ~TICKET_SLOWPATH_FLAG;
-	old.tickets.tail = new.tickets.head + TICKET_LOCK_INC;
-	new.tickets.tail = old.tickets.tail;
-
-	/* try to clear slowpath flag when there are no contenders */
-	cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
-}
-
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return __tickets_equal(lock.tickets.head, lock.tickets.tail);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 9435620..c3b4b43 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -830,6 +830,8 @@ void __init kvm_spinlock_init(void)
 
 	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
 	pv_lock_ops.unlock_kick = kvm_unlock_kick;
+
+	pv_lock_activate();
 }
 
 static __init int kvm_spinlock_init_jump(void)
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index bbb6c73..5ece813 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -8,10 +8,32 @@
 
 #include <asm/paravirt.h>
 
+#ifdef CONFIG_SMP
+static void pv_ticket_clear_slowpath(arch_spinlock_t *lock, __ticket_t head)
+{
+	arch_spinlock_t old, new;
+
+	old.tickets.head = head;
+	new.tickets.head = head & ~TICKET_SLOWPATH_FLAG;
+	old.tickets.tail = new.tickets.head + TICKET_LOCK_INC;
+	new.tickets.tail = old.tickets.tail;
+
+	/* try to clear slowpath flag when there are no contenders */
+	cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
+}
+
+void pv_lock_activate(void)
+{
+	pv_lock_ops.clear_slowpath = pv_ticket_clear_slowpath;
+}
+EXPORT_SYMBOL_GPL(pv_lock_activate);
+#endif
+
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
 	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
 	.unlock_kick = paravirt_nop,
+	.clear_slowpath = paravirt_nop,
 #endif
 };
 EXPORT_SYMBOL(pv_lock_ops);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 956374c..988c895 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -282,6 +282,8 @@ void __init xen_init_spinlocks(void)
 	printk(KERN_DEBUG "xen: PV spinlocks enabled\n");
 	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
 	pv_lock_ops.unlock_kick = xen_unlock_kick;
+
+	pv_lock_activate();
 }
 
 /*
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/6] x86: introduce new pvops function spin_unlock
  2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
                   ` (2 preceding siblings ...)
  2015-04-30 10:54 ` [PATCH 3/6] x86: introduce new pvops function clear_slowpath Juergen Gross
@ 2015-04-30 10:54 ` Juergen Gross
  2015-04-30 10:54 ` [PATCH 5/6] x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK Juergen Gross
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Juergen Gross @ 2015-04-30 10:54 UTC (permalink / raw)
  To: linux-kernel, x86, hpa, tglx, mingo, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, jeremy, chrisw, akataria, rusty,
	virtualization, gleb, pbonzini, kvm
  Cc: Juergen Gross

To speed up paravirtualized spinlock handling when running on bare
metal, introduce a new pvops function "spin_unlock". It boils down to
a simple add instruction (possibly with lock prefix) when the kernel
is running on bare metal.

As the patched instruction includes a lock prefix in some
configurations, annotate this location as subject to lock prefix
patching. This works even if paravirtualization requires a call
instead of the unlock instruction, because the lock prefix or its
replacement is patched only if the correct counterpart is found at the
location to be patched.

We do not even depend on the order of lock prefix and
paravirtualization patching, as the sample instruction is subject to
lock prefix patching as well.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/include/asm/paravirt.h       |  6 ++++++
 arch/x86/include/asm/paravirt_types.h | 11 +++++++++++
 arch/x86/include/asm/spinlock.h       | 24 ++++++++++--------------
 arch/x86/kernel/paravirt-spinlocks.c  | 16 ++++++++++++++++
 arch/x86/kernel/paravirt.c            | 12 ++++++++++++
 arch/x86/kernel/paravirt_patch_32.c   | 25 +++++++++++++++++++++++++
 arch/x86/kernel/paravirt_patch_64.c   | 24 ++++++++++++++++++++++++
 7 files changed, 104 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 318f077..2f39129 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -730,6 +730,11 @@ static __always_inline void __ticket_clear_slowpath(arch_spinlock_t *lock,
 	PVOP_VCALL2(pv_lock_ops.clear_slowpath, lock, head);
 }
 
+static __always_inline void __ticket_unlock(arch_spinlock_t *lock)
+{
+	PVOP_VCALL1_LOCK(pv_lock_ops.unlock, lock);
+}
+
 void pv_lock_activate(void);
 #endif
 
@@ -843,6 +848,7 @@ static inline notrace unsigned long arch_local_irq_save(void)
 #undef PVOP_VCALL0
 #undef PVOP_CALL0
 #undef PVOP_VCALL1
+#undef PVOP_VCALL1_LOCK
 #undef PVOP_CALL1
 #undef PVOP_VCALL2
 #undef PVOP_CALL2
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3432713..a26af74 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -337,6 +337,7 @@ struct pv_lock_ops {
 	struct paravirt_callee_save lock_spinning;
 	void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
 	void (*clear_slowpath)(arch_spinlock_t *lock, __ticket_t head);
+	void (*unlock)(arch_spinlock_t *lock);
 };
 
 /* This contains all the paravirt structures: we get a convenient
@@ -398,6 +399,7 @@ extern struct pv_lock_ops pv_lock_ops;
 unsigned paravirt_patch_nop(void);
 unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len);
 unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len);
+unsigned paravirt_patch_unlock(void *insnbuf, unsigned len);
 unsigned paravirt_patch_ignore(unsigned len);
 unsigned paravirt_patch_call(void *insnbuf,
 			     const void *target, u16 tgt_clobbers,
@@ -620,6 +622,12 @@ int paravirt_disable_iospace(void);
 	__PVOP_CALL(rettype, op, "", "", PVOP_CALL_ARG1(arg1))
 #define PVOP_VCALL1(op, arg1)						\
 	__PVOP_VCALL(op, "", "", PVOP_CALL_ARG1(arg1))
+/*
+ * Using LOCK_PREFIX_HERE works because the lock prefix or its replacement
+ * is checked to be present before being replaced.
+ */
+#define PVOP_VCALL1_LOCK(op, arg1)					\
+	__PVOP_VCALL(op, LOCK_PREFIX_HERE, "", PVOP_CALL_ARG1(arg1))
 
 #define PVOP_CALLEE1(rettype, op, arg1)					\
 	__PVOP_CALLEESAVE(rettype, op, "", "", PVOP_CALL_ARG1(arg1))
@@ -690,6 +698,9 @@ void paravirt_flush_lazy_mmu(void);
 void _paravirt_nop(void);
 u32 _paravirt_ident_32(u32);
 u64 _paravirt_ident_64(u64);
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void _paravirt_native_ticket_unlock(arch_spinlock_t *lock);
+#endif
 
 #define paravirt_nop	((void *)_paravirt_nop)
 
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index ab76c3e..40a1091 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+static inline void ___ticket_unlock(arch_spinlock_t *lock)
+{
+	__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
+}
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -64,6 +68,11 @@ static inline void __ticket_clear_slowpath(arch_spinlock_t *lock,
 					   __ticket_t head)
 {
 }
+
+static inline void __ticket_unlock(arch_spinlock_t *lock)
+{
+	___ticket_unlock(lock);
+}
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
 static inline int  __tickets_equal(__ticket_t one, __ticket_t two)
 {
@@ -131,20 +140,7 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
-	if (TICKET_SLOWPATH_FLAG &&
-		static_key_false(&paravirt_ticketlocks_enabled)) {
-		__ticket_t head;
-
-		BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
-
-		head = xadd(&lock->tickets.head, TICKET_LOCK_INC);
-
-		if (unlikely(head & TICKET_SLOWPATH_FLAG)) {
-			head &= ~TICKET_SLOWPATH_FLAG;
-			__ticket_unlock_kick(lock, (head + TICKET_LOCK_INC));
-		}
-	} else
-		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
+	__ticket_unlock(lock);
 }
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 5ece813..91273fb 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -22,9 +22,24 @@ static void pv_ticket_clear_slowpath(arch_spinlock_t *lock, __ticket_t head)
 	cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
 }
 
+static void pv_ticket_unlock(arch_spinlock_t *lock)
+{
+	__ticket_t head;
+
+	BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
+
+	head = xadd(&lock->tickets.head, TICKET_LOCK_INC);
+
+	if (unlikely(head & TICKET_SLOWPATH_FLAG)) {
+		head &= ~TICKET_SLOWPATH_FLAG;
+		__ticket_unlock_kick(lock, (head + TICKET_LOCK_INC));
+	}
+}
+
 void pv_lock_activate(void)
 {
 	pv_lock_ops.clear_slowpath = pv_ticket_clear_slowpath;
+	pv_lock_ops.unlock = pv_ticket_unlock;
 }
 EXPORT_SYMBOL_GPL(pv_lock_activate);
 #endif
@@ -34,6 +49,7 @@ struct pv_lock_ops pv_lock_ops = {
 	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
 	.unlock_kick = paravirt_nop,
 	.clear_slowpath = paravirt_nop,
+	.unlock = _paravirt_native_ticket_unlock,
 #endif
 };
 EXPORT_SYMBOL(pv_lock_ops);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c614dd4..5c5e9f1 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -57,6 +57,13 @@ u64 _paravirt_ident_64(u64 x)
 	return x;
 }
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void _paravirt_native_ticket_unlock(arch_spinlock_t *lock)
+{
+	___ticket_unlock(lock);
+}
+#endif
+
 void __init default_banner(void)
 {
 	printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
@@ -153,6 +160,11 @@ unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
 	else if (opfunc == _paravirt_ident_64)
 		ret = paravirt_patch_ident_64(insnbuf, len);
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+	else if (opfunc == _paravirt_native_ticket_unlock)
+		ret = paravirt_patch_unlock(insnbuf, len);
+#endif
+
 	else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) ||
diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
index d9f32e6..628c23b 100644
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -1,4 +1,6 @@
 #include <asm/paravirt.h>
+#include <asm/spinlock.h>
+#include <linux/stringify.h>
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
@@ -12,6 +14,14 @@ DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
 DEF_NATIVE(pv_cpu_ops, clts, "clts");
 DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
 
+DEF_NATIVE(, unlock1, UNLOCK_LOCK_PREFIX
+		      "addb $"__stringify(__TICKET_LOCK_INC)", (%eax)");
+DEF_NATIVE(, unlock2, UNLOCK_LOCK_PREFIX
+		      "addw $"__stringify(__TICKET_LOCK_INC)", (%eax)");
+
+extern void __unlock_wrong_size(void)
+	__compiletime_error("Bad argument size for unlock");
+
 unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
 {
 	/* arg in %eax, return in %eax */
@@ -24,6 +34,21 @@ unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
 	return 0;
 }
 
+unsigned paravirt_patch_unlock(void *insnbuf, unsigned len)
+{
+	switch (sizeof(__ticket_t)) {
+	case __X86_CASE_B:
+		return paravirt_patch_insns(insnbuf, len,
+					    start__unlock1, end__unlock1);
+	case __X86_CASE_W:
+		return paravirt_patch_insns(insnbuf, len,
+					    start__unlock2, end__unlock2);
+	default:
+		__unlock_wrong_size();
+	}
+	return 0;
+}
+
 unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 		      unsigned long addr, unsigned len)
 {
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
index a1da673..dc4d9af 100644
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -1,5 +1,6 @@
 #include <asm/paravirt.h>
 #include <asm/asm-offsets.h>
+#include <asm/spinlock.h>
 #include <linux/stringify.h>
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
@@ -21,6 +22,14 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
 DEF_NATIVE(, mov32, "mov %edi, %eax");
 DEF_NATIVE(, mov64, "mov %rdi, %rax");
 
+DEF_NATIVE(, unlock1, UNLOCK_LOCK_PREFIX
+			"addb $"__stringify(__TICKET_LOCK_INC)", (%rdi)");
+DEF_NATIVE(, unlock2, UNLOCK_LOCK_PREFIX
+			"addw $"__stringify(__TICKET_LOCK_INC)", (%rdi)");
+
+extern void __unlock_wrong_size(void)
+	__compiletime_error("Bad argument size for unlock");
+
 unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
 {
 	return paravirt_patch_insns(insnbuf, len,
@@ -33,6 +42,21 @@ unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
 				    start__mov64, end__mov64);
 }
 
+unsigned paravirt_patch_unlock(void *insnbuf, unsigned len)
+{
+	switch (sizeof(__ticket_t)) {
+	case __X86_CASE_B:
+		return paravirt_patch_insns(insnbuf, len,
+					    start__unlock1, end__unlock1);
+	case __X86_CASE_W:
+		return paravirt_patch_insns(insnbuf, len,
+					    start__unlock2, end__unlock2);
+	default:
+		__unlock_wrong_size();
+	}
+	return 0;
+}
+
 unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 		      unsigned long addr, unsigned len)
 {
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/6] x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK
  2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
                   ` (3 preceding siblings ...)
  2015-04-30 10:54 ` [PATCH 4/6] x86: introduce new pvops function spin_unlock Juergen Gross
@ 2015-04-30 10:54 ` Juergen Gross
  2015-04-30 10:54 ` [PATCH 6/6] x86: remove no longer needed paravirt_ticketlocks_enabled Juergen Gross
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Juergen Gross @ 2015-04-30 10:54 UTC (permalink / raw)
  To: linux-kernel, x86, hpa, tglx, mingo, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, jeremy, chrisw, akataria, rusty,
	virtualization, gleb, pbonzini, kvm
  Cc: Juergen Gross

There is no longer any need to treat _raw_spin_unlock() specially with
regard to inlining compared to the other spinlock functions. Just treat
it like all the other spinlock functions.

Stop selecting UNINLINE_SPIN_UNLOCK in case of PARAVIRT_SPINLOCKS.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/Kconfig                 | 1 -
 include/linux/spinlock_api_smp.h | 2 +-
 kernel/Kconfig.locks             | 7 ++++---
 kernel/Kconfig.preempt           | 3 +--
 kernel/locking/spinlock.c        | 2 +-
 lib/Kconfig.debug                | 1 -
 6 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d569..4f85c7e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -666,7 +666,6 @@ config PARAVIRT_DEBUG
 config PARAVIRT_SPINLOCKS
 	bool "Paravirtualization layer for spinlocks"
 	depends on PARAVIRT && SMP
-	select UNINLINE_SPIN_UNLOCK
 	---help---
 	  Paravirtualized spinlocks allow a pvops backend to replace the
 	  spinlock implementation with something virtualization-friendly
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 5344268..839a804 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -69,7 +69,7 @@ _raw_spin_unlock_irqrestore(raw_spinlock_t *lock, unsigned long flags)
 #define _raw_spin_trylock_bh(lock) __raw_spin_trylock_bh(lock)
 #endif
 
-#ifndef CONFIG_UNINLINE_SPIN_UNLOCK
+#ifdef CONFIG_INLINE_SPIN_UNLOCK
 #define _raw_spin_unlock(lock) __raw_spin_unlock(lock)
 #endif
 
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 08561f1..9cc5f72 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -87,9 +87,6 @@ config ARCH_INLINE_WRITE_UNLOCK_IRQ
 config ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
 	bool
 
-config UNINLINE_SPIN_UNLOCK
-	bool
-
 #
 # lock_* functions are inlined when:
 #   - DEBUG_SPINLOCK=n and GENERIC_LOCKBREAK=n and ARCH_INLINE_*LOCK=y
@@ -132,6 +129,10 @@ config INLINE_SPIN_LOCK_IRQSAVE
 	def_bool y
 	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_SPIN_LOCK_IRQSAVE
 
+config INLINE_SPIN_UNLOCK
+	def_bool y
+	depends on !PREEMPT || ARCH_INLINE_SPIN_UNLOCK
+
 config INLINE_SPIN_UNLOCK_BH
 	def_bool y
 	depends on ARCH_INLINE_SPIN_UNLOCK_BH
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 3f9c974..6aca8987 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -36,7 +36,6 @@ config PREEMPT_VOLUNTARY
 config PREEMPT
 	bool "Preemptible Kernel (Low-Latency Desktop)"
 	select PREEMPT_COUNT
-	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
 	help
 	  This option reduces the latency of the kernel by making
 	  all kernel code (that is not executing in a critical section)
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index db3ccb1..0e2b531 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -177,7 +177,7 @@ void __lockfunc _raw_spin_lock_bh(raw_spinlock_t *lock)
 EXPORT_SYMBOL(_raw_spin_lock_bh);
 #endif
 
-#ifdef CONFIG_UNINLINE_SPIN_UNLOCK
+#ifndef CONFIG_INLINE_SPIN_UNLOCK
 void __lockfunc _raw_spin_unlock(raw_spinlock_t *lock)
 {
 	__raw_spin_unlock(lock);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1767057..0b4cc3c 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -920,7 +920,6 @@ config RT_MUTEX_TESTER
 config DEBUG_SPINLOCK
 	bool "Spinlock and rw-lock debugging: basic checks"
 	depends on DEBUG_KERNEL
-	select UNINLINE_SPIN_UNLOCK
 	help
 	  Say Y here and build SMP to catch missing spinlock initialization
 	  and certain other kinds of spinlock errors commonly made.  This is
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 6/6] x86: remove no longer needed paravirt_ticketlocks_enabled
  2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
                   ` (4 preceding siblings ...)
  2015-04-30 10:54 ` [PATCH 5/6] x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK Juergen Gross
@ 2015-04-30 10:54 ` Juergen Gross
  2015-04-30 16:39 ` [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Jeremy Fitzhardinge
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Juergen Gross @ 2015-04-30 10:54 UTC (permalink / raw)
  To: linux-kernel, x86, hpa, tglx, mingo, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, jeremy, chrisw, akataria, rusty,
	virtualization, gleb, pbonzini, kvm
  Cc: Juergen Gross

With the paravirtualized spinlock unlock function now being a pvops
function, paravirt_ticketlocks_enabled is no longer needed. Remove it.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/include/asm/spinlock.h      |  3 ---
 arch/x86/kernel/kvm.c                | 14 --------------
 arch/x86/kernel/paravirt-spinlocks.c |  4 +---
 arch/x86/xen/spinlock.c              | 23 -----------------------
 4 files changed, 1 insertion(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 40a1091..2ac8118 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -39,9 +39,6 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD	(1 << 15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
-static __always_inline bool static_key_false(struct static_key *key);
-
 static inline void ___ticket_unlock(arch_spinlock_t *lock)
 {
 	__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c3b4b43..27d815a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -834,18 +834,4 @@ void __init kvm_spinlock_init(void)
 	pv_lock_activate();
 }
 
-static __init int kvm_spinlock_init_jump(void)
-{
-	if (!kvm_para_available())
-		return 0;
-	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
-		return 0;
-
-	static_key_slow_inc(&paravirt_ticketlocks_enabled);
-	printk(KERN_INFO "KVM setup paravirtual spinlock\n");
-
-	return 0;
-}
-early_initcall(kvm_spinlock_init_jump);
-
 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 91273fb..6b5f33c 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -40,6 +40,7 @@ void pv_lock_activate(void)
 {
 	pv_lock_ops.clear_slowpath = pv_ticket_clear_slowpath;
 	pv_lock_ops.unlock = pv_ticket_unlock;
+	pr_info("paravirtual spinlocks activated\n");
 }
 EXPORT_SYMBOL_GPL(pv_lock_activate);
 #endif
@@ -53,6 +54,3 @@ struct pv_lock_ops pv_lock_ops = {
 #endif
 };
 EXPORT_SYMBOL(pv_lock_ops);
-
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 988c895..a125d67 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -265,10 +265,6 @@ void xen_uninit_lock_cpu(int cpu)
 
 
 /*
- * Our init of PV spinlocks is split in two init functions due to us
- * using paravirt patching and jump labels patching and having to do
- * all of this before SMP code is invoked.
- *
  * The paravirt patching needs to be done _before_ the alternative asm code
  * is started, otherwise we would not patch the core kernel code.
  */
@@ -286,25 +282,6 @@ void __init xen_init_spinlocks(void)
 	pv_lock_activate();
 }
 
-/*
- * While the jump_label init code needs to happend _after_ the jump labels are
- * enabled and before SMP is started. Hence we use pre-SMP initcall level
- * init. We cannot do it in xen_init_spinlocks as that is done before
- * jump labels are activated.
- */
-static __init int xen_init_spinlocks_jump(void)
-{
-	if (!xen_pvspin)
-		return 0;
-
-	if (!xen_domain())
-		return 0;
-
-	static_key_slow_inc(&paravirt_ticketlocks_enabled);
-	return 0;
-}
-early_initcall(xen_init_spinlocks_jump);
-
 static __init int xen_parse_nopvspin(char *arg)
 {
 	xen_pvspin = false;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
  2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
                   ` (5 preceding siblings ...)
  2015-04-30 10:54 ` [PATCH 6/6] x86: remove no longer needed paravirt_ticketlocks_enabled Juergen Gross
@ 2015-04-30 16:39 ` Jeremy Fitzhardinge
  2015-05-04  5:55   ` Juergen Gross
  2015-05-15 12:16 ` Juergen Gross
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Jeremy Fitzhardinge @ 2015-04-30 16:39 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, x86, hpa, tglx, mingo, xen-devel,
	konrad.wilk, david.vrabel, boris.ostrovsky, chrisw, akataria,
	rusty, virtualization, gleb, pbonzini, kvm

On 04/30/2015 03:53 AM, Juergen Gross wrote:
> Paravirtualized spinlocks produce some overhead even if the kernel is
> running on bare metal. The main reason is the more complex locking
> and unlocking functions. Especially unlocking is no longer just one
> instruction but so complex that it is no longer inlined.
>
> This patch series addresses this issue by adding two more pvops
> functions to reduce the size of the inlined spinlock functions. When
> running on bare metal unlocking is again basically one instruction.

Out of curiosity, is there a measurable difference?

    J


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
  2015-04-30 16:39 ` [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Jeremy Fitzhardinge
@ 2015-05-04  5:55   ` Juergen Gross
  2015-05-05 17:21     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 17+ messages in thread
From: Juergen Gross @ 2015-05-04  5:55 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, linux-kernel, x86, hpa, tglx, mingo,
	xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky, chrisw,
	akataria, rusty, virtualization, gleb, pbonzini, kvm

On 04/30/2015 06:39 PM, Jeremy Fitzhardinge wrote:
> On 04/30/2015 03:53 AM, Juergen Gross wrote:
>> Paravirtualized spinlocks produce some overhead even if the kernel is
>> running on bare metal. The main reason is the more complex locking
>> and unlocking functions. Especially unlocking is no longer just one
>> instruction but so complex that it is no longer inlined.
>>
>> This patch series addresses this issue by adding two more pvops
>> functions to reduce the size of the inlined spinlock functions. When
>> running on bare metal unlocking is again basically one instruction.
>
> Out of curiosity, is there a measurable difference?

I did a small measurement of the pure locking functions on bare metal
without and with my patches.

spin_lock() for the first time (lock and code not in cache) dropped from
about 600 to 500 cycles.

spin_unlock() for first time dropped from 145 to 87 cycles.

spin_lock() in a loop dropped from 48 to 45 cycles.

spin_unlock() in the same loop dropped from 24 to 22 cycles.
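
For reference, a minimal sketch of how such a measurement can be done
from a test module. This is an illustrative assumption, not the code
actually used; it only times a single uncontended lock/unlock pair with
get_cycles():

/* sketch only: time one uncontended spin_lock()/spin_unlock() pair */
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/timex.h>

static DEFINE_SPINLOCK(test_lock);

static int __init lock_cycles_init(void)
{
	cycles_t t0, t1, t2;

	preempt_disable();
	t0 = get_cycles();
	spin_lock(&test_lock);
	t1 = get_cycles();
	spin_unlock(&test_lock);
	t2 = get_cycles();
	preempt_enable();

	pr_info("spin_lock: %llu cycles, spin_unlock: %llu cycles\n",
		(unsigned long long)(t1 - t0),
		(unsigned long long)(t2 - t1));
	return 0;
}

static void __exit lock_cycles_exit(void)
{
}

module_init(lock_cycles_init);
module_exit(lock_cycles_exit);
MODULE_LICENSE("GPL");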


Juergen

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
  2015-05-04  5:55   ` Juergen Gross
@ 2015-05-05 17:21     ` Jeremy Fitzhardinge
  2015-05-06 11:55       ` Juergen Gross
  0 siblings, 1 reply; 17+ messages in thread
From: Jeremy Fitzhardinge @ 2015-05-05 17:21 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, x86, hpa, tglx, mingo, xen-devel,
	konrad.wilk, david.vrabel, boris.ostrovsky, chrisw, akataria,
	rusty, virtualization, gleb, pbonzini, kvm

On 05/03/2015 10:55 PM, Juergen Gross wrote:
> I did a small measurement of the pure locking functions on bare metal
> without and with my patches.
>
> spin_lock() for the first time (lock and code not in cache) dropped from
> about 600 to 500 cycles.
>
> spin_unlock() for first time dropped from 145 to 87 cycles.
>
> spin_lock() in a loop dropped from 48 to 45 cycles.
>
> spin_unlock() in the same loop dropped from 24 to 22 cycles.

Did you isolate icache hot/cold from dcache hot/cold? It seems to me the
main difference will be whether the branch predictor is warmed up rather
than whether the lock itself is in dcache, but it's much more likely that
the lock code is in icache if the code is lock intensive, making the cold
case moot. But that's pure speculation.

Could you see any differences in workloads beyond microbenchmarks?

Not that it's my call at all, but I think we'd need to see some concrete
improvements in real workloads before adding the complexity of more pvops.

    J

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
  2015-05-05 17:21     ` Jeremy Fitzhardinge
@ 2015-05-06 11:55       ` Juergen Gross
  2015-05-17  5:30         ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Juergen Gross @ 2015-05-06 11:55 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, linux-kernel, x86, hpa, tglx, mingo,
	xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky, chrisw,
	akataria, rusty, virtualization, gleb, pbonzini, kvm

On 05/05/2015 07:21 PM, Jeremy Fitzhardinge wrote:
> On 05/03/2015 10:55 PM, Juergen Gross wrote:
>> I did a small measurement of the pure locking functions on bare metal
>> without and with my patches.
>>
>> spin_lock() for the first time (lock and code not in cache) dropped from
>> about 600 to 500 cycles.
>>
>> spin_unlock() for first time dropped from 145 to 87 cycles.
>>
>> spin_lock() in a loop dropped from 48 to 45 cycles.
>>
>> spin_unlock() in the same loop dropped from 24 to 22 cycles.
>
> Did you isolate icache hot/cold from dcache hot/cold? It seems to me the
> main difference will be whether the branch predictor is warmed up rather
> than whether the lock itself is in dcache, but it's much more likely that
> the lock code is in icache if the code is lock intensive, making the cold
> case moot. But that's pure speculation.
>
> Could you see any differences in workloads beyond microbenchmarks?
>
> Not that it's my call at all, but I think we'd need to see some concrete
> improvements in real workloads before adding the complexity of more pvops.

I did another test on a larger machine:

25 kernel builds (time make -j 32) on a 32 core machine. Before each
build "make clean" was called, the first result after boot was omitted
to avoid disk cache warmup effects.

System time without my patches: 861.5664 +/- 3.3665 s
                with my patches: 852.2269 +/- 3.6629 s
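
A sketch of the measurement loop (assuming GNU time is used to capture
system time; this is not the exact script that was run):

for i in $(seq 0 25); do
        make clean > /dev/null
        /usr/bin/time -f "%S" -a -o systime.log make -j 32 > /dev/null
done
# the first of the 26 values is dropped (disk cache warmup), mean and
# standard deviation are computed over the remaining 25 results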


Juergen

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
  2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
                   ` (6 preceding siblings ...)
  2015-04-30 16:39 ` [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Jeremy Fitzhardinge
@ 2015-05-15 12:16 ` Juergen Gross
  2015-06-08  4:09 ` Juergen Gross
  2015-06-16 14:37 ` Juergen Gross
  9 siblings, 0 replies; 17+ messages in thread
From: Juergen Gross @ 2015-05-15 12:16 UTC (permalink / raw)
  To: linux-kernel, x86, hpa, tglx, mingo, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, jeremy, chrisw, akataria, rusty,
	virtualization, gleb, pbonzini, kvm

Ping?

On 04/30/2015 12:53 PM, Juergen Gross wrote:
> Paravirtualized spinlocks produce some overhead even if the kernel is
> running on bare metal. The main reason is the more complex locking
> and unlocking functions. Especially unlocking is no longer just one
> instruction but so complex that it is no longer inlined.
>
> This patch series addresses this issue by adding two more pvops
> functions to reduce the size of the inlined spinlock functions. When
> running on bare metal unlocking is again basically one instruction.
>
> Compile tested with CONFIG_PARAVIRT_SPINLOCKS on and off, 32 and 64
> bits.
>
> Functional testing on bare metal and as Xen dom0.
>
> Correct patching verified by disassembly of active kernel.
>
> Juergen Gross (6):
>    x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG
>    x86: move decision about clearing slowpath flag into arch_spin_lock()
>    x86: introduce new pvops function clear_slowpath
>    x86: introduce new pvops function spin_unlock
>    x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK
>    x86: remove no longer needed paravirt_ticketlocks_enabled
>
>   arch/x86/Kconfig                      |  1 -
>   arch/x86/include/asm/paravirt.h       | 13 +++++++++
>   arch/x86/include/asm/paravirt_types.h | 12 ++++++++
>   arch/x86/include/asm/spinlock.h       | 53 ++++++++++++-----------------------
>   arch/x86/include/asm/spinlock_types.h |  3 +-
>   arch/x86/kernel/kvm.c                 | 14 +--------
>   arch/x86/kernel/paravirt-spinlocks.c  | 42 +++++++++++++++++++++++++--
>   arch/x86/kernel/paravirt.c            | 12 ++++++++
>   arch/x86/kernel/paravirt_patch_32.c   | 25 +++++++++++++++++
>   arch/x86/kernel/paravirt_patch_64.c   | 24 ++++++++++++++++
>   arch/x86/xen/spinlock.c               | 23 +--------------
>   include/linux/spinlock_api_smp.h      |  2 +-
>   kernel/Kconfig.locks                  |  7 +++--
>   kernel/Kconfig.preempt                |  3 +-
>   kernel/locking/spinlock.c             |  2 +-
>   lib/Kconfig.debug                     |  1 -
>   16 files changed, 154 insertions(+), 83 deletions(-)
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
  2015-05-06 11:55       ` Juergen Gross
@ 2015-05-17  5:30         ` Ingo Molnar
  2015-05-18  8:11           ` Juergen Gross
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2015-05-17  5:30 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Jeremy Fitzhardinge, linux-kernel, x86, hpa, tglx, mingo,
	xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky, chrisw,
	akataria, rusty, virtualization, gleb, pbonzini, kvm


* Juergen Gross <jgross@suse.com> wrote:

> On 05/05/2015 07:21 PM, Jeremy Fitzhardinge wrote:
> >On 05/03/2015 10:55 PM, Juergen Gross wrote:
> >>I did a small measurement of the pure locking functions on bare metal
> >>without and with my patches.
> >>
> >>spin_lock() for the first time (lock and code not in cache) dropped from
> >>about 600 to 500 cycles.
> >>
> >>spin_unlock() for first time dropped from 145 to 87 cycles.
> >>
> >>spin_lock() in a loop dropped from 48 to 45 cycles.
> >>
> >>spin_unlock() in the same loop dropped from 24 to 22 cycles.
> >
> >Did you isolate icache hot/cold from dcache hot/cold? It seems to me the
> >main difference will be whether the branch predictor is warmed up rather
> >than whether the lock itself is in dcache, but it's much more likely that
> >the lock code is in icache if the code is lock intensive, making the cold
> >case moot. But that's pure speculation.
> >
> >Could you see any differences in workloads beyond microbenchmarks?
> >
> >Not that it's my call at all, but I think we'd need to see some concrete
> >improvements in real workloads before adding the complexity of more pvops.
> 
> I did another test on a larger machine:
> 
> 25 kernel builds (time make -j 32) on a 32 core machine. Before each
> build "make clean" was called, the first result after boot was omitted
> to avoid disk cache warmup effects.
> 
> System time without my patches: 861.5664 +/- 3.3665 s
>                with my patches: 852.2269 +/- 3.6629 s

So how does the profile look like in the guest, before/after the PV 
spinlock patches? I'm a bit surprised to see so much spinlock 
overhead.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
  2015-05-17  5:30         ` Ingo Molnar
@ 2015-05-18  8:11           ` Juergen Gross
  0 siblings, 0 replies; 17+ messages in thread
From: Juergen Gross @ 2015-05-18  8:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, linux-kernel, x86, hpa, tglx, mingo,
	xen-devel, konrad.wilk, david.vrabel, boris.ostrovsky, chrisw,
	akataria, rusty, virtualization, gleb, pbonzini, kvm

On 05/17/2015 07:30 AM, Ingo Molnar wrote:
>
> * Juergen Gross <jgross@suse.com> wrote:
>
>> On 05/05/2015 07:21 PM, Jeremy Fitzhardinge wrote:
>>> On 05/03/2015 10:55 PM, Juergen Gross wrote:
>>>> I did a small measurement of the pure locking functions on bare metal
>>>> without and with my patches.
>>>>
>>>> spin_lock() for the first time (lock and code not in cache) dropped from
>>>> about 600 to 500 cycles.
>>>>
>>>> spin_unlock() for first time dropped from 145 to 87 cycles.
>>>>
>>>> spin_lock() in a loop dropped from 48 to 45 cycles.
>>>>
>>>> spin_unlock() in the same loop dropped from 24 to 22 cycles.
>>>
>>> Did you isolate icache hot/cold from dcache hot/cold? It seems to me the
>>> main difference will be whether the branch predictor is warmed up rather
>>> than whether the lock itself is in dcache, but it's much more likely that
>>> the lock code is in icache if the code is lock intensive, making the cold
>>> case moot. But that's pure speculation.
>>>
>>> Could you see any differences in workloads beyond microbenchmarks?
>>>
>>> Not that it's my call at all, but I think we'd need to see some concrete
>>> improvements in real workloads before adding the complexity of more pvops.
>>
>> I did another test on a larger machine:
>>
>> 25 kernel builds (time make -j 32) on a 32 core machine. Before each
>> build "make clean" was called, the first result after boot was omitted
>> to avoid disk cache warmup effects.
>>
>> System time without my patches: 861.5664 +/- 3.3665 s
>>                 with my patches: 852.2269 +/- 3.6629 s
>
> So how does the profile look like in the guest, before/after the PV
> spinlock patches? I'm a bit surprised to see so much spinlock
> overhead.

I did another test in Xen dom0:

System time without my patches: 2903 +/- 2 s
                with my patches: 2904 +/- 2 s

BTW, this was what I expected: There should be no significant change in
system time, as the only real difference between both variants in a
guest is an additional 2-byte nop in the inlined unlock function call,
another one in the lock call, and one fewer jmp instruction in the lock
call.

What I didn't expect was the huge performance difference between native
and guest. The configuration used (32 cores with hyperthreads enabled)
surely is one reason for the difference, but it still seems to be too
much. I double-checked the results on bare metal; they are still more
or less the same (I did only one kernel build, resulting in 862 seconds
system time). There seems to be a lot of room for improvement, but
this is another story.

Regarding spinlock overhead: I think the reason I saw about 1% less
system time with my patches was mainly due to fewer cache misses.
Inlining the unlock function avoided an additional instruction cache
miss for it. KT Raghavendra did some benchmarks with only small user
programs and high kernel load, which showed nearly no effect at all.

Additionally I've compared the two kernels using bloat-o-meter:

add/remove: 11/13 grow/shrink: 654/603 up/down: 6046/-31754 (-25708)

with some hot path functions going down in size quite nice, e.g.:

__raw_spin_unlock_irq                        336      90    -246
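
(The numbers above come from the kernel's scripts/bloat-o-meter; a
comparison like this can be reproduced with something like
"./scripts/bloat-o-meter vmlinux.old vmlinux.new", where the vmlinux
file names are just placeholders for the builds without and with the
patches.)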


Juergen

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
  2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
                   ` (7 preceding siblings ...)
  2015-05-15 12:16 ` Juergen Gross
@ 2015-06-08  4:09 ` Juergen Gross
  2015-06-16 14:37 ` Juergen Gross
  9 siblings, 0 replies; 17+ messages in thread
From: Juergen Gross @ 2015-06-08  4:09 UTC (permalink / raw)
  To: linux-kernel, x86, hpa, tglx, mingo, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, jeremy, chrisw, akataria, rusty,
	virtualization, gleb, pbonzini, kvm

Ping?

Anything missing from my side?

On 04/30/2015 12:53 PM, Juergen Gross wrote:
> Paravirtualized spinlocks produce some overhead even if the kernel is
> running on bare metal. The main reason is the more complex locking
> and unlocking functions. Especially unlocking is no longer just one
> instruction but so complex that it is no longer inlined.
>
> This patch series addresses this issue by adding two more pvops
> functions to reduce the size of the inlined spinlock functions. When
> running on bare metal unlocking is again basically one instruction.
>
> Compile tested with CONFIG_PARAVIRT_SPINLOCKS on and off, 32 and 64
> bits.
>
> Functional testing on bare metal and as Xen dom0.
>
> Correct patching verified by disassembly of active kernel.
>
> Juergen Gross (6):
>    x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG
>    x86: move decision about clearing slowpath flag into arch_spin_lock()
>    x86: introduce new pvops function clear_slowpath
>    x86: introduce new pvops function spin_unlock
>    x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK
>    x86: remove no longer needed paravirt_ticketlocks_enabled
>
>   arch/x86/Kconfig                      |  1 -
>   arch/x86/include/asm/paravirt.h       | 13 +++++++++
>   arch/x86/include/asm/paravirt_types.h | 12 ++++++++
>   arch/x86/include/asm/spinlock.h       | 53 ++++++++++++-----------------------
>   arch/x86/include/asm/spinlock_types.h |  3 +-
>   arch/x86/kernel/kvm.c                 | 14 +--------
>   arch/x86/kernel/paravirt-spinlocks.c  | 42 +++++++++++++++++++++++++--
>   arch/x86/kernel/paravirt.c            | 12 ++++++++
>   arch/x86/kernel/paravirt_patch_32.c   | 25 +++++++++++++++++
>   arch/x86/kernel/paravirt_patch_64.c   | 24 ++++++++++++++++
>   arch/x86/xen/spinlock.c               | 23 +--------------
>   include/linux/spinlock_api_smp.h      |  2 +-
>   kernel/Kconfig.locks                  |  7 +++--
>   kernel/Kconfig.preempt                |  3 +-
>   kernel/locking/spinlock.c             |  2 +-
>   lib/Kconfig.debug                     |  1 -
>   16 files changed, 154 insertions(+), 83 deletions(-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
  2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
                   ` (8 preceding siblings ...)
  2015-06-08  4:09 ` Juergen Gross
@ 2015-06-16 14:37 ` Juergen Gross
  2015-06-16 15:18   ` Thomas Gleixner
  9 siblings, 1 reply; 17+ messages in thread
From: Juergen Gross @ 2015-06-16 14:37 UTC (permalink / raw)
  To: linux-kernel, x86, hpa, tglx, mingo, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, jeremy, chrisw, akataria, rusty,
	virtualization, gleb, pbonzini, kvm

AFAIK there have been no outstanding questions for more than one month
now. I'd appreciate some feedback or acceptance of these patches.


Juergen

On 04/30/2015 12:53 PM, Juergen Gross wrote:
> Paravirtualized spinlocks produce some overhead even if the kernel is
> running on bare metal. The main reason is the more complex locking
> and unlocking functions. Especially unlocking is no longer just one
> instruction but so complex that it is no longer inlined.
>
> This patch series addresses this issue by adding two more pvops
> functions to reduce the size of the inlined spinlock functions. When
> running on bare metal unlocking is again basically one instruction.
>
> Compile tested with CONFIG_PARAVIRT_SPINLOCKS on and off, 32 and 64
> bits.
>
> Functional testing on bare metal and as Xen dom0.
>
> Correct patching verified by disassembly of active kernel.
>
> Juergen Gross (6):
>    x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG
>    x86: move decision about clearing slowpath flag into arch_spin_lock()
>    x86: introduce new pvops function clear_slowpath
>    x86: introduce new pvops function spin_unlock
>    x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK
>    x86: remove no longer needed paravirt_ticketlocks_enabled
>
>   arch/x86/Kconfig                      |  1 -
>   arch/x86/include/asm/paravirt.h       | 13 +++++++++
>   arch/x86/include/asm/paravirt_types.h | 12 ++++++++
>   arch/x86/include/asm/spinlock.h       | 53 ++++++++++++-----------------------
>   arch/x86/include/asm/spinlock_types.h |  3 +-
>   arch/x86/kernel/kvm.c                 | 14 +--------
>   arch/x86/kernel/paravirt-spinlocks.c  | 42 +++++++++++++++++++++++++--
>   arch/x86/kernel/paravirt.c            | 12 ++++++++
>   arch/x86/kernel/paravirt_patch_32.c   | 25 +++++++++++++++++
>   arch/x86/kernel/paravirt_patch_64.c   | 24 ++++++++++++++++
>   arch/x86/xen/spinlock.c               | 23 +--------------
>   include/linux/spinlock_api_smp.h      |  2 +-
>   kernel/Kconfig.locks                  |  7 +++--
>   kernel/Kconfig.preempt                |  3 +-
>   kernel/locking/spinlock.c             |  2 +-
>   lib/Kconfig.debug                     |  1 -
>   16 files changed, 154 insertions(+), 83 deletions(-)
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
  2015-06-16 14:37 ` Juergen Gross
@ 2015-06-16 15:18   ` Thomas Gleixner
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2015-06-16 15:18 UTC (permalink / raw)
  To: Juergen Gross
  Cc: LKML, x86, H. Peter Anvin, Ingo Molnar, xen-devel, konrad.wilk,
	david.vrabel, boris.ostrovsky, Jeremy Fitzhardinge, Chris Wright,
	akataria, Rusty Russell, virtualization, gleb, pbonzini, kvm,
	Peter Zijlstra

On Tue, 16 Jun 2015, Juergen Gross wrote:

> AFAIK there are no outstanding questions for more than one month now.
> I'd appreciate some feedback or accepting these patches.

They are against dead code, which will be gone soon. We switched over
to queued locks.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread

Thread overview: 17+ messages
2015-04-30 10:53 [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Juergen Gross
2015-04-30 10:53 ` [PATCH 1/6] x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG Juergen Gross
2015-04-30 10:53 ` [PATCH 2/6] x86: move decision about clearing slowpath flag into arch_spin_lock() Juergen Gross
2015-04-30 10:54 ` [PATCH 3/6] x86: introduce new pvops function clear_slowpath Juergen Gross
2015-04-30 10:54 ` [PATCH 4/6] x86: introduce new pvops function spin_unlock Juergen Gross
2015-04-30 10:54 ` [PATCH 5/6] x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK Juergen Gross
2015-04-30 10:54 ` [PATCH 6/6] x86: remove no longer needed paravirt_ticketlocks_enabled Juergen Gross
2015-04-30 16:39 ` [PATCH 0/6] x86: reduce paravirtualized spinlock overhead Jeremy Fitzhardinge
2015-05-04  5:55   ` Juergen Gross
2015-05-05 17:21     ` Jeremy Fitzhardinge
2015-05-06 11:55       ` Juergen Gross
2015-05-17  5:30         ` Ingo Molnar
2015-05-18  8:11           ` Juergen Gross
2015-05-15 12:16 ` Juergen Gross
2015-06-08  4:09 ` Juergen Gross
2015-06-16 14:37 ` Juergen Gross
2015-06-16 15:18   ` Thomas Gleixner
