* [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Raghavendra K T @ 2012-03-21 10:20 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

Changes since last posting: (Raghavendra K T)
[
 - Rebased to linux-3.3-rc6.
 - used function + enum in place of macro (better type checking)
 - used cmpxchg while resetting zero status to close a possible race
	[suggested by Dave Hansen for KVM patches]
]

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism.

Ticket locks have an inherent problem in a virtualized case, because
the vCPUs are scheduled rather than running concurrently (ignoring
gang scheduled vCPUs).  This can result in catastrophic performance
collapses when the vCPU scheduler doesn't schedule the correct "next"
vCPU, and ends up scheduling a vCPU which burns its entire timeslice
spinning.  (Note that this is not the same problem as lock-holder
preemption, which this series also addresses; that's also a problem,
but not catastrophic).

(See Thomas Friebel's talk "Prevent Guests from Spinning Around"
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer
of indirection in front of all the spinlock functions, and defining a
completely new implementation for Xen (and for other pvops users, but
there are none at present).

PV ticketlocks keep the existing ticketlock implementation
(fastpath) as-is, but add a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
  iterations, then call out to the __ticket_lock_spinning() pvop,
  which allows a backend to block the vCPU rather than spinning.  This
  pvop can set the lock into "slowpath state".

- When releasing a lock, if it is in "slowpath state", then call
  __ticket_unlock_kick() to kick the next vCPU in line awake.  If the
  lock is no longer contended, it also clears the slowpath flag.

The "slowpath state" is stored in the LSB of the within the lock tail
ticket.  This has the effect of reducing the max number of CPUs by
half (so, a "small ticket" can deal with 128 CPUs, and "large ticket"
32768).
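
For reference, a minimal sketch (not the exact code; the real helpers
arrive with the "x86/ticketlock: add slowpath logic" patch later in
this series, and may differ in detail) of how a flag in the tail LSB
can be set and tested:

	/* Sketch: mark the lock as "in slowpath" by atomically setting the
	 * LSB of the tail ticket (safe because tickets now advance by 2). */
	static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
	{
		set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
	}

	/* Sketch: the unlock side tests the same bit after its locked add. */
	static inline bool __ticket_in_slowpath(arch_spinlock_t *lock)
	{
		return ACCESS_ONCE(lock->tickets.tail) & TICKET_SLOWPATH_FLAG;
	}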

This series provides a Xen implementation, but it should be
straightforward to add a KVM implementation as well.
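
Purely as an illustration (KVM support is not part of this series, and
kvm_lock_spinning/kvm_unlock_kick are hypothetical names), a backend
wires into the same two pvops much as the Xen code in patch 5 does:

	/* Hypothetical backend wiring, mirroring xen_init_spinlocks() in
	 * patch 5; after the callee-save conversion in patch 7 the
	 * lock_spinning assignment becomes PV_CALLEE_SAVE(kvm_lock_spinning). */
	void __init kvm_spinlock_init(void)
	{
		pv_lock_ops.lock_spinning = kvm_lock_spinning;
		pv_lock_ops.unlock_kick = kvm_unlock_kick;
	}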

Overall, it results in a large reduction in code, it makes the native
and virtualized cases closer, and it removes a layer of indirection
around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in "slowpath"
state) is optimal, identical to the non-paravirtualized case.

The inner part of ticket lock code becomes:
	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;
	for (;;) {
		unsigned count = SPIN_THRESHOLD;
		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:	barrier();
which results in:
	push   %rbp
	mov    %rsp,%rbp

	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f	# Slowpath if lock in contention

	pop    %rbp
	retq   

	### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi

2:	mov    $0x800,%eax
	jmp    4f

3:	pause  
	sub    $0x1,%eax
	je     5f

4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b

	pop    %rbp
	retq   

5:	callq  *__ticket_lock_spinning
	jmp    2b
	### SLOWPATH END

With CONFIG_PARAVIRT_SPINLOCKS=n, the code changes slightly: the
fastpath case (taking the lock without contention) is straight
through, and the spin loop is out of line:

	push   %rbp
	mov    %rsp,%rbp

	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f

	pop    %rbp
	retq   

	### SLOWPATH START
1:	pause  
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b

	pop    %rbp
	retq   
	### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's
"head" and fetch the slowpath flag from "tail".  This version of the
patch uses a locked add to do this, followed by a test to see if the
slowpath flag is set.  The lock prefix acts as a full memory barrier,
so we can be sure that other CPUs will have seen the unlock before we
read the flag (without the barrier the read could complete while the
unlock store is still sitting in the store buffer, not yet visible to
other CPUs, which could result in a deadlock).

This is all unnecessary complication if you're not using PV ticket
locks, so the patch also uses the jump-label machinery to fall back to
the standard "add"-based unlock in the non-PV case.

	if (TICKET_SLOWPATH_FLAG &&
	    unlikely(static_branch(&paravirt_ticketlocks_enabled))) {
		arch_spinlock_t prev;
		prev = *lock;
		add_smp(&lock->tickets.head, TICKET_LOCK_INC);

		/* add_smp() is a full mb() */
		if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
			__ticket_unlock_slowpath(lock, prev);
	} else
		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
which generates:
	push   %rbp
	mov    %rsp,%rbp

	nop5	# replaced by 5-byte jmp 2f when PV enabled

	# non-PV unlock
	addb   $0x2,(%rdi)

1:	pop    %rbp
	retq   

### PV unlock ###
2:	movzwl (%rdi),%esi	# Fetch prev

	lock addb $0x2,(%rdi)	# Do unlock

	testb  $0x1,0x1(%rdi)	# Test flag
	je     1b		# Finished if not set

### Slow path ###
	add    $2,%sil		# Add "head" in old lock state
	mov    %esi,%edx
	and    $0xfe,%dh	# clear slowflag for comparison
	movzbl %dh,%eax
	cmp    %dl,%al		# If head == tail (uncontended)
	je     4f		# clear slowpath flag

	# Kick next CPU waiting for lock
3:	movzbl %sil,%esi
	callq  *pv_lock_ops.kick

	pop    %rbp
	retq   

	# Lock no longer contended - clear slowflag
4:	mov    %esi,%eax
	lock cmpxchg %dx,(%rdi)	# cmpxchg to clear flag
	cmp    %si,%ax
	jne    3b		# If clear failed, then kick

	pop    %rbp
	retq   

So when not using PV ticketlocks, the unlock sequence just has a
5-byte nop added to it, and the PV case is reasonably straightforward
aside from requiring a "lock add".

Note that the patch series needs the jump-label split posted in
 https://lkml.org/lkml/2012/2/21/167 to avoid a cyclic header
dependency (needed to use the jump-label machinery).

TODO: remove CONFIG_PARAVIRT_SPINLOCKS when everybody is convinced.

Results:
=======
Machine: IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU, 8 cores, 64GB RAM

OS: enterprise linux 
Gcc  Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.5 20110214

base kernel = 3.3-rc6 (cloned Sunday 4th March)
unit = tps (higher is better)
benchmark = pgbench based on pgsql 9.2-dev:
	http://www.postgresql.org/ftp/snapshot/dev/ (link given by Attilio)

tool used to collect benchmark results: git://git.postgresql.org/git/pgbench-tools.git
config is the same as the tool's default except MAX_WORKER=8

Average taken over 10 iterations, analysed with the ministat tool.

BASE  (CONFIG_PARAVIRT_SPINLOCK = n)
==========================================
	------ scale=1 (32MB shared buf) ----------
Client	    N           Min           Max        Median           Avg        Stddev
  1  	x  10     3718.4108     4182.7842     3855.1089      3914.535     196.91943
  2  	x  10     7462.1997     7921.4638     7855.1965     7808.1603     135.37891
  4  	x  10     21682.402     23445.941     22151.922     22224.329     507.32299
  8  	x  10     43309.638     48103.494      45332.24     45593.135     1496.3735
 16  	x  10     108624.95     109227.45     108997.96     108987.84     210.15136
 32  	x  10      112582.1     113170.42     112776.92     112830.09     202.70556
 64  	x  10     100576.34     104011.92     103299.89     103034.24     928.24581
	----------------
	------ scale=500 (16GB shared buf) ----------
Client	    N           Min           Max        Median           Avg        Stddev
  1  	x  10     3451.9407     3948.3127     3512.2215     3610.6086     201.58491
  2  	x  10     7311.1769     7383.2552     7341.0847     7342.2349     21.231902
  4  	x  10     19582.548      26909.72     24778.282     23893.162     2587.6103
  8  	x  10     52292.765     54561.472     53171.286     53216.256     733.16626
 16  	x  10     89643.138     90353.598     89970.878     90018.505     213.73589
 32  	x  10     81010.402      81556.02     81256.217     81247.223     174.31678
 64  	x  10     83855.565     85048.602     84087.693      84201.86     352.25182
	----------------

BASE + jumplabel_split + jeremy patch (CONFIG_PARAVIRT_SPINLOCK = n)
=====================================================
	------ scale=1 (32MB shared buf) ----------
Client	    N           Min           Max        Median           Avg        Stddev
  1  	x  10     3669.2156     4102.5109     3732.9526     3784.4072     129.14134
  2  	x  10      7423.984     7797.5046     7446.8946     7500.2076     119.85178
  4  	x  10     21332.859     26327.619     24175.239     24084.731     1841.8335
  8  	x  10     43149.937     49515.406     45779.204     45838.782     2191.6348
 16  	x  10     109512.27     110407.82     109977.15     110019.72     283.41371
 32  	x  10      112653.3     113156.22     113023.24     112973.56     151.54906
 64  	x  10     102816.08     104514.48     103843.95     103658.17     515.10115
	----------------
	------ scale=500 (16GB shared buf) ----------
Client	    N           Min           Max        Median           Avg        Stddev
  1  	x  10     3501.3548     3985.3114     3609.0236     3705.6665      224.3719
  2  	x  10      7275.246     9026.7466     7447.4013     7581.6494     512.75417
  4  	x  10     19506.151     22661.801     20843.142     21154.886     1329.5591
  8  	x  10     53150.178     55594.073     54132.383     54227.117     728.42913
 16  	x  10      84281.93     91234.692     90917.411     90249.053      2108.903
 32  	x  10     80860.018     81500.369     81212.514     81201.361     205.66759
 64  	x  10     84090.033      85423.79     84505.041     84588.913     436.69012
	----------------

BASE + jumplabel_split+ jeremy patch (CONFIG_PARAVIRT_SPINLOCK = y)
=====================================================
	------ scale=1 (32MB shared buf) ----------
Client	    N           Min           Max        Median           Avg        Stddev
  1  	x  10     3749.8427     4149.0224     4120.6696     3982.6575     197.32902
  2  	x  10     7786.4802     8149.0902     7956.6706     7970.5441      94.42967
  4  	x  10     22053.383     27424.414     23514.166     23698.775      1492.792
  8  	x  10     44585.203     48082.115     46123.156     46135.687     1232.9399
 16  	x  10     108290.15     109655.13        108924     108968.59     476.48336
 32  	x  10     112359.02     112966.97     112570.06     112611.48     180.51304
 64  	x  10     103020.85     104042.71     103457.83     103496.84     291.19165
	----------------
	------ scale=500 (16GB shared buf) ----------
Client	    N           Min           Max        Median           Avg        Stddev
  1  	x  10     3462.6179     3898.5392     3871.6231     3738.0069     196.86077
  2  	x  10     7358.8148     7396.1029     7387.8169      7382.229     13.117357
  4  	x  10     19734.357     27799.895      21840.41     22964.202     3070.8067
  8  	x  10      52412.64     55214.305     53481.185     53552.261     878.21383
 16  	x  10     89862.081     90375.328     90161.886     90139.154     202.49282
 32  	x  10     80140.853     80898.452     80683.819     80671.361     227.13277
 64  	x  10     83402.864     84868.355     84311.472     84281.567      428.6501
	----------------
		
Summary of Avg
==============

Client  BASE         Base+patch               base+patch
	PARAVIRT=n   PARAVIRT=n (%improve)    PARAVIRT=y (%improve)
------ scale=1 (32MB shared buf) ----------
1	3914.535     3784.4072 (-3.32422)      3982.6575 (+1.74025)
2	7808.1603    7500.2076 (-3.94399)      7970.5441 (+2.07967)
4	22224.329    24084.731 (+8.37102)     23698.775  (+6.63438)
8	45593.135    45838.782 (+0.538781)    46135.687  (+1.18999)
16	108987.84    110019.72 (+0.946785)    108968.59  (-0.0176625)
32	112830.09    112973.56 (+0.127156)    112611.48  (-0.193752)
64	103034.24    103658.17 (+0.605556)    103496.84  (+0.448977)

------ scale=500 (~16GB shared buf) ----------
1	3610.6086    3705.6665 (+2.63274)     3738.0069  (+3.52844)
2	7342.2349    7581.6494 (+3.26079)     7382.229   (+0.544713)
4	23893.162    21154.886 (-11.4605)     22964.202  (-3.88797)
8	53216.256    54227.117 (+1.89953)     53552.261  (+0.631395)
16	90018.505    90249.053 (+0.256112)    90139.154  (+0.134027)
32	81247.223    81201.361 (-0.0564475)    80671.361 (-0.708777)
64	84201.86     84588.913 (+0.459673)    84281.567  (+0.0946618)

Thoughts? Comments? Suggestions?

Jeremy Fitzhardinge (10):
  x86/spinlock: replace pv spinlocks with pv ticketlocks
  x86/ticketlock: don't inline _spin_unlock when using paravirt
    spinlocks
  x86/ticketlock: collapse a layer of functions
  xen: defer spinlock setup until boot CPU setup
  xen/pvticketlock: Xen implementation for PV ticket locks
  xen/pvticketlocks: add xen_nopvspin parameter to disable xen pv
    ticketlocks
  x86/pvticketlock: use callee-save for lock_spinning
  x86/pvticketlock: when paravirtualizing ticket locks, increment by 2
  x86/ticketlock: add slowpath logic
  xen/pvticketlock: allow interrupts to be enabled while blocking

Stefano Stabellini (1):
  xen: enable PV ticketlocks on HVM Xen
---
 arch/x86/Kconfig                      |    3 +
 arch/x86/include/asm/paravirt.h       |   32 +---
 arch/x86/include/asm/paravirt_types.h |   10 +-
 arch/x86/include/asm/spinlock.h       |  128 ++++++++----
 arch/x86/include/asm/spinlock_types.h |   16 +-
 arch/x86/kernel/paravirt-spinlocks.c  |   18 +--
 arch/x86/xen/smp.c                    |    3 +-
 arch/x86/xen/spinlock.c               |  383 +++++++++++----------------------
 kernel/Kconfig.locks                  |    2 +-
 9 files changed, 245 insertions(+), 350 deletions(-)



* [PATCH RFC V6 1/11]  x86/spinlock: replace pv spinlocks with pv ticketlocks
From: Raghavendra K T @ 2012-03-21 10:20 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

Rather than outright replacing the entire spinlock implementation in
order to paravirtualize it, keep the ticket lock implementation but add
a couple of pvops hooks for the slow path (long spin on lock, unlocking
a contended lock).

Ticket locks have a number of nice properties, but they also have some
surprising behaviours in virtual environments.  They enforce a strict
FIFO ordering on cpus trying to take a lock; however, if the hypervisor
scheduler does not schedule the cpus in the correct order, the system can
waste a huge amount of time spinning until the next cpu can take the lock.

(See Thomas Friebel's talk "Prevent Guests from Spinning Around"
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)

To address this, we add two hooks:
 - __ticket_lock_spinning, which is called after the cpu has been
   spinning on the lock for a significant number of iterations but has
   failed to take the lock (presumably because the cpu holding the lock
   has been descheduled).  The lock_spinning pvop is expected to block
   the cpu until it has been kicked by the current lock holder.
 - __ticket_unlock_kick, which is called when releasing a contended lock
   (there are more cpus with tail tickets); it looks to see if the next
   cpu is blocked and wakes it if so.

When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
functions causes all the extra code to go away.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/include/asm/paravirt.h       |   32 ++++----------------
 arch/x86/include/asm/paravirt_types.h |   10 ++----
 arch/x86/include/asm/spinlock.h       |   53 ++++++++++++++++++++++++++------
 arch/x86/include/asm/spinlock_types.h |    4 --
 arch/x86/kernel/paravirt-spinlocks.c  |   15 +--------
 arch/x86/xen/spinlock.c               |    8 ++++-
 6 files changed, 61 insertions(+), 61 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a7d2db9..0fa4553 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -750,36 +750,16 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 
-static inline int arch_spin_is_locked(struct arch_spinlock *lock)
+static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
+							__ticket_t ticket)
 {
-	return PVOP_CALL1(int, pv_lock_ops.spin_is_locked, lock);
+	PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket);
 }
 
-static inline int arch_spin_is_contended(struct arch_spinlock *lock)
+static __always_inline void ____ticket_unlock_kick(struct arch_spinlock *lock,
+							__ticket_t ticket)
 {
-	return PVOP_CALL1(int, pv_lock_ops.spin_is_contended, lock);
-}
-#define arch_spin_is_contended	arch_spin_is_contended
-
-static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
-{
-	PVOP_VCALL1(pv_lock_ops.spin_lock, lock);
-}
-
-static __always_inline void arch_spin_lock_flags(struct arch_spinlock *lock,
-						  unsigned long flags)
-{
-	PVOP_VCALL2(pv_lock_ops.spin_lock_flags, lock, flags);
-}
-
-static __always_inline int arch_spin_trylock(struct arch_spinlock *lock)
-{
-	return PVOP_CALL1(int, pv_lock_ops.spin_trylock, lock);
-}
-
-static __always_inline void arch_spin_unlock(struct arch_spinlock *lock)
-{
-	PVOP_VCALL1(pv_lock_ops.spin_unlock, lock);
+	PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
 
 #endif
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8e8b9a4..005e24d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -327,13 +327,11 @@ struct pv_mmu_ops {
 };
 
 struct arch_spinlock;
+#include <asm/spinlock_types.h>
+
 struct pv_lock_ops {
-	int (*spin_is_locked)(struct arch_spinlock *lock);
-	int (*spin_is_contended)(struct arch_spinlock *lock);
-	void (*spin_lock)(struct arch_spinlock *lock);
-	void (*spin_lock_flags)(struct arch_spinlock *lock, unsigned long flags);
-	int (*spin_trylock)(struct arch_spinlock *lock);
-	void (*spin_unlock)(struct arch_spinlock *lock);
+	void (*lock_spinning)(struct arch_spinlock *lock, __ticket_t ticket);
+	void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index a82c2bf..7e66b85 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -37,6 +37,35 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+/* How long a lock should spin before we consider blocking */
+#define SPIN_THRESHOLD	(1 << 11)
+
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
+
+static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
+							__ticket_t ticket)
+{
+}
+
+static __always_inline void ____ticket_unlock_kick(struct arch_spinlock *lock,
+							 __ticket_t ticket)
+{
+}
+
+#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
+
+
+/*
+ * If a spinlock has someone waiting on it, then kick the appropriate
+ * waiting cpu.
+ */
+static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
+							__ticket_t next)
+{
+	if (unlikely(lock->tickets.tail != next))
+		____ticket_unlock_kick(lock, next);
+}
+
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -50,19 +79,24 @@
  * in the high part, because a wide xadd increment of the low part would carry
  * up and contaminate the high part.
  */
-static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
+static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
 {
 	register struct __raw_tickets inc = { .tail = 1 };
 
 	inc = xadd(&lock->tickets, inc);
 
 	for (;;) {
-		if (inc.head == inc.tail)
-			break;
-		cpu_relax();
-		inc.head = ACCESS_ONCE(lock->tickets.head);
+		unsigned count = SPIN_THRESHOLD;
+
+		do {
+			if (inc.head == inc.tail)
+				goto out;
+			cpu_relax();
+			inc.head = ACCESS_ONCE(lock->tickets.head);
+		} while (--count);
+		__ticket_lock_spinning(lock, inc.tail);
 	}
-	barrier();		/* make sure nothing creeps before the lock is taken */
+out:	barrier();	/* make sure nothing creeps before the lock is taken */
 }
 
 static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
@@ -81,7 +115,10 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
 
 static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
 {
+	__ticket_t next = lock->tickets.head + 1;
+
 	__add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
+	__ticket_unlock_kick(lock, next);
 }
 
 static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
@@ -98,8 +135,6 @@ static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
 	return ((tmp.tail - tmp.head) & TICKET_MASK) > 1;
 }
 
-#ifndef CONFIG_PARAVIRT_SPINLOCKS
-
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
 	return __ticket_spin_is_locked(lock);
@@ -132,8 +167,6 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 	arch_spin_lock(lock);
 }
 
-#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
-
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
 	while (arch_spin_is_locked(lock))
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 8ebd5df..dbe223d 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -1,10 +1,6 @@
 #ifndef _ASM_X86_SPINLOCK_TYPES_H
 #define _ASM_X86_SPINLOCK_TYPES_H
 
-#ifndef __LINUX_SPINLOCK_TYPES_H
-# error "please don't include this file directly"
-#endif
-
 #include <linux/types.h>
 
 #if (CONFIG_NR_CPUS < 256)
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 676b8c7..c2e010e 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -7,21 +7,10 @@
 
 #include <asm/paravirt.h>
 
-static inline void
-default_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
-{
-	arch_spin_lock(lock);
-}
-
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
-	.spin_is_locked = __ticket_spin_is_locked,
-	.spin_is_contended = __ticket_spin_is_contended,
-
-	.spin_lock = __ticket_spin_lock,
-	.spin_lock_flags = default_spin_lock_flags,
-	.spin_trylock = __ticket_spin_trylock,
-	.spin_unlock = __ticket_spin_unlock,
+	.lock_spinning = paravirt_nop,
+	.unlock_kick = paravirt_nop,
 #endif
 };
 EXPORT_SYMBOL(pv_lock_ops);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d69cc6c..1258514 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -138,6 +138,9 @@ struct xen_spinlock {
 	xen_spinners_t spinners;	/* count of waiting cpus */
 };
 
+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+
+#if 0
 static int xen_spin_is_locked(struct arch_spinlock *lock)
 {
 	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
@@ -165,7 +168,6 @@ static int xen_spin_trylock(struct arch_spinlock *lock)
 	return old == 0;
 }
 
-static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
 static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);
 
 /*
@@ -353,6 +355,7 @@ static void xen_spin_unlock(struct arch_spinlock *lock)
 	if (unlikely(xl->spinners))
 		xen_spin_unlock_slow(xl);
 }
+#endif
 
 static irqreturn_t dummy_handler(int irq, void *dev_id)
 {
@@ -389,13 +392,14 @@ void xen_uninit_lock_cpu(int cpu)
 void __init xen_init_spinlocks(void)
 {
 	BUILD_BUG_ON(sizeof(struct xen_spinlock) > sizeof(arch_spinlock_t));
-
+#if 0
 	pv_lock_ops.spin_is_locked = xen_spin_is_locked;
 	pv_lock_ops.spin_is_contended = xen_spin_is_contended;
 	pv_lock_ops.spin_lock = xen_spin_lock;
 	pv_lock_ops.spin_lock_flags = xen_spin_lock_flags;
 	pv_lock_ops.spin_trylock = xen_spin_trylock;
 	pv_lock_ops.spin_unlock = xen_spin_unlock;
+#endif
 }
 
 #ifdef CONFIG_XEN_DEBUG_FS



* [PATCH RFC V6 2/11]  x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks
From: Raghavendra K T @ 2012-03-21 10:21 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

The code size expands somewhat, and it's probably better to just call
a function rather than inline it.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/Kconfig     |    3 +++
 kernel/Kconfig.locks |    2 +-
 2 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e..10c28ec 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -623,6 +623,9 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer N.
 
+config ARCH_NOINLINE_SPIN_UNLOCK
+       def_bool PARAVIRT_SPINLOCKS
+
 config PARAVIRT_CLOCK
 	bool
 
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 5068e2a..584637b 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -125,7 +125,7 @@ config INLINE_SPIN_LOCK_IRQSAVE
 		 ARCH_INLINE_SPIN_LOCK_IRQSAVE
 
 config INLINE_SPIN_UNLOCK
-	def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK)
+	def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK) && !ARCH_NOINLINE_SPIN_UNLOCK
 
 config INLINE_SPIN_UNLOCK_BH
 	def_bool !DEBUG_SPINLOCK && ARCH_INLINE_SPIN_UNLOCK_BH



* [PATCH RFC V6 3/11]  x86/ticketlock: collapse a layer of functions
From: Raghavendra K T @ 2012-03-21 10:21 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

Now that the paravirtualization layer doesn't exist at the spinlock
level any more, we can collapse the __ticket_ functions into the arch_
functions.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/include/asm/spinlock.h |   35 +++++------------------------------
 1 files changed, 5 insertions(+), 30 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 7e66b85..f6442f4 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -79,7 +79,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
  * in the high part, because a wide xadd increment of the low part would carry
  * up and contaminate the high part.
  */
-static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
+static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
 {
 	register struct __raw_tickets inc = { .tail = 1 };
 
@@ -99,7 +99,7 @@ static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
 out:	barrier();	/* make sure nothing creeps before the lock is taken */
 }
 
-static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
+static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 {
 	arch_spinlock_t old, new;
 
@@ -113,7 +113,7 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
 	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
 }
 
-static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
+static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
 	__ticket_t next = lock->tickets.head + 1;
 
@@ -121,46 +121,21 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
 	__ticket_unlock_kick(lock, next);
 }
 
-static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
+static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
 	struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
 	return !!(tmp.tail ^ tmp.head);
 }
 
-static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
+static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
 	struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
 	return ((tmp.tail - tmp.head) & TICKET_MASK) > 1;
 }
-
-static inline int arch_spin_is_locked(arch_spinlock_t *lock)
-{
-	return __ticket_spin_is_locked(lock);
-}
-
-static inline int arch_spin_is_contended(arch_spinlock_t *lock)
-{
-	return __ticket_spin_is_contended(lock);
-}
 #define arch_spin_is_contended	arch_spin_is_contended
 
-static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
-{
-	__ticket_spin_lock(lock);
-}
-
-static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
-{
-	return __ticket_spin_trylock(lock);
-}
-
-static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
-{
-	__ticket_spin_unlock(lock);
-}
-
 static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 						  unsigned long flags)
 {



* [PATCH RFC V6 4/11]  xen: defer spinlock setup until boot CPU setup
From: Raghavendra K T @ 2012-03-21 10:21 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

There's no need to do it at very early init, and doing it there
makes it impossible to use the jump_label machinery.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/xen/smp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 501d4e0..e85d3ee 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -201,6 +201,7 @@ static void __init xen_smp_prepare_boot_cpu(void)
 
 	xen_filter_cpu_maps();
 	xen_setup_vcpu_info_placement();
+	xen_init_spinlocks();
 }
 
 static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
@@ -530,7 +531,6 @@ void __init xen_smp_init(void)
 {
 	smp_ops = xen_smp_ops;
 	xen_fill_possible_map();
-	xen_init_spinlocks();
 }
 
 static void __init xen_hvm_smp_prepare_cpus(unsigned int max_cpus)



* [PATCH RFC V6 5/11]  xen/pvticketlock: Xen implementation for PV ticket locks
From: Raghavendra K T @ 2012-03-21 10:21 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

Replace the old Xen implementation of PV spinlocks with an implementation
of xen_lock_spinning and xen_unlock_kick.

xen_lock_spinning simply registers the cpu in its entry in lock_waiting,
adds itself to the waiting_cpus set, and blocks on an event channel
until the channel becomes pending.

xen_unlock_kick searches the cpus in waiting_cpus looking for the one
which wants this lock with the next ticket, if any.  If found, it kicks
it by making its event channel pending, which wakes it up.

We need to make sure interrupts are disabled while we're relying on the
contents of the per-cpu lock_waiting values, otherwise an interrupt
handler could come in, try to take some other lock, block, and overwrite
our values.

Raghu: use function + enum instead of macro, cmpxchg for zero status reset

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/spinlock.c |  344 +++++++++++------------------------------------
 1 files changed, 77 insertions(+), 267 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 1258514..0e4d057 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -16,45 +16,46 @@
 #include "xen-ops.h"
 #include "debugfs.h"
 
+enum xen_contention_stat {
+	TAKEN_SLOW,
+	TAKEN_SLOW_PICKUP,
+	TAKEN_SLOW_SPURIOUS,
+	RELEASED_SLOW,
+	RELEASED_SLOW_KICKED,
+	NR_CONTENTION_STATS
+};
+
+
 #ifdef CONFIG_XEN_DEBUG_FS
 static struct xen_spinlock_stats
 {
-	u64 taken;
-	u32 taken_slow;
-	u32 taken_slow_nested;
-	u32 taken_slow_pickup;
-	u32 taken_slow_spurious;
-	u32 taken_slow_irqenable;
-
-	u64 released;
-	u32 released_slow;
-	u32 released_slow_kicked;
+	u32 contention_stats[NR_CONTENTION_STATS];
 
 #define HISTO_BUCKETS	30
-	u32 histo_spin_total[HISTO_BUCKETS+1];
-	u32 histo_spin_spinning[HISTO_BUCKETS+1];
 	u32 histo_spin_blocked[HISTO_BUCKETS+1];
 
-	u64 time_total;
-	u64 time_spinning;
 	u64 time_blocked;
 } spinlock_stats;
 
 static u8 zero_stats;
 
-static unsigned lock_timeout = 1 << 10;
-#define TIMEOUT lock_timeout
-
 static inline void check_zero(void)
 {
-	if (unlikely(zero_stats)) {
-		memset(&spinlock_stats, 0, sizeof(spinlock_stats));
-		zero_stats = 0;
+	u8 ret;
+	u8 old = ACCESS_ONCE(zero_stats);
+	if (unlikely(old)) {
+		ret = cmpxchg(&zero_stats, old, 0);
+		/* This ensures only one fellow resets the stat */
+		if (ret == old)
+			memset(&spinlock_stats, 0, sizeof(spinlock_stats));
 	}
 }
 
-#define ADD_STATS(elem, val)			\
-	do { check_zero(); spinlock_stats.elem += (val); } while(0)
+static inline void add_stats(enum xen_contention_stat var, u32 val)
+{
+	check_zero();
+	spinlock_stats.contention_stats[var] += val;
+}
 
 static inline u64 spin_time_start(void)
 {
@@ -73,22 +74,6 @@ static void __spin_time_accum(u64 delta, u32 *array)
 		array[HISTO_BUCKETS]++;
 }
 
-static inline void spin_time_accum_spinning(u64 start)
-{
-	u32 delta = xen_clocksource_read() - start;
-
-	__spin_time_accum(delta, spinlock_stats.histo_spin_spinning);
-	spinlock_stats.time_spinning += delta;
-}
-
-static inline void spin_time_accum_total(u64 start)
-{
-	u32 delta = xen_clocksource_read() - start;
-
-	__spin_time_accum(delta, spinlock_stats.histo_spin_total);
-	spinlock_stats.time_total += delta;
-}
-
 static inline void spin_time_accum_blocked(u64 start)
 {
 	u32 delta = xen_clocksource_read() - start;
@@ -98,19 +83,15 @@ static inline void spin_time_accum_blocked(u64 start)
 }
 #else  /* !CONFIG_XEN_DEBUG_FS */
 #define TIMEOUT			(1 << 10)
-#define ADD_STATS(elem, val)	do { (void)(val); } while(0)
+static inline void add_stats(enum xen_contention_stat var, u32 val)
+{
+}
 
 static inline u64 spin_time_start(void)
 {
 	return 0;
 }
 
-static inline void spin_time_accum_total(u64 start)
-{
-}
-static inline void spin_time_accum_spinning(u64 start)
-{
-}
 static inline void spin_time_accum_blocked(u64 start)
 {
 }
@@ -133,230 +114,83 @@ typedef u16 xen_spinners_t;
 	asm(LOCK_PREFIX " decw %0" : "+m" ((xl)->spinners) : : "memory");
 #endif
 
-struct xen_spinlock {
-	unsigned char lock;		/* 0 -> free; 1 -> locked */
-	xen_spinners_t spinners;	/* count of waiting cpus */
+struct xen_lock_waiting {
+	struct arch_spinlock *lock;
+	__ticket_t want;
 };
 
 static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
+static cpumask_t waiting_cpus;
 
-#if 0
-static int xen_spin_is_locked(struct arch_spinlock *lock)
+static void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 {
-	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
-
-	return xl->lock != 0;
-}
-
-static int xen_spin_is_contended(struct arch_spinlock *lock)
-{
-	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
-
-	/* Not strictly true; this is only the count of contended
-	   lock-takers entering the slow path. */
-	return xl->spinners != 0;
-}
-
-static int xen_spin_trylock(struct arch_spinlock *lock)
-{
-	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
-	u8 old = 1;
-
-	asm("xchgb %b0,%1"
-	    : "+q" (old), "+m" (xl->lock) : : "memory");
-
-	return old == 0;
-}
-
-static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);
-
-/*
- * Mark a cpu as interested in a lock.  Returns the CPU's previous
- * lock of interest, in case we got preempted by an interrupt.
- */
-static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
-{
-	struct xen_spinlock *prev;
-
-	prev = __this_cpu_read(lock_spinners);
-	__this_cpu_write(lock_spinners, xl);
-
-	wmb();			/* set lock of interest before count */
-
-	inc_spinners(xl);
-
-	return prev;
-}
-
-/*
- * Mark a cpu as no longer interested in a lock.  Restores previous
- * lock of interest (NULL for none).
- */
-static inline void unspinning_lock(struct xen_spinlock *xl, struct xen_spinlock *prev)
-{
-	dec_spinners(xl);
-	wmb();			/* decrement count before restoring lock */
-	__this_cpu_write(lock_spinners, prev);
-}
-
-static noinline int xen_spin_lock_slow(struct arch_spinlock *lock, bool irq_enable)
-{
-	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
-	struct xen_spinlock *prev;
 	int irq = __this_cpu_read(lock_kicker_irq);
-	int ret;
+	struct xen_lock_waiting *w = &__get_cpu_var(lock_waiting);
+	int cpu = smp_processor_id();
 	u64 start;
+	unsigned long flags;
 
 	/* If kicker interrupts not initialized yet, just spin */
 	if (irq == -1)
-		return 0;
+		return;
 
 	start = spin_time_start();
 
-	/* announce we're spinning */
-	prev = spinning_lock(xl);
-
-	ADD_STATS(taken_slow, 1);
-	ADD_STATS(taken_slow_nested, prev != NULL);
-
-	do {
-		unsigned long flags;
-
-		/* clear pending */
-		xen_clear_irq_pending(irq);
-
-		/* check again make sure it didn't become free while
-		   we weren't looking  */
-		ret = xen_spin_trylock(lock);
-		if (ret) {
-			ADD_STATS(taken_slow_pickup, 1);
-
-			/*
-			 * If we interrupted another spinlock while it
-			 * was blocking, make sure it doesn't block
-			 * without rechecking the lock.
-			 */
-			if (prev != NULL)
-				xen_set_irq_pending(irq);
-			goto out;
-		}
+	/*
+	 * Make sure an interrupt handler can't upset things in a
+	 * partially setup state.
+	 */
+	local_irq_save(flags);
 
-		flags = arch_local_save_flags();
-		if (irq_enable) {
-			ADD_STATS(taken_slow_irqenable, 1);
-			raw_local_irq_enable();
-		}
+	w->want = want;
+	smp_wmb();
+	w->lock = lock;
 
-		/*
-		 * Block until irq becomes pending.  If we're
-		 * interrupted at this point (after the trylock but
-		 * before entering the block), then the nested lock
-		 * handler guarantees that the irq will be left
-		 * pending if there's any chance the lock became free;
-		 * xen_poll_irq() returns immediately if the irq is
-		 * pending.
-		 */
-		xen_poll_irq(irq);
+	/* This uses set_bit, which atomic and therefore a barrier */
+	cpumask_set_cpu(cpu, &waiting_cpus);
+	add_stats(TAKEN_SLOW, 1);
 
-		raw_local_irq_restore(flags);
+	/* clear pending */
+	xen_clear_irq_pending(irq);
 
-		ADD_STATS(taken_slow_spurious, !xen_test_irq_pending(irq));
-	} while (!xen_test_irq_pending(irq)); /* check for spurious wakeups */
+	/* Only check lock once pending cleared */
+	barrier();
 
+	/* check again make sure it didn't become free while
+	   we weren't looking  */
+	if (ACCESS_ONCE(lock->tickets.head) == want) {
+		add_stats(TAKEN_SLOW_PICKUP, 1);
+		goto out;
+	}
+	/* Block until irq becomes pending (or perhaps a spurious wakeup) */
+	xen_poll_irq(irq);
+	add_stats(TAKEN_SLOW_SPURIOUS, !xen_test_irq_pending(irq));
 	kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
-
 out:
-	unspinning_lock(xl, prev);
+	cpumask_clear_cpu(cpu, &waiting_cpus);
+	w->lock = NULL;
+	local_irq_restore(flags);
 	spin_time_accum_blocked(start);
-
-	return ret;
 }
 
-static inline void __xen_spin_lock(struct arch_spinlock *lock, bool irq_enable)
-{
-	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
-	unsigned timeout;
-	u8 oldval;
-	u64 start_spin;
-
-	ADD_STATS(taken, 1);
-
-	start_spin = spin_time_start();
-
-	do {
-		u64 start_spin_fast = spin_time_start();
-
-		timeout = TIMEOUT;
-
-		asm("1: xchgb %1,%0\n"
-		    "   testb %1,%1\n"
-		    "   jz 3f\n"
-		    "2: rep;nop\n"
-		    "   cmpb $0,%0\n"
-		    "   je 1b\n"
-		    "   dec %2\n"
-		    "   jnz 2b\n"
-		    "3:\n"
-		    : "+m" (xl->lock), "=q" (oldval), "+r" (timeout)
-		    : "1" (1)
-		    : "memory");
-
-		spin_time_accum_spinning(start_spin_fast);
-
-	} while (unlikely(oldval != 0 &&
-			  (TIMEOUT == ~0 || !xen_spin_lock_slow(lock, irq_enable))));
-
-	spin_time_accum_total(start_spin);
-}
-
-static void xen_spin_lock(struct arch_spinlock *lock)
-{
-	__xen_spin_lock(lock, false);
-}
-
-static void xen_spin_lock_flags(struct arch_spinlock *lock, unsigned long flags)
-{
-	__xen_spin_lock(lock, !raw_irqs_disabled_flags(flags));
-}
-
-static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
+static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
 {
 	int cpu;
 
-	ADD_STATS(released_slow, 1);
+	add_stats(RELEASED_SLOW, 1);
 
-	for_each_online_cpu(cpu) {
-		/* XXX should mix up next cpu selection */
-		if (per_cpu(lock_spinners, cpu) == xl) {
-			ADD_STATS(released_slow_kicked, 1);
+	for_each_cpu(cpu, &waiting_cpus) {
+		const struct xen_lock_waiting *w = &per_cpu(lock_waiting, cpu);
+
+		if (w->lock == lock && w->want == next) {
+			add_stats(RELEASED_SLOW_KICKED, 1);
 			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
 			break;
 		}
 	}
 }
 
-static void xen_spin_unlock(struct arch_spinlock *lock)
-{
-	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
-
-	ADD_STATS(released, 1);
-
-	smp_wmb();		/* make sure no writes get moved after unlock */
-	xl->lock = 0;		/* release lock */
-
-	/*
-	 * Make sure unlock happens before checking for waiting
-	 * spinners.  We need a strong barrier to enforce the
-	 * write-read ordering to different memory locations, as the
-	 * CPU makes no implied guarantees about their ordering.
-	 */
-	mb();
-
-	if (unlikely(xl->spinners))
-		xen_spin_unlock_slow(xl);
-}
-#endif
-
 static irqreturn_t dummy_handler(int irq, void *dev_id)
 {
 	BUG();
@@ -391,15 +225,8 @@ void xen_uninit_lock_cpu(int cpu)
 
 void __init xen_init_spinlocks(void)
 {
-	BUILD_BUG_ON(sizeof(struct xen_spinlock) > sizeof(arch_spinlock_t));
-#if 0
-	pv_lock_ops.spin_is_locked = xen_spin_is_locked;
-	pv_lock_ops.spin_is_contended = xen_spin_is_contended;
-	pv_lock_ops.spin_lock = xen_spin_lock;
-	pv_lock_ops.spin_lock_flags = xen_spin_lock_flags;
-	pv_lock_ops.spin_trylock = xen_spin_trylock;
-	pv_lock_ops.spin_unlock = xen_spin_unlock;
-#endif
+	pv_lock_ops.lock_spinning = xen_lock_spinning;
+	pv_lock_ops.unlock_kick = xen_unlock_kick;
 }
 
 #ifdef CONFIG_XEN_DEBUG_FS
@@ -417,42 +244,25 @@ static int __init xen_spinlock_debugfs(void)
 
 	debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
 
-	debugfs_create_u32("timeout", 0644, d_spin_debug, &lock_timeout);
-
-	debugfs_create_u64("taken", 0444, d_spin_debug, &spinlock_stats.taken);
 	debugfs_create_u32("taken_slow", 0444, d_spin_debug,
-			   &spinlock_stats.taken_slow);
-	debugfs_create_u32("taken_slow_nested", 0444, d_spin_debug,
-			   &spinlock_stats.taken_slow_nested);
+			   &spinlock_stats.contention_stats[TAKEN_SLOW]);
 	debugfs_create_u32("taken_slow_pickup", 0444, d_spin_debug,
-			   &spinlock_stats.taken_slow_pickup);
+			   &spinlock_stats.contention_stats[TAKEN_SLOW_PICKUP]);
 	debugfs_create_u32("taken_slow_spurious", 0444, d_spin_debug,
-			   &spinlock_stats.taken_slow_spurious);
-	debugfs_create_u32("taken_slow_irqenable", 0444, d_spin_debug,
-			   &spinlock_stats.taken_slow_irqenable);
+			   &spinlock_stats.contention_stats[TAKEN_SLOW_SPURIOUS]);
 
-	debugfs_create_u64("released", 0444, d_spin_debug, &spinlock_stats.released);
 	debugfs_create_u32("released_slow", 0444, d_spin_debug,
-			   &spinlock_stats.released_slow);
+			   &spinlock_stats.contention_stats[RELEASED_SLOW]);
 	debugfs_create_u32("released_slow_kicked", 0444, d_spin_debug,
-			   &spinlock_stats.released_slow_kicked);
+			   &spinlock_stats.contention_stats[RELEASED_SLOW_KICKED]);
 
-	debugfs_create_u64("time_spinning", 0444, d_spin_debug,
-			   &spinlock_stats.time_spinning);
 	debugfs_create_u64("time_blocked", 0444, d_spin_debug,
 			   &spinlock_stats.time_blocked);
-	debugfs_create_u64("time_total", 0444, d_spin_debug,
-			   &spinlock_stats.time_total);
 
-	xen_debugfs_create_u32_array("histo_total", 0444, d_spin_debug,
-				     spinlock_stats.histo_spin_total, HISTO_BUCKETS + 1);
-	xen_debugfs_create_u32_array("histo_spinning", 0444, d_spin_debug,
-				     spinlock_stats.histo_spin_spinning, HISTO_BUCKETS + 1);
 	xen_debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
 				     spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
 
 	return 0;
 }
 fs_initcall(xen_spinlock_debugfs);
-
 #endif	/* CONFIG_XEN_DEBUG_FS */



* [PATCH RFC V6 6/11]  xen/pvticketlocks: add xen_nopvspin parameter to disable xen pv ticketlocks
From: Raghavendra K T @ 2012-03-21 10:21 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/xen/spinlock.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 0e4d057..5dce49d 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -223,12 +223,26 @@ void xen_uninit_lock_cpu(int cpu)
 	unbind_from_irqhandler(per_cpu(lock_kicker_irq, cpu), NULL);
 }
 
+static bool xen_pvspin __initdata = true;
+
 void __init xen_init_spinlocks(void)
 {
+	if (!xen_pvspin) {
+		printk(KERN_DEBUG "xen: PV spinlocks disabled\n");
+		return;
+	}
+
 	pv_lock_ops.lock_spinning = xen_lock_spinning;
 	pv_lock_ops.unlock_kick = xen_unlock_kick;
 }
 
+static __init int xen_parse_nopvspin(char *arg)
+{
+	xen_pvspin = false;
+	return 0;
+}
+early_param("xen_nopvspin", xen_parse_nopvspin);
+
 #ifdef CONFIG_XEN_DEBUG_FS
 
 static struct dentry *d_spin_debug;



* [PATCH RFC V6 7/11]  x86/pvticketlock: use callee-save for lock_spinning
From: Raghavendra K T @ 2012-03-21 10:21 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

Although the lock_spinning calls in the spinlock code are on the
uncommon path, their presence can cause the compiler to generate many
more register save/restores in the function pre/postamble, which is in
the fast path.  To avoid this, convert it to using the pvops callee-save
calling convention, which defers all the save/restores until the actual
function is called, keeping the fastpath clean.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/include/asm/paravirt.h       |    2 +-
 arch/x86/include/asm/paravirt_types.h |    2 +-
 arch/x86/kernel/paravirt-spinlocks.c  |    2 +-
 arch/x86/xen/spinlock.c               |    3 ++-
 4 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 0fa4553..4343419 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -753,7 +753,7 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
 							__ticket_t ticket)
 {
-	PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket);
+	PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket);
 }
 
 static __always_inline void ____ticket_unlock_kick(struct arch_spinlock *lock,
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 005e24d..5e0c138 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -330,7 +330,7 @@ struct arch_spinlock;
 #include <asm/spinlock_types.h>
 
 struct pv_lock_ops {
-	void (*lock_spinning)(struct arch_spinlock *lock, __ticket_t ticket);
+	struct paravirt_callee_save lock_spinning;
 	void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
 };
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index c2e010e..4251c1d 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -9,7 +9,7 @@
 
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
-	.lock_spinning = paravirt_nop,
+	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
 	.unlock_kick = paravirt_nop,
 #endif
 };
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 5dce49d..176a554 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -173,6 +173,7 @@ out:
 	local_irq_restore(flags);
 	spin_time_accum_blocked(start);
 }
+PV_CALLEE_SAVE_REGS_THUNK(xen_lock_spinning);
 
 static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
 {
@@ -232,7 +233,7 @@ void __init xen_init_spinlocks(void)
 		return;
 	}
 
-	pv_lock_ops.lock_spinning = xen_lock_spinning;
+	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
 	pv_lock_ops.unlock_kick = xen_unlock_kick;
 }
 



* [PATCH RFC V6 8/11]  x86/pvticketlock: when paravirtualizing ticket locks, increment by 2
From: Raghavendra K T @ 2012-03-21 10:22 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

Increment the ticket head and tail by 2 rather than 1 to leave the LSB
free to store an "is in slowpath state" bit.  This halves the number
of possible CPUs for a given ticket size, but this shouldn't matter
in practice - kernels built for 32k+ CPU systems are probably
specially built for the hardware rather than being a generic distro
kernel.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/include/asm/spinlock.h       |   10 +++++-----
 arch/x86/include/asm/spinlock_types.h |   10 +++++++++-
 2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index f6442f4..c14263a 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -81,7 +81,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
  */
 static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
 {
-	register struct __raw_tickets inc = { .tail = 1 };
+	register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
 
 	inc = xadd(&lock->tickets, inc);
 
@@ -107,7 +107,7 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 	if (old.tickets.head != old.tickets.tail)
 		return 0;
 
-	new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
+	new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
 
 	/* cmpxchg is a full barrier, so nothing can move before it */
 	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
@@ -115,9 +115,9 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
-	__ticket_t next = lock->tickets.head + 1;
+	__ticket_t next = lock->tickets.head + TICKET_LOCK_INC;
 
-	__add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
+	__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
 	__ticket_unlock_kick(lock, next);
 }
 
@@ -132,7 +132,7 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
 	struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
-	return ((tmp.tail - tmp.head) & TICKET_MASK) > 1;
+	return ((tmp.tail - tmp.head) & TICKET_MASK) > TICKET_LOCK_INC;
 }
 #define arch_spin_is_contended	arch_spin_is_contended
 
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index dbe223d..aa9a205 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -3,7 +3,13 @@
 
 #include <linux/types.h>
 
-#if (CONFIG_NR_CPUS < 256)
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define __TICKET_LOCK_INC	2
+#else
+#define __TICKET_LOCK_INC	1
+#endif
+
+#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
 #else
@@ -11,6 +17,8 @@ typedef u16 __ticket_t;
 typedef u32 __ticketpair_t;
 #endif
 
+#define TICKET_LOCK_INC	((__ticket_t)__TICKET_LOCK_INC)
+
 #define TICKET_SHIFT	(sizeof(__ticket_t) * 8)
 #define TICKET_MASK	((__ticket_t)((1 << TICKET_SHIFT) - 1))
 



* [PATCH RFC V6 9/11]  x86/ticketlock: add slowpath logic
  2012-03-21 10:20 [PATCH RFC V6 0/11] Paravirtualized ticketlocks Raghavendra K T
                   ` (7 preceding siblings ...)
  2012-03-21 10:22 ` [PATCH RFC V6 8/11] x86/pvticketlock: when paravirtualizing ticket locks, increment by 2 Raghavendra K T
@ 2012-03-21 10:22 ` Raghavendra K T
  2012-03-21 10:22 ` [PATCH RFC V6 10/11] xen/pvticketlock: allow interrupts to be enabled while blocking Raghavendra K T
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-03-21 10:22 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

Maintain a flag in the LSB of the ticket lock tail which indicates
whether anyone is in the lock slowpath and may need kicking when
the current holder unlocks.  The flag is set when the first locker
enters the slowpath, and cleared when unlocking to an empty queue (ie,
no contention).

In the specific implementation of lock_spinning(), make sure to set
the slowpath flag on the lock just before blocking.  We must do
this before the last-chance pickup test to prevent a deadlock
with the unlocker:

Unlocker			Locker
				test for lock pickup
					-> fail
unlock
test slowpath
	-> false
				set slowpath flags
				block

Whereas this works in any ordering:

Unlocker			Locker
				set slowpath flags
				test for lock pickup
					-> fail
				block
unlock
test slowpath
	-> true, kick

If the unlocker finds that the lock has the slowpath flag set but it is
actually uncontended (ie, head == tail, so nobody is waiting), then it
clears the slowpath flag.

The unlock code uses a locked add to update the head counter.  This also
acts as a full memory barrier, so it's safe to subsequently
read back the slowpath flag, knowing that the updated lock is visible
to the other CPUs.  If it were a non-locked add, the flag read might
be satisfied while the head update is still sitting in the store buffer,
before it is visible to the other CPUs, which could result in a deadlock.

Unfortunately this means we need to do a locked instruction when
unlocking with PV ticketlocks.  However, if PV ticketlocks are not
enabled, then the old non-locked "add" is the only unlocking code.

Note: this code relies on gcc making sure that unlikely() code is kept out
of line of the fastpath, which only happens when OPTIMIZE_SIZE=n.  If it
doesn't, the generated code isn't too bad, but it's definitely suboptimal.

Thanks to Srivatsa Vaddagiri for providing a bugfix to the original
version of this change, which has been folded in.
Thanks to Stephan Diestelhorst for commenting on some code which relied
on an inaccurate reading of the x86 memory ordering rules.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stephan Diestelhorst <stephan.diestelhorst@amd.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/include/asm/paravirt.h       |    2 +-
 arch/x86/include/asm/spinlock.h       |   86 +++++++++++++++++++++++---------
 arch/x86/include/asm/spinlock_types.h |    2 +
 arch/x86/kernel/paravirt-spinlocks.c  |    3 +
 arch/x86/xen/spinlock.c               |    6 ++
 5 files changed, 74 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4343419..f2f1700 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -756,7 +756,7 @@ static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
 	PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket);
 }
 
-static __always_inline void ____ticket_unlock_kick(struct arch_spinlock *lock,
+static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
 							__ticket_t ticket)
 {
 	PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index c14263a..e98cfa8 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -1,11 +1,14 @@
 #ifndef _ASM_X86_SPINLOCK_H
 #define _ASM_X86_SPINLOCK_H
 
+#include <linux/jump_label.h>
 #include <linux/atomic.h>
 #include <asm/page.h>
 #include <asm/processor.h>
 #include <linux/compiler.h>
 #include <asm/paravirt.h>
+#include <asm/bitops.h>
+
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -40,32 +43,28 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD	(1 << 11)
 
-#ifndef CONFIG_PARAVIRT_SPINLOCKS
+extern struct jump_label_key paravirt_ticketlocks_enabled;
+static __always_inline bool static_branch(struct jump_label_key *key);
 
-static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
-							__ticket_t ticket)
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
 {
+	set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
 }
 
-static __always_inline void ____ticket_unlock_kick(struct arch_spinlock *lock,
-							 __ticket_t ticket)
+#else  /* !CONFIG_PARAVIRT_SPINLOCKS */
+static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
+							__ticket_t ticket)
 {
 }
-
-#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
-
-
-/*
- * If a spinlock has someone waiting on it, then kick the appropriate
- * waiting cpu.
- */
-static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
-							__ticket_t next)
+static inline void __ticket_unlock_kick(arch_spinlock_t *lock,
+							__ticket_t ticket)
 {
-	if (unlikely(lock->tickets.tail != next))
-		____ticket_unlock_kick(lock, next);
 }
 
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -79,20 +78,22 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
  * in the high part, because a wide xadd increment of the low part would carry
  * up and contaminate the high part.
  */
-static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
+static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 {
 	register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
 
 	inc = xadd(&lock->tickets, inc);
+	if (likely(inc.head == inc.tail))
+		goto out;
 
+	inc.tail &= ~TICKET_SLOWPATH_FLAG;
 	for (;;) {
 		unsigned count = SPIN_THRESHOLD;
 
 		do {
-			if (inc.head == inc.tail)
+			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
 				goto out;
 			cpu_relax();
-			inc.head = ACCESS_ONCE(lock->tickets.head);
 		} while (--count);
 		__ticket_lock_spinning(lock, inc.tail);
 	}
@@ -104,7 +105,7 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 	arch_spinlock_t old, new;
 
 	old.tickets = ACCESS_ONCE(lock->tickets);
-	if (old.tickets.head != old.tickets.tail)
+	if (old.tickets.head != (old.tickets.tail & ~TICKET_SLOWPATH_FLAG))
 		return 0;
 
 	new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
@@ -113,12 +114,49 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
 }
 
+static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
+					    arch_spinlock_t old)
+{
+	arch_spinlock_t new;
+
+	BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
+
+	/* Perform the unlock on the "before" copy */
+	old.tickets.head += TICKET_LOCK_INC;
+
+	/* Clear the slowpath flag */
+	new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT);
+
+	/*
+	 * If the lock is uncontended, clear the flag - use cmpxchg in
+	 * case it changes behind our back though.
+	 */
+	if (new.tickets.head != new.tickets.tail ||
+	    cmpxchg(&lock->head_tail, old.head_tail,
+					new.head_tail) != old.head_tail) {
+		/*
+		 * Lock still has someone queued for it, so wake up an
+		 * appropriate waiter.
+		 */
+		__ticket_unlock_kick(lock, old.tickets.head);
+	}
+}
+
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
-	__ticket_t next = lock->tickets.head + TICKET_LOCK_INC;
+	if (TICKET_SLOWPATH_FLAG &&
+	    unlikely(static_branch(&paravirt_ticketlocks_enabled))) {
+		arch_spinlock_t prev;
+
+		prev = *lock;
+		add_smp(&lock->tickets.head, TICKET_LOCK_INC);
+
+		/* add_smp() is a full mb() */
 
-	__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
-	__ticket_unlock_kick(lock, next);
+		if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
+			__ticket_unlock_slowpath(lock, prev);
+	} else
+		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
 }
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index aa9a205..407f7f7 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -5,8 +5,10 @@
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 #define __TICKET_LOCK_INC	2
+#define TICKET_SLOWPATH_FLAG	((__ticket_t)1)
 #else
 #define __TICKET_LOCK_INC	1
+#define TICKET_SLOWPATH_FLAG	((__ticket_t)0)
 #endif
 
 #if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 4251c1d..6ca1d33 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -4,6 +4,7 @@
  */
 #include <linux/spinlock.h>
 #include <linux/module.h>
+#include <linux/jump_label.h>
 
 #include <asm/paravirt.h>
 
@@ -15,3 +16,5 @@ struct pv_lock_ops pv_lock_ops = {
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
+struct jump_label_key paravirt_ticketlocks_enabled = JUMP_LABEL_INIT;
+EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 176a554..d4a9ec2 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -157,6 +157,10 @@ static void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 	/* Only check lock once pending cleared */
 	barrier();
 
+	/* Mark entry to slowpath before doing the pickup test to make
+	   sure we don't deadlock with an unlocker. */
+	__ticket_enter_slowpath(lock);
+
 	/* check again make sure it didn't become free while
 	   we weren't looking  */
 	if (ACCESS_ONCE(lock->tickets.head) == want) {
@@ -233,6 +237,8 @@ void __init xen_init_spinlocks(void)
 		return;
 	}
 
+	jump_label_inc(&paravirt_ticketlocks_enabled);
+
 	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
 	pv_lock_ops.unlock_kick = xen_unlock_kick;
 }
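
As an aside, the unlock-side decision added above can be modelled in a few
lines of user-space C.  This is only a hedged sketch, not the kernel code:
it drops the cmpxchg()/atomicity and the add_smp() barrier and keeps just
the arithmetic -- perform the unlock on the snapshot, and if that leaves
head equal to the ticket part of the tail the flag can simply be cleared,
otherwise the waiter holding the new head ticket must be kicked.

/* Illustrative sketch only -- mirrors the shape of
 * __ticket_unlock_slowpath(), but with no atomics. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TICKET_LOCK_INC         2u
#define TICKET_SLOWPATH_FLAG    1u

struct tickets { uint8_t head, tail; };

/* Returns true if a blocked waiter must be kicked; *next receives the
 * ticket that waiter is blocked on. */
static bool unlock_needs_kick(struct tickets old, uint8_t *next)
{
        /* Perform the unlock on the "before" copy ... */
        old.head += TICKET_LOCK_INC;
        *next = old.head;

        /* ... and check whether that leaves the lock uncontended
         * (head catches up with the ticket part of the tail). */
        return old.head != (uint8_t)(old.tail & ~TICKET_SLOWPATH_FLAG);
}

int main(void)
{
        /* Holder owns ticket 2; the tail's flag bit is set in both cases. */
        struct tickets uncontended = { 2, 4 | TICKET_SLOWPATH_FLAG };
        struct tickets contended   = { 2, 6 | TICKET_SLOWPATH_FLAG };
        uint8_t next;

        printf("uncontended: kick=%d\n",
               unlock_needs_kick(uncontended, &next));
        printf("contended:   kick=%d (kick ticket %u)\n",
               unlock_needs_kick(contended, &next), (unsigned)next);
        return 0;
}

In the patch itself the flag clear is done with a cmpxchg() on the whole
head_tail word, so a new arrival racing with the unlock makes the cmpxchg
fail and the unlocker falls back to the kick path instead of losing the
flag.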



* [PATCH RFC V6 10/11]  xen/pvticketlock: allow interrupts to be enabled while blocking
  2012-03-21 10:20 [PATCH RFC V6 0/11] Paravirtualized ticketlocks Raghavendra K T
                   ` (8 preceding siblings ...)
  2012-03-21 10:22 ` [PATCH RFC V6 9/11] x86/ticketlock: add slowpath logic Raghavendra K T
@ 2012-03-21 10:22 ` Raghavendra K T
  2012-03-21 10:22 ` [PATCH RFC V6 11/11] xen: enable PV ticketlocks on HVM Xen Raghavendra K T
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-03-21 10:22 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

If interrupts were enabled when taking the spinlock, we can leave them
enabled while blocking to get the lock.

If we can enable interrupts while waiting for the lock to become
available, and we take an interrupt before entering the poll,
and the handler takes a spinlock which ends up going into
the slow state (invalidating the per-cpu "lock" and "want" values),
then when the interrupt handler returns, the event channel will
remain pending, so the poll will return immediately and we drop
back out to the main spinlock loop.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/xen/spinlock.c |   46 ++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 40 insertions(+), 6 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d4a9ec2..4926974 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -142,7 +142,20 @@ static void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 	 * partially setup state.
 	 */
 	local_irq_save(flags);
-
+	/*
+	 * We don't really care if we're overwriting some other
+	 * (lock,want) pair, as that would mean that we're currently
+	 * in an interrupt context, and the outer context had
+	 * interrupts enabled.  That has already kicked the VCPU out
+	 * of xen_poll_irq(), so it will just return spuriously and
+	 * retry with newly setup (lock,want).
+	 *
+	 * The ordering protocol on this is that the "lock" pointer
+	 * may only be set non-NULL if the "want" ticket is correct.
+	 * If we're updating "want", we must first clear "lock".
+	 */
+	w->lock = NULL;
+	smp_wmb();
 	w->want = want;
 	smp_wmb();
 	w->lock = lock;
@@ -157,24 +170,43 @@ static void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 	/* Only check lock once pending cleared */
 	barrier();
 
-	/* Mark entry to slowpath before doing the pickup test to make
-	   sure we don't deadlock with an unlocker. */
+	/*
+	 * Mark entry to slowpath before doing the pickup test to make
+	 * sure we don't deadlock with an unlocker.
+	 */
 	__ticket_enter_slowpath(lock);
 
-	/* check again make sure it didn't become free while
-	   we weren't looking  */
+	/*
+	 * check again make sure it didn't become free while
+	 * we weren't looking
+	 */
 	if (ACCESS_ONCE(lock->tickets.head) == want) {
 		add_stats(TAKEN_SLOW_PICKUP, 1);
 		goto out;
 	}
+
+	/* Allow interrupts while blocked */
+	local_irq_restore(flags);
+
+	/*
+	 * If an interrupt happens here, it will leave the wakeup irq
+	 * pending, which will cause xen_poll_irq() to return
+	 * immediately.
+	 */
+
 	/* Block until irq becomes pending (or perhaps a spurious wakeup) */
 	xen_poll_irq(irq);
 	add_stats(TAKEN_SLOW_SPURIOUS, !xen_test_irq_pending(irq));
+
+	local_irq_save(flags);
+
 	kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
 out:
 	cpumask_clear_cpu(cpu, &waiting_cpus);
 	w->lock = NULL;
+
 	local_irq_restore(flags);
+
 	spin_time_accum_blocked(start);
 }
 PV_CALLEE_SAVE_REGS_THUNK(xen_lock_spinning);
@@ -188,7 +220,9 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
 	for_each_cpu(cpu, &waiting_cpus) {
 		const struct xen_lock_waiting *w = &per_cpu(lock_waiting, cpu);
 
-		if (w->lock == lock && w->want == next) {
+		/* Make sure we read lock before want */
+		if (ACCESS_ONCE(w->lock) == lock &&
+		    ACCESS_ONCE(w->want) == next) {
 			add_stats(RELEASED_SLOW_KICKED, 1);
 			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
 			break;
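
The ordering comment above ("lock" may only be set non-NULL while "want"
is correct, and the unlocker must read lock before want) can be sketched
with C11 atomics.  This is an analogy, not the kernel's code: release
stores stand in for the smp_wmb()s, an acquire load for the read ordering,
and all names are local to the sketch.

/* Illustrative analogy of the per-cpu (lock, want) publication protocol. */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct lock_waiting {
        _Atomic(void *) lock;   /* spinlock this CPU is blocked on, or NULL */
        _Atomic uint16_t want;  /* the ticket it is waiting for */
};

/* Waiter side: clear lock first, update want, then republish lock, so
 * lock is only ever non-NULL while want is valid for it. */
static void publish(struct lock_waiting *w, void *lock, uint16_t want)
{
        atomic_store_explicit(&w->lock, NULL, memory_order_release);
        atomic_store_explicit(&w->want, want, memory_order_release);
        atomic_store_explicit(&w->lock, lock, memory_order_release);
}

/* Unlocker side: read lock before want; acting only when both match
 * means a half-updated pair is never mistaken for a real waiter. */
static int waiter_matches(struct lock_waiting *w, void *lock, uint16_t next)
{
        void *l = atomic_load_explicit(&w->lock, memory_order_acquire);
        uint16_t t = atomic_load_explicit(&w->want, memory_order_relaxed);

        return l == lock && t == next;
}

int main(void)
{
        static struct lock_waiting w;
        int dummy_lock;                 /* stands in for an arch_spinlock_t */

        publish(&w, &dummy_lock, 2);
        printf("match: %d\n", waiter_matches(&w, &dummy_lock, 2));
        return 0;
}

The point is the one made in the comment: an unlocker that observes the
published lock pointer is guaranteed (by the release/acquire pairing here,
by smp_wmb() and the read order in the patch) not to see a want value older
than the one published with it, so it should never kick a CPU based on a
stale ticket.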



* [PATCH RFC V6 11/11]  xen: enable PV ticketlocks on HVM Xen
  2012-03-21 10:20 [PATCH RFC V6 0/11] Paravirtualized ticketlocks Raghavendra K T
                   ` (9 preceding siblings ...)
  2012-03-21 10:22 ` [PATCH RFC V6 10/11] xen/pvticketlock: allow interrupts to be enabled while blocking Raghavendra K T
@ 2012-03-21 10:22 ` Raghavendra K T
  2012-03-26 14:25 ` [PATCH RFC V6 0/11] Paravirtualized ticketlocks Avi Kivity
  2012-03-30 20:26 ` H. Peter Anvin
  12 siblings, 0 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-03-21 10:22 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

From: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/xen/smp.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index e85d3ee..8dc2574 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -568,4 +568,5 @@ void __init xen_hvm_smp_init(void)
 	smp_ops.cpu_die = xen_hvm_cpu_die;
 	smp_ops.send_call_func_ipi = xen_smp_send_call_function_ipi;
 	smp_ops.send_call_func_single_ipi = xen_smp_send_call_function_single_ipi;
+	xen_init_spinlocks();
 }



* Re: [PATCH RFC V6 1/11]  x86/spinlock: replace pv spinlocks with pv ticketlocks
  2012-03-21 10:20 ` [PATCH RFC V6 1/11] x86/spinlock: replace pv spinlocks with pv ticketlocks Raghavendra K T
@ 2012-03-21 13:04   ` Attilio Rao
  2012-03-21 13:22     ` Stephan Diestelhorst
  0 siblings, 1 reply; 55+ messages in thread
From: Attilio Rao @ 2012-03-21 13:04 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Avi Kivity, Marcelo Tosatti, KVM,
	Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini

On 21/03/12 10:20, Raghavendra K T wrote:
> From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
>
> Rather than outright replacing the entire spinlock implementation in
> order to paravirtualize it, keep the ticket lock implementation but add
> a couple of pvops hooks on the slow patch (long spin on lock, unlocking
> a contended lock).
>
> Ticket locks have a number of nice properties, but they also have some
> surprising behaviours in virtual environments.  They enforce a strict
> FIFO ordering on cpus trying to take a lock; however, if the hypervisor
> scheduler does not schedule the cpus in the correct order, the system can
> waste a huge amount of time spinning until the next cpu can take the lock.
>
> (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
> http://www.xen.org/files/xensummitboston08/LHP.pdf  for more details.)
>
> To address this, we add two hooks:
>   - __ticket_spin_lock which is called after the cpu has been
>     spinning on the lock for a significant number of iterations but has
>     failed to take the lock (presumably because the cpu holding the lock
>     has been descheduled).  The lock_spinning pvop is expected to block
>     the cpu until it has been kicked by the current lock holder.
>   - __ticket_spin_unlock, which on releasing a contended lock
>     (there are more cpus with tail tickets), it looks to see if the next
>     cpu is blocked and wakes it if so.
>
> When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
> functions causes all the extra code to go away.
>    

I've run some real-world benchmarks based on this series of patches 
applied on top of a vanilla Linux-3.3-rc6 (commit 
4704fe65e55fb088fbcb1dc0b15ff7cc8bff3685), with both 
CONFIG_PARAVIRT_SPINLOCK=y and n, which essentially means 4 versions 
compared:
* vanilla - CONFIG_PARAVIRT_SPINLOCK - patch
* vanilla + CONFIG_PARAVIRT_SPINLOCK - patch
* vanilla - CONFIG_PARAVIRT_SPINLOCK + patch
* vanilla + CONFIG_PARAVIRT_SPINLOCK + patch

(you can check out the monolithic kernel configuration I used, and 
verify the sole difference, here):
http://xenbits.xen.org/people/attilio/jeremy-spinlock/kernel-configs/

Tests, information and results are summarized below.

== System used information:
* Machine is a XEON x3450, 2.6GHz, 8-ways system:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/dmesg
* System version, a Debian Squeeze 6.0.4:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/debian-version
* gcc version, 4.4.5:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/gcc-version

== Tests performed
* pgbench based on PostgreSQL 9.2 (development version) as it has a lot 
of scalability improvements in it:
http://www.postgresql.org/docs/devel/static/install-getsource.html

I used a stock installation, with only this simple configuration change:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/postsgresql.conf.patch

For collecting data I used this simple script, which runs the test 10 
times for each thread count (from 1 to 64). Please 
note that the first 8 runs serve to cache all the data in memory in order to 
avoid subsequent I/O, so they are discarded from sampling and calculation:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/pgbench_script

Here is the raw data (please remember this is tps, thus the higher the 
better):
http://xenbits.xen.org/people/attilio/jeremy-spinlock/pgbench-crude-datas/

And here is the data charted with the ministat tool, comparing all 4 
kernel configurations for every thread configuration:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/pgbench-9.2-total.bench

As you can see, the patch doesn't really show a statistically meaningful 
difference for this workload, excluding the single-thread run for the 
patched + CONFIG_PARAVIRT_SPINLOCK=y case, which seems 5% faster.


* pbzip2, which is a parallel version of bzip2, meant to represent a 
CPU-intensive, multithreaded application.
The file chosen for compression is 1GB in size, generated from /dev/urandom 
(this is not published but I may still have it, so if you need it for more 
tests please just ask), and all the I/O is done on a tmpfs volume in 
order to avoid I/O variability effects.

For collecting data I used this simple script, which runs the test 10 
times for each thread count (from 1 to 64):
http://xenbits.xen.org/people/attilio/jeremy-spinlock/pbzip2bench_script

Here is the raw data (please remember this is time(1) output, thus the 
lower the better):
http://xenbits.xen.org/people/attilio/jeremy-spinlock/pbzip2-crude-datas/

And here is the data charted with the ministat tool, comparing all 4 
kernel configurations for every thread configuration:
http://xenbits.xen.org/people/attilio/jeremy-spinlock/pbzip2-1.1.1-total.bench

As you can see, the patch doesn't really show a statistically meaningful 
difference for this workload.


* kernbench-0.50 run, doing I/O on a 10GB tmpfs volume (thus no actual 
disk I/O involved), with the following invocation:
./kernbench -n10 -s -c16 -M -f

(I had to do that because kernbench wasn't picking a good maximum thread 
count at all, so I disabled the default maximum and forced 16 threads).

Here is the raw data (please remember this is time(1) output, thus the 
lower the better):
http://xenbits.xen.org/people/attilio/jeremy-spinlock/kernbench-crude-datas/

Please note that kernbench already calculates the standard deviation for these. 
However, I also wanted a ministat summary in order to quickly display any 
possible difference, so I just replicated every value 3 times (the 
minimum ministat requires) and charted them:

Again, there doesn't seem to be any statistically meaningful difference.

== Results
These tests point in the direction that Jeremy's rebased patches don't 
introduce a performance penalty at all, but also that we could likely 
consider removing the CONFIG_PARAVIRT_SPINLOCK option, or turning it on by 
default and suggesting it be disabled only on very old CPUs (assuming a 
performance regression can be proven there).

If you have questions please let me know.

Thanks,
Attilio


* Re: [PATCH RFC V6 1/11]  x86/spinlock: replace pv spinlocks with pv ticketlocks
  2012-03-21 13:04   ` Attilio Rao
@ 2012-03-21 13:22     ` Stephan Diestelhorst
  2012-03-21 13:49       ` Attilio Rao
  0 siblings, 1 reply; 55+ messages in thread
From: Stephan Diestelhorst @ 2012-03-21 13:22 UTC (permalink / raw)
  To: Attilio Rao
  Cc: Raghavendra K T, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Avi Kivity,
	Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Srivatsa Vaddagiri, Stefano Stabellini

On Wednesday 21 March 2012, 13:04:25 Attilio Rao wrote:
> On 21/03/12 10:20, Raghavendra K T wrote:
> > From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> >
> > Rather than outright replacing the entire spinlock implementation in
> > order to paravirtualize it, keep the ticket lock implementation but add
> > a couple of pvops hooks on the slow patch (long spin on lock, unlocking
> > a contended lock).
> >
> > Ticket locks have a number of nice properties, but they also have some
> > surprising behaviours in virtual environments.  They enforce a strict
> > FIFO ordering on cpus trying to take a lock; however, if the hypervisor
> > scheduler does not schedule the cpus in the correct order, the system can
> > waste a huge amount of time spinning until the next cpu can take the lock.
> >
> > (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
> > http://www.xen.org/files/xensummitboston08/LHP.pdf  for more details.)
> >
> > To address this, we add two hooks:
> >   - __ticket_spin_lock which is called after the cpu has been
> >     spinning on the lock for a significant number of iterations but has
> >     failed to take the lock (presumably because the cpu holding the lock
> >     has been descheduled).  The lock_spinning pvop is expected to block
> >     the cpu until it has been kicked by the current lock holder.
> >   - __ticket_spin_unlock, which on releasing a contended lock
> >     (there are more cpus with tail tickets), it looks to see if the next
> >     cpu is blocked and wakes it if so.
> >
> > When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
> > functions causes all the extra code to go away.
> >    
> 
> I've made some real world benchmarks based on this serie of patches 
> applied on top of a vanilla Linux-3.3-rc6 (commit 
> 4704fe65e55fb088fbcb1dc0b15ff7cc8bff3685), with both 
> CONFIG_PARAVIRT_SPINLOCK=y and n, which means essentially 4 versions 
> compared:
> * vanilla - CONFIG_PARAVIRT_SPINLOCK - patch
> * vanilla + CONFIG_PARAVIRT_SPINLOCK - patch
> * vanilla - CONFIG_PARAVIRT_SPINLOCK + patch
> * vanilla + CONFIG_PARAVIRT_SPINLOCK + patch
> 
[...]
> == Results
> This test points in the direction that Jeremy's rebased patches don't 
> introduce a peformance penalty at all, but also that we could likely 
> consider CONFIG_PARAVIRT_SPINLOCK option removal, or turn it on by 
> default and suggest disabling just on very old CPUs (assuming a 
> performance regression can be proven there).

Very interesting results, in particular knowing that in the one guest
case things do not get (significantly) slower due to the added logic
and LOCKed RMW in the unlock path.

AFAICR, the problem really became apparent when running multiple guests
time-sharing the physical CPUs, i.e., two guests with eight vCPUs each
on an eight-core machine.  Did you look at this setup with your tests?

Thanks,
  Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst@amd.com, Tel. +49 (0)351 448 356 719  (AMD: 29-719)




* Re: [PATCH RFC V6 1/11]  x86/spinlock: replace pv spinlocks with pv ticketlocks
  2012-03-21 13:22     ` Stephan Diestelhorst
@ 2012-03-21 13:49       ` Attilio Rao
  2012-03-21 14:25         ` Stephan Diestelhorst
  0 siblings, 1 reply; 55+ messages in thread
From: Attilio Rao @ 2012-03-21 13:49 UTC (permalink / raw)
  To: Stephan Diestelhorst
  Cc: Raghavendra K T, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Avi Kivity,
	Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Srivatsa Vaddagiri, Stefano Stabellini

On 21/03/12 13:22, Stephan Diestelhorst wrote:
> On Wednesday 21 March 2012, 13:04:25 Attilio Rao wrote:
>    
>> On 21/03/12 10:20, Raghavendra K T wrote:
>>      
>>> From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
>>>
>>> Rather than outright replacing the entire spinlock implementation in
>>> order to paravirtualize it, keep the ticket lock implementation but add
>>> a couple of pvops hooks on the slow patch (long spin on lock, unlocking
>>> a contended lock).
>>>
>>> Ticket locks have a number of nice properties, but they also have some
>>> surprising behaviours in virtual environments.  They enforce a strict
>>> FIFO ordering on cpus trying to take a lock; however, if the hypervisor
>>> scheduler does not schedule the cpus in the correct order, the system can
>>> waste a huge amount of time spinning until the next cpu can take the lock.
>>>
>>> (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
>>> http://www.xen.org/files/xensummitboston08/LHP.pdf  for more details.)
>>>
>>> To address this, we add two hooks:
>>>    - __ticket_spin_lock which is called after the cpu has been
>>>      spinning on the lock for a significant number of iterations but has
>>>      failed to take the lock (presumably because the cpu holding the lock
>>>      has been descheduled).  The lock_spinning pvop is expected to block
>>>      the cpu until it has been kicked by the current lock holder.
>>>    - __ticket_spin_unlock, which on releasing a contended lock
>>>      (there are more cpus with tail tickets), it looks to see if the next
>>>      cpu is blocked and wakes it if so.
>>>
>>> When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
>>> functions causes all the extra code to go away.
>>>
>>>        
>> I've made some real world benchmarks based on this serie of patches
>> applied on top of a vanilla Linux-3.3-rc6 (commit
>> 4704fe65e55fb088fbcb1dc0b15ff7cc8bff3685), with both
>> CONFIG_PARAVIRT_SPINLOCK=y and n, which means essentially 4 versions
>> compared:
>> * vanilla - CONFIG_PARAVIRT_SPINLOCK - patch
>> * vanilla + CONFIG_PARAVIRT_SPINLOCK - patch
>> * vanilla - CONFIG_PARAVIRT_SPINLOCK + patch
>> * vanilla + CONFIG_PARAVIRT_SPINLOCK + patch
>>
>>      
> [...]
>    
>> == Results
>> This test points in the direction that Jeremy's rebased patches don't
>> introduce a peformance penalty at all, but also that we could likely
>> consider CONFIG_PARAVIRT_SPINLOCK option removal, or turn it on by
>> default and suggest disabling just on very old CPUs (assuming a
>> performance regression can be proven there).
>>      
> Very interesting results, in particular knowing that in the one guest
> case things do not get (significantly) slower due to the added logic
> and LOCKed RMW in the unlock path.
>
> AFAICR, the problem really became apparent when running multiple guests
> time sharing the physical CPUs, i.e., two guests with eight vCPUs each
> on an eight core machine.  Did you look at this setup with your tests?
>
>    

Please note that my tests were made on native Linux, without XEN involvement.

Did you maybe mean that the spinlock paravirtualization becomes generally 
useful in the case you mentioned (2 guests, 8 vCPUs + 8 vCPUs)?

Attilio


* Re: [PATCH RFC V6 1/11]  x86/spinlock: replace pv spinlocks with pv ticketlocks
  2012-03-21 13:49       ` Attilio Rao
@ 2012-03-21 14:25         ` Stephan Diestelhorst
  2012-03-21 14:33           ` Attilio Rao
  0 siblings, 1 reply; 55+ messages in thread
From: Stephan Diestelhorst @ 2012-03-21 14:25 UTC (permalink / raw)
  To: Attilio Rao
  Cc: Raghavendra K T, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Avi Kivity,
	Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Srivatsa Vaddagiri, Stefano Stabellini

On Wednesday 21 March 2012, 13:49:28 Attilio Rao wrote:
> On 21/03/12 13:22, Stephan Diestelhorst wrote:
> > On Wednesday 21 March 2012, 13:04:25 Attilio Rao wrote:
> >    
> >> On 21/03/12 10:20, Raghavendra K T wrote:
> >>      
> >>> From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> >>>
> >>> Rather than outright replacing the entire spinlock implementation in
> >>> order to paravirtualize it, keep the ticket lock implementation but add
> >>> a couple of pvops hooks on the slow patch (long spin on lock, unlocking
> >>> a contended lock).
> >>>
> >>> Ticket locks have a number of nice properties, but they also have some
> >>> surprising behaviours in virtual environments.  They enforce a strict
> >>> FIFO ordering on cpus trying to take a lock; however, if the hypervisor
> >>> scheduler does not schedule the cpus in the correct order, the system can
> >>> waste a huge amount of time spinning until the next cpu can take the lock.
> >>>
> >>> (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
> >>> http://www.xen.org/files/xensummitboston08/LHP.pdf  for more details.)
> >>>
> >>> To address this, we add two hooks:
> >>>    - __ticket_spin_lock which is called after the cpu has been
> >>>      spinning on the lock for a significant number of iterations but has
> >>>      failed to take the lock (presumably because the cpu holding the lock
> >>>      has been descheduled).  The lock_spinning pvop is expected to block
> >>>      the cpu until it has been kicked by the current lock holder.
> >>>    - __ticket_spin_unlock, which on releasing a contended lock
> >>>      (there are more cpus with tail tickets), it looks to see if the next
> >>>      cpu is blocked and wakes it if so.
> >>>
> >>> When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
> >>> functions causes all the extra code to go away.
> >>>
> >>>        
> >> I've made some real world benchmarks based on this serie of patches
> >> applied on top of a vanilla Linux-3.3-rc6 (commit
> >> 4704fe65e55fb088fbcb1dc0b15ff7cc8bff3685), with both
> >> CONFIG_PARAVIRT_SPINLOCK=y and n, which means essentially 4 versions
> >> compared:
> >> * vanilla - CONFIG_PARAVIRT_SPINLOCK - patch
> >> * vanilla + CONFIG_PARAVIRT_SPINLOCK - patch
> >> * vanilla - CONFIG_PARAVIRT_SPINLOCK + patch
> >> * vanilla + CONFIG_PARAVIRT_SPINLOCK + patch
> >>
> >>      
> > [...]
> >    
> >> == Results
> >> This test points in the direction that Jeremy's rebased patches don't
> >> introduce a peformance penalty at all, but also that we could likely
> >> consider CONFIG_PARAVIRT_SPINLOCK option removal, or turn it on by
> >> default and suggest disabling just on very old CPUs (assuming a
> >> performance regression can be proven there).
> >>      
> > Very interesting results, in particular knowing that in the one guest
> > case things do not get (significantly) slower due to the added logic
> > and LOCKed RMW in the unlock path.
> >
> > AFAICR, the problem really became apparent when running multiple guests
> > time sharing the physical CPUs, i.e., two guests with eight vCPUs each
> > on an eight core machine.  Did you look at this setup with your tests?
> >
> >    
> 
> Please note that my tests are made on native Linux, without XEN involvement.
> 
> You maybe meant that the spinlock paravirtualization became generally 
> useful in the case you mentioned? (2 guests, 8vpcu + 8vcpu)?

Yes, that is what I meant.  I just wanted to clarify why you do not see any
speed-ups, in case you were wondering.  If the whole point of the exercise
was to see that there are no performance regressions, fine.  In that
case I misunderstood.

Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst@amd.com
Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany

Geschaeftsfuehrer: Alberto Bozzo
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551




* Re: [PATCH RFC V6 1/11]  x86/spinlock: replace pv spinlocks with pv ticketlocks
  2012-03-21 14:25         ` Stephan Diestelhorst
@ 2012-03-21 14:33           ` Attilio Rao
  2012-03-21 14:49             ` Raghavendra K T
  0 siblings, 1 reply; 55+ messages in thread
From: Attilio Rao @ 2012-03-21 14:33 UTC (permalink / raw)
  To: Stephan Diestelhorst
  Cc: Raghavendra K T, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Avi Kivity,
	Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Srivatsa Vaddagiri, Stefano Stabellini

On 21/03/12 14:25, Stephan Diestelhorst wrote:
> On Wednesday 21 March 2012, 13:49:28 Attilio Rao wrote:
>    
>> On 21/03/12 13:22, Stephan Diestelhorst wrote:
>>      
>>> On Wednesday 21 March 2012, 13:04:25 Attilio Rao wrote:
>>>
>>>        
>>>> On 21/03/12 10:20, Raghavendra K T wrote:
>>>>
>>>>          
>>>>> From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
>>>>>
>>>>> Rather than outright replacing the entire spinlock implementation in
>>>>> order to paravirtualize it, keep the ticket lock implementation but add
>>>>> a couple of pvops hooks on the slow patch (long spin on lock, unlocking
>>>>> a contended lock).
>>>>>
>>>>> Ticket locks have a number of nice properties, but they also have some
>>>>> surprising behaviours in virtual environments.  They enforce a strict
>>>>> FIFO ordering on cpus trying to take a lock; however, if the hypervisor
>>>>> scheduler does not schedule the cpus in the correct order, the system can
>>>>> waste a huge amount of time spinning until the next cpu can take the lock.
>>>>>
>>>>> (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
>>>>> http://www.xen.org/files/xensummitboston08/LHP.pdf  for more details.)
>>>>>
>>>>> To address this, we add two hooks:
>>>>>     - __ticket_spin_lock which is called after the cpu has been
>>>>>       spinning on the lock for a significant number of iterations but has
>>>>>       failed to take the lock (presumably because the cpu holding the lock
>>>>>       has been descheduled).  The lock_spinning pvop is expected to block
>>>>>       the cpu until it has been kicked by the current lock holder.
>>>>>     - __ticket_spin_unlock, which on releasing a contended lock
>>>>>       (there are more cpus with tail tickets), it looks to see if the next
>>>>>       cpu is blocked and wakes it if so.
>>>>>
>>>>> When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
>>>>> functions causes all the extra code to go away.
>>>>>
>>>>>
>>>>>            
>>>> I've made some real world benchmarks based on this serie of patches
>>>> applied on top of a vanilla Linux-3.3-rc6 (commit
>>>> 4704fe65e55fb088fbcb1dc0b15ff7cc8bff3685), with both
>>>> CONFIG_PARAVIRT_SPINLOCK=y and n, which means essentially 4 versions
>>>> compared:
>>>> * vanilla - CONFIG_PARAVIRT_SPINLOCK - patch
>>>> * vanilla + CONFIG_PARAVIRT_SPINLOCK - patch
>>>> * vanilla - CONFIG_PARAVIRT_SPINLOCK + patch
>>>> * vanilla + CONFIG_PARAVIRT_SPINLOCK + patch
>>>>
>>>>
>>>>          
>>> [...]
>>>
>>>        
>>>> == Results
>>>> This test points in the direction that Jeremy's rebased patches don't
>>>> introduce a peformance penalty at all, but also that we could likely
>>>> consider CONFIG_PARAVIRT_SPINLOCK option removal, or turn it on by
>>>> default and suggest disabling just on very old CPUs (assuming a
>>>> performance regression can be proven there).
>>>>
>>>>          
>>> Very interesting results, in particular knowing that in the one guest
>>> case things do not get (significantly) slower due to the added logic
>>> and LOCKed RMW in the unlock path.
>>>
>>> AFAICR, the problem really became apparent when running multiple guests
>>> time sharing the physical CPUs, i.e., two guests with eight vCPUs each
>>> on an eight core machine.  Did you look at this setup with your tests?
>>>
>>>
>>>        
>> Please note that my tests are made on native Linux, without XEN involvement.
>>
>> You maybe meant that the spinlock paravirtualization became generally
>> useful in the case you mentioned? (2 guests, 8vpcu + 8vcpu)?
>>      
> Yes, that is what I meant.  Just to clarify why you do not see any
> speed-ups, and were wondering why.  If the whole point of the exercise
> was to see that there are no perforamnce regressions, fine.  In that
> case I misunderstood.
>    

Yes, that's right, I just wanted to measure (possible) overhead in 
native Linux and the cost of leaving CONFIG_PARAVIRT_SPINLOCK on.

Thanks,
Attilio


* Re: [PATCH RFC V6 1/11]  x86/spinlock: replace pv spinlocks with pv ticketlocks
  2012-03-21 14:33           ` Attilio Rao
@ 2012-03-21 14:49             ` Raghavendra K T
  0 siblings, 0 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-03-21 14:49 UTC (permalink / raw)
  To: Stephan Diestelhorst
  Cc: Attilio Rao, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Avi Kivity,
	Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Srivatsa Vaddagiri, Stefano Stabellini

On 03/21/2012 08:03 PM, Attilio Rao wrote:
> On 21/03/12 14:25, Stephan Diestelhorst wrote:
>> On Wednesday 21 March 2012, 13:49:28 Attilio Rao wrote:
>>> On 21/03/12 13:22, Stephan Diestelhorst wrote:
>>>> On Wednesday 21 March 2012, 13:04:25 Attilio Rao wrote:
>>>>
>>>>> On 21/03/12 10:20, Raghavendra K T wrote:
>>>>>
>>>>>> From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
>>>>>>
>>>>>> Rather than outright replacing the entire spinlock implementation in
>>>>>> order to paravirtualize it, keep the ticket lock implementation
>>>>>> but add
>>>>>> a couple of pvops hooks on the slow patch (long spin on lock,
>>>>>> unlocking
>>>>>> a contended lock).
>>>>>>
>>>>>> Ticket locks have a number of nice properties, but they also have
>>>>>> some
>>>>>> surprising behaviours in virtual environments. They enforce a strict
>>>>>> FIFO ordering on cpus trying to take a lock; however, if the
>>>>>> hypervisor
>>>>>> scheduler does not schedule the cpus in the correct order, the
>>>>>> system can
>>>>>> waste a huge amount of time spinning until the next cpu can take
>>>>>> the lock.
>>>>>>
>>>>>> (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
>>>>>> http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)
>>>>>>
>>>>>> To address this, we add two hooks:
>>>>>> - __ticket_spin_lock which is called after the cpu has been
>>>>>> spinning on the lock for a significant number of iterations but has
>>>>>> failed to take the lock (presumably because the cpu holding the lock
>>>>>> has been descheduled). The lock_spinning pvop is expected to block
>>>>>> the cpu until it has been kicked by the current lock holder.
>>>>>> - __ticket_spin_unlock, which on releasing a contended lock
>>>>>> (there are more cpus with tail tickets), it looks to see if the next
>>>>>> cpu is blocked and wakes it if so.
>>>>>>
>>>>>> When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
>>>>>> functions causes all the extra code to go away.
>>>>>>
>>>>>>
>>>>> I've made some real world benchmarks based on this serie of patches
>>>>> applied on top of a vanilla Linux-3.3-rc6 (commit
>>>>> 4704fe65e55fb088fbcb1dc0b15ff7cc8bff3685), with both
>>>>> CONFIG_PARAVIRT_SPINLOCK=y and n, which means essentially 4 versions
>>>>> compared:
>>>>> * vanilla - CONFIG_PARAVIRT_SPINLOCK - patch
>>>>> * vanilla + CONFIG_PARAVIRT_SPINLOCK - patch
>>>>> * vanilla - CONFIG_PARAVIRT_SPINLOCK + patch
>>>>> * vanilla + CONFIG_PARAVIRT_SPINLOCK + patch
>>>>>
>>>>>
>>>> [...]
>>>>
>>>>> == Results
>>>>> This test points in the direction that Jeremy's rebased patches don't
>>>>> introduce a peformance penalty at all, but also that we could likely
>>>>> consider CONFIG_PARAVIRT_SPINLOCK option removal, or turn it on by
>>>>> default and suggest disabling just on very old CPUs (assuming a
>>>>> performance regression can be proven there).
>>>>>
>>>> Very interesting results, in particular knowing that in the one guest
>>>> case things do not get (significantly) slower due to the added logic
>>>> and LOCKed RMW in the unlock path.
>>>>
>>>> AFAICR, the problem really became apparent when running multiple guests
>>>> time sharing the physical CPUs, i.e., two guests with eight vCPUs each
>>>> on an eight core machine. Did you look at this setup with your tests?
>>>>
>>>>
>>> Please note that my tests are made on native Linux, without XEN
>>> involvement.
>>>
>>> You maybe meant that the spinlock paravirtualization became generally
>>> useful in the case you mentioned? (2 guests, 8vpcu + 8vcpu)?
>> Yes, that is what I meant. Just to clarify why you do not see any
>> speed-ups, and were wondering why. If the whole point of the exercise
>> was to see that there are no perforamnce regressions, fine. In that
>> case I misunderstood.
>
> Yes, that's right, I just wanted to measure (possible) overhead in
> native Linux and the cost of leaving CONFIG_PARAVIRT_SPINLOCK on.

True.  My results, too, revolved only around native overhead.

Until now the main concern in the community was native overhead. So this time 
we have results that prove CONFIG_PARAVIRT_SPINLOCK is now on par 
with the corresponding vanilla kernel because of the ticketlock improvements.

Coming to the guest scenario, I intend to post the KVM counterpart of the 
patches with results where we see a huge improvement (around 90%) in the 
contention scenario and almost zero overhead in the normal case.



>



* Re: [PATCH RFC V6 2/11] x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks
  2012-03-21 10:21 ` [PATCH RFC V6 2/11] x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks Raghavendra K T
@ 2012-03-21 17:13   ` Linus Torvalds
  2012-03-22 10:06     ` Raghavendra K T
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2012-03-21 17:13 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Avi Kivity, Marcelo Tosatti, KVM,
	Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

On Wed, Mar 21, 2012 at 3:21 AM, Raghavendra K T
<raghavendra.kt@linux.vnet.ibm.com> wrote:
> From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
>
> The code size expands somewhat, and its probably better to just call
> a function rather than inline it.
>
> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  arch/x86/Kconfig     |    3 +++
>  kernel/Kconfig.locks |    2 +-
>  2 files changed, 4 insertions(+), 1 deletions(-)
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5bed94e..10c28ec 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -623,6 +623,9 @@ config PARAVIRT_SPINLOCKS
>
>          If you are unsure how to answer this question, answer N.
>
> +config ARCH_NOINLINE_SPIN_UNLOCK
> +       def_bool PARAVIRT_SPINLOCKS
> +
>  config PARAVIRT_CLOCK
>        bool
>
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index 5068e2a..584637b 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -125,7 +125,7 @@ config INLINE_SPIN_LOCK_IRQSAVE
>                 ARCH_INLINE_SPIN_LOCK_IRQSAVE
>
>  config INLINE_SPIN_UNLOCK
> -       def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK)
> +       def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK) && !ARCH_NOINLINE_SPIN_UNLOCK
>
>  config INLINE_SPIN_UNLOCK_BH
>        def_bool !DEBUG_SPINLOCK && ARCH_INLINE_SPIN_UNLOCK_BH

Ugh. This is getting really ugly.

Can we just fix it by
 - getting rid of INLINE_SPIN_UNLOCK entirely

 - replacing it with UNINLINE_SPIN_UNLOCK instead with the reverse
meaning, and no "def_bool" at all, just a simple

        config UNINLINE_SPIN_UNLOCK
                bool

 - make the various people who want to uninline the spinlocks (like
spinlock debugging, paravirt etc) all just do

        select UNINLINE_SPIN_UNLOCK

because quite frankly, the whole spinunlock inlining logic is
*already* unreadable, and you just made it worse.

                      Linus


* Re: [PATCH RFC V6 2/11] x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks
  2012-03-21 17:13   ` Linus Torvalds
@ 2012-03-22 10:06     ` Raghavendra K T
  0 siblings, 0 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-03-22 10:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Avi Kivity, Marcelo Tosatti, KVM,
	Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

On 03/21/2012 10:43 PM, Linus Torvalds wrote:
> On Wed, Mar 21, 2012 at 3:21 AM, Raghavendra K T
> <raghavendra.kt@linux.vnet.ibm.com>  wrote:
>> From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
>>
>> The code size expands somewhat, and its probably better to just call
>> a function rather than inline it.
>>
>> Signed-off-by: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
>> Signed-off-by: Raghavendra K T<raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>   arch/x86/Kconfig     |    3 +++
>>   kernel/Kconfig.locks |    2 +-
>>   2 files changed, 4 insertions(+), 1 deletions(-)
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 5bed94e..10c28ec 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -623,6 +623,9 @@ config PARAVIRT_SPINLOCKS
>>
>>           If you are unsure how to answer this question, answer N.
>>
>> +config ARCH_NOINLINE_SPIN_UNLOCK
>> +       def_bool PARAVIRT_SPINLOCKS
>> +
>>   config PARAVIRT_CLOCK
>>         bool
>>
>> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
>> index 5068e2a..584637b 100644
>> --- a/kernel/Kconfig.locks
>> +++ b/kernel/Kconfig.locks
>> @@ -125,7 +125,7 @@ config INLINE_SPIN_LOCK_IRQSAVE
>>                  ARCH_INLINE_SPIN_LOCK_IRQSAVE
>>
>>   config INLINE_SPIN_UNLOCK
>> -       def_bool !DEBUG_SPINLOCK&&  (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK)
>> +       def_bool !DEBUG_SPINLOCK&&  (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK)&&  !ARCH_NOINLINE_SPIN_UNLOCK
>>
>>   config INLINE_SPIN_UNLOCK_BH
>>         def_bool !DEBUG_SPINLOCK&&  ARCH_INLINE_SPIN_UNLOCK_BH
>
> Ugh. This is getting really ugly.
>

Agreed that it has become longer.

> Can we just fix it by
>   - getting rid of INLINE_SPIN_UNLOCK entirely
>
>   - replacing it with UNINLINE_SPIN_UNLOCK instead with the reverse
> meaning, and no "def_bool" at all, just a simple
>
>          config UNINLINE_SPIN_UNLOCK
>                  bool
>
>   - make the various people who want to uninline the spinlocks (like
> spinlock debugging, paravirt etc) all just do
>
>          select UNINLINE_SPIN_UNLOCK

I just posted https://lkml.org/lkml/2012/3/22/94. Please let me know
if that looks better.
This patch should now become something like:
---
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e..2666b7d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -613,6 +613,7 @@ config PARAVIRT
  config PARAVIRT_SPINLOCKS
	 bool "Paravirtualization layer for spinlocks"
	 depends on PARAVIRT && SMP && EXPERIMENTAL
+	 select UNINLINE_SPIN_UNLOCK
	 ---help---
	   Paravirtualized spinlocks allow a pvops backend to replace the
	   spinlock implementation with something virtualization-friendly



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-21 10:20 [PATCH RFC V6 0/11] Paravirtualized ticketlocks Raghavendra K T
                   ` (10 preceding siblings ...)
  2012-03-21 10:22 ` [PATCH RFC V6 11/11] xen: enable PV ticketlocks on HVM Xen Raghavendra K T
@ 2012-03-26 14:25 ` Avi Kivity
  2012-03-27  7:37   ` Raghavendra K T
  2012-03-30 20:26 ` H. Peter Anvin
  12 siblings, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2012-03-26 14:25 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Marcelo Tosatti, KVM, Andi Kleen,
	Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

On 03/21/2012 12:20 PM, Raghavendra K T wrote:
> From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
>
> Changes since last posting: (Raghavendra K T)
> [
>  - Rebased to linux-3.3-rc6.
>  - used function+enum in place of macro (better type checking) 
>  - use cmpxchg while resetting zero status for possible race
> 	[suggested by Dave Hansen for KVM patches ]
> ]
>
> This series replaces the existing paravirtualized spinlock mechanism
> with a paravirtualized ticketlock mechanism.
>
> Ticket locks have an inherent problem in a virtualized case, because
> the vCPUs are scheduled rather than running concurrently (ignoring
> gang scheduled vCPUs).  This can result in catastrophic performance
> collapses when the vCPU scheduler doesn't schedule the correct "next"
> vCPU, and ends up scheduling a vCPU which burns its entire timeslice
> spinning.  (Note that this is not the same problem as lock-holder
> preemption, which this series also addresses; that's also a problem,
> but not catastrophic).
>
> (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
> http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)
>
> Currently we deal with this by having PV spinlocks, which adds a layer
> of indirection in front of all the spinlock functions, and defining a
> completely new implementation for Xen (and for other pvops users, but
> there are none at present).
>
> PV ticketlocks keeps the existing ticketlock implemenentation
> (fastpath) as-is, but adds a couple of pvops for the slow paths:
>
> - If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
>   iterations, then call out to the __ticket_lock_spinning() pvop,
>   which allows a backend to block the vCPU rather than spinning.  This
>   pvop can set the lock into "slowpath state".
>
> - When releasing a lock, if it is in "slowpath state", the call
>   __ticket_unlock_kick() to kick the next vCPU in line awake.  If the
>   lock is no longer in contention, it also clears the slowpath flag.
>
> The "slowpath state" is stored in the LSB of the within the lock tail
> ticket.  This has the effect of reducing the max number of CPUs by
> half (so, a "small ticket" can deal with 128 CPUs, and "large ticket"
> 32768).
>
> This series provides a Xen implementation, but it should be
> straightforward to add a KVM implementation as well.
>

Looks like a good baseline on which to build the KVM implementation.  We
might need some handshake to prevent interference on the host side with
the PLE code.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-26 14:25 ` [PATCH RFC V6 0/11] Paravirtualized ticketlocks Avi Kivity
@ 2012-03-27  7:37   ` Raghavendra K T
  2012-03-28 16:37     ` Alan Meadows
       [not found]     ` <CAMy5W3foop40+R1RLv_JPhnO5ZmV90uMmNERYY-e3QCeaJfqLw@mail.gmail.com>
  0 siblings, 2 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-03-27  7:37 UTC (permalink / raw)
  To: Avi Kivity, H. Peter Anvin, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

On 03/26/2012 07:55 PM, Avi Kivity wrote:
> On 03/21/2012 12:20 PM, Raghavendra K T wrote:
>> From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
[...]
>>
>> This series provides a Xen implementation, but it should be
>> straightforward to add a KVM implementation as well.
>>
>
> Looks like a good baseline on which to build the KVM implementation.  We
> might need some handshake to prevent interference on the host side with
> the PLE code.
>

Avi, thanks for reviewing. True, it is sort of equivalent to PLE on a
non-PLE machine.

Ingo, Peter,
Can you please let us know if this series can be considered for the next
merge window?
Or do you still have some concerns that need addressing?

I shall rebase the patches to 3.3 and resend. (The main differences would be
UNINLINE_SPIN_UNLOCK and the jump label changes to use
static_key_true/false() instead of static_branch.)

Thanks,
Raghu



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-27  7:37   ` Raghavendra K T
@ 2012-03-28 16:37     ` Alan Meadows
       [not found]     ` <CAMy5W3foop40+R1RLv_JPhnO5ZmV90uMmNERYY-e3QCeaJfqLw@mail.gmail.com>
  1 sibling, 0 replies; 55+ messages in thread
From: Alan Meadows @ 2012-03-28 16:37 UTC (permalink / raw)
  To: KVM, LKML

I am happy to see this issue receiving some attention and second the
wish to see these patches be considered for further review and
inclusion in an upcoming release.

Overcommit is not as common in enterprise and single-tenant
virtualized environments as it is in multi-tenant environments, and
frankly we have been suffering.

We have been running an early copy of these patches in our lab and in
a small production node sample set both on 3.2.0-rc4 and 3.3.0-rc6 for
over two weeks now with great success. With the heavy level of
vCPU:pCPU overcommit required for our situation, the patches are
increasing performance by an _order of magnitude_ on our E5645 and
E5620 systems.

Alan Meadows

On Tue, Mar 27, 2012 at 12:37 AM, Raghavendra K T
<raghavendra.kt@linux.vnet.ibm.com> wrote:
>
> On 03/26/2012 07:55 PM, Avi Kivity wrote:
>>
>> On 03/21/2012 12:20 PM, Raghavendra K T wrote:
>>>
>>> From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
>
> [...]
>
>>>
>>> This series provides a Xen implementation, but it should be
>>> straightforward to add a KVM implementation as well.
>>>
>>
>> Looks like a good baseline on which to build the KVM implementation.  We
>> might need some handshake to prevent interference on the host side with
>> the PLE code.
>>
>
> Avi, Thanks for reviewing. True, it is sort of equivalent to PLE on non
> PLE machine.
>
> Ingo, Peter,
> Can you please let us know if this series can be considered for next merge
> window?
> OR do you still have some concerns that needs addressing.
>
> I shall rebase patches to 3.3 and resend. (main difference would be
> UNINLINE_SPIN_UNLOCK and jump label changes to use static_key_true/false()
> usage instead of static_branch.)
>
> Thanks,
> Raghu
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
       [not found]     ` <CAMy5W3foop40+R1RLv_JPhnO5ZmV90uMmNERYY-e3QCeaJfqLw@mail.gmail.com>
@ 2012-03-28 18:21       ` Raghavendra K T
  2012-03-29  9:58         ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Raghavendra K T @ 2012-03-28 18:21 UTC (permalink / raw)
  To: Alan Meadows, Avi Kivity
  Cc: H. Peter Anvin, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Marcelo Tosatti, KVM, Andi Kleen,
	Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

On 03/28/2012 09:39 PM, Alan Meadows wrote:
> I am happy to see this issue receiving some attention and second the
> wish to see these patches be considered for further review and inclusion
> in an upcoming release.
>
> Overcommit is not as common in enterprise and single-tenant virtualized
> environments as it is in multi-tenant environments, and frankly we have
> been suffering.
>
> We have been running an early copy of these patches in our lab and in a
> small production node sample set both on3.2.0-rc4 and 3.3.0-rc6 for over
> two weeks now with great success. With the heavy level of vCPU:pCPU
> overcommit required for our situation, the patches are increasing
> performance by an _order of magnitude_ on our E5645 and E5620 systems.
>

Thanks, Alan, for the support. I feel the timing of this patch was a little
bad though (merge window).

>
>
>         Looks like a good baseline on which to build the KVM
>         implementation.  We
>         might need some handshake to prevent interference on the host
>         side with
>         the PLE code.
>

I think I still missed some point in Avi's comment. I agree that PLE
may be interfering with the above patches (resulting in smaller performance
advantages), but we have not seen performance degradation with the
patches in earlier benchmarks. [ Theoretically the patch has a slight
advantage over PLE in that it at least knows who should run next. ]

So the TODOs on my list for this are:
1. More analysis of performance on a PLE machine.
2. Seeing how to implement a handshake to increase performance (if the PLE +
patch combination has a slight negative effect).

Sorry that I could not do more analysis on PLE (as promised last time)
because of machine availability.

I'll do some work on this and come back. But in the meantime, I do not
see it as blocking for the next merge window.

>
>     Avi, Thanks for reviewing. True, it is sort of equivalent to PLE on
>     non PLE machine.
>
>     Ingo, Peter,
>     Can you please let us know if this series can be considered for next
>     merge window?
>     OR do you still have some concerns that needs addressing.
>
>     I shall rebase patches to 3.3 and resend. (main difference would be
>     UNINLINE_SPIN_UNLOCK and jump label changes to use
>     static_key_true/false() usage instead of static_branch.)



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-28 18:21       ` Raghavendra K T
@ 2012-03-29  9:58         ` Avi Kivity
  2012-03-29 18:03           ` Raghavendra K T
  0 siblings, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2012-03-29  9:58 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Alan Meadows, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>
>>
>>
>>         Looks like a good baseline on which to build the KVM
>>         implementation.  We
>>         might need some handshake to prevent interference on the host
>>         side with
>>         the PLE code.
>>
>
> I think I still missed some point in Avi's comment. I agree that PLE
> may be interfering with above patches (resulting in less performance
> advantages). but we have not seen performance degradation with the
> patches in earlier benchmarks. [ theoretically since patch has very
> slight advantage over PLE that atleast it knows who should run next ].

The advantage grows with the vcpu counts and overcommit ratio.  If you
have N vcpus and M:1 overcommit, PLE has to guess from N/M queued vcpus
while your patch knows who to wake up.

>
> So TODO in my list on this is:
> 1. More analysis of performance on PLE mc.
> 2. Seeing how to implement handshake to increase performance (if PLE +
> patch combination have slight negative effect).

I can think of two options:
- from the PLE handler, don't wake up a vcpu that is sleeping because it
is waiting for a kick
- look at other sources of pause loops (I guess smp_call_function() is
the significant source) and adjust them to use the same mechanism, and
ask the host to disable PLE exiting.

This can be done incrementally later.
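
A minimal user-space sketch of the first option above (when picking a
yield_to() candidate, skip any vcpu that is halted waiting for a kick); the
vcpu structure and its fields here are hypothetical, not KVM's:

/*
 * Model of the first option: when the PLE handler picks a directed
 * yield target, skip vcpus that are blocked in the lock slowpath
 * waiting for a kick, since boosting them cannot release the lock.
 */
#include <stdbool.h>
#include <stdio.h>

struct vcpu {
        int id;
        bool runnable;          /* has work, but may not be on a pcpu */
        bool running;           /* currently on a pcpu */
        bool halted_for_kick;   /* sleeping in the lock slowpath, waiting for a kick */
};

static struct vcpu *pick_yield_target(struct vcpu *v, int n, int me)
{
        for (int i = 0; i < n; i++) {
                if (i == me)
                        continue;
                if (!v[i].runnable || v[i].running)
                        continue;
                if (v[i].halted_for_kick)
                        continue;       /* don't boost kick-waiters */
                return &v[i];
        }
        return NULL;
}

int main(void)
{
        struct vcpu v[] = {
                { 0, true, true,  false },      /* the spinner (us) */
                { 1, true, false, true  },      /* waiting for a kick: skip */
                { 2, true, false, false },      /* plausible lock holder: boost */
        };
        struct vcpu *t = pick_yield_target(v, 3, 0);

        printf("yield to vcpu %d\n", t ? t->id : -1);
        return 0;
}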

>
> Sorry that, I could not do more analysis on PLE (as promised last time)
> because of machine availability.
>
> I 'll do some work on this and comeback. But in the meantime, I do not
> see it as blocking for next merge window.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-29  9:58         ` Avi Kivity
@ 2012-03-29 18:03           ` Raghavendra K T
  2012-03-30 10:07             ` Raghavendra K T
  0 siblings, 1 reply; 55+ messages in thread
From: Raghavendra K T @ 2012-03-29 18:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alan Meadows, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 03/29/2012 03:28 PM, Avi Kivity wrote:
> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>>
>>>
>>>
>>>          Looks like a good baseline on which to build the KVM
>>>          implementation.  We
>>>          might need some handshake to prevent interference on the host
>>>          side with
>>>          the PLE code.
>>>
>>
>> I think I still missed some point in Avi's comment. I agree that PLE
>> may be interfering with above patches (resulting in less performance
>> advantages). but we have not seen performance degradation with the
>> patches in earlier benchmarks. [ theoretically since patch has very
>> slight advantage over PLE that atleast it knows who should run next ].
>
> The advantage grows with the vcpu counts and overcommit ratio.  If you
> have N vcpus and M:1 overcommit, PLE has to guess from N/M queued vcpus
> while your patch knows who to wake up.
>

Yes. I agree.

>>
>> So TODO in my list on this is:
>> 1. More analysis of performance on PLE mc.
>> 2. Seeing how to implement handshake to increase performance (if PLE +
>> patch combination have slight negative effect).
>
> I can think of two options:

I really like below ideas. Thanks for that!.

> - from the PLE handler, don't wake up a vcpu that is sleeping because it
> is waiting for a kick

How about, adding another pass in the beginning of  kvm_vcpu_on_spin()
to check if any vcpu is already kicked. This would almost result in 
yield_to(kicked_vcpu). IMO this is also worth trying.

will try above ideas soon.

> - look at other sources of pause loops (I guess smp_call_function() is
> the significant source) and adjust them to use the same mechanism, and
> ask the host to disable PLE exiting.
>
> This can be done incrementally later.
>

Yes.. this can wait a bit.



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-29 18:03           ` Raghavendra K T
@ 2012-03-30 10:07             ` Raghavendra K T
  2012-04-01 13:18               ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Raghavendra K T @ 2012-03-30 10:07 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alan Meadows, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 03/29/2012 11:33 PM, Raghavendra K T wrote:
> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:

> I really like below ideas. Thanks for that!.
>
>> - from the PLE handler, don't wake up a vcpu that is sleeping because it
>> is waiting for a kick
>
> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
> to check if any vcpu is already kicked. This would almost result in
> yield_to(kicked_vcpu). IMO this is also worth trying.
>
> will try above ideas soon.
>

I have patch something like below in mind to try:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3b98b1..5127668 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
  	 * else and called schedule in __vcpu_run.  Hopefully that
  	 * VCPU is holding the lock that we need and will release it.
  	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * Priority is given to vcpu that are unhalted.
  	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 3 && !yielded; pass++) {
  		kvm_for_each_vcpu(i, vcpu, kvm) {
  			struct task_struct *task = NULL;
  			struct pid *pid;
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && !vcpu->pv_unhalted)
+				continue;
+			else if (pass == 1 && i < last_boosted_vcpu) {
  				i = last_boosted_vcpu;
  				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass == 2 && i > last_boosted_vcpu)
  				break;
  			if (vcpu == me)
  				continue;



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-21 10:20 [PATCH RFC V6 0/11] Paravirtualized ticketlocks Raghavendra K T
                   ` (11 preceding siblings ...)
  2012-03-26 14:25 ` [PATCH RFC V6 0/11] Paravirtualized ticketlocks Avi Kivity
@ 2012-03-30 20:26 ` H. Peter Anvin
  2012-03-30 22:07   ` Thomas Gleixner
  2012-03-31  0:51   ` Raghavendra K T
  12 siblings, 2 replies; 55+ messages in thread
From: H. Peter Anvin @ 2012-03-30 20:26 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Avi Kivity, Marcelo Tosatti, KVM,
	Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

What is the current status of this patchset?  I haven't looked at it too
closely because I have been focused on 3.4 up until now...

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 20:26 ` H. Peter Anvin
@ 2012-03-30 22:07   ` Thomas Gleixner
  2012-03-30 22:18     ` Andi Kleen
                       ` (5 more replies)
  2012-03-31  0:51   ` Raghavendra K T
  1 sibling, 6 replies; 55+ messages in thread
From: Thomas Gleixner @ 2012-03-30 22:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Raghavendra K T, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Avi Kivity, Marcelo Tosatti, KVM,
	Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

On Fri, 30 Mar 2012, H. Peter Anvin wrote:

> What is the current status of this patchset?  I haven't looked at it too
> closely because I have been focused on 3.4 up until now...

The real question is whether these heuristics are the correct approach
or not.

If I look at it from the non virtualized kernel side then this is ass
backwards. We know already that we are holding a spinlock which might
cause other (v)cpus to go into an eternal spin. The non virtualized
kernel solves this by disabling preemption and therefore getting out of
the critical section as fast as possible.

The virtualization problem reminds me a lot of the problem which RT
kernels are observing where non raw spinlocks are turned into
"sleeping spinlocks" and therefor can cause throughput issues for non
RT workloads.

Though the virtualized situation is even worse. Any preempted guest
section which holds a spinlock is prone to cause unbound delays.

The paravirt ticketlock solution can only mitigate the problem, but
not solve it. With massive overcommit there is always a way to trigger
worst case scenarios unless you are educating the scheduler to cope
with that.

So if we need to fiddle with the scheduler and frankly that's the only
way to get a real gain (the numbers, which are achieved by this
patches, are not that impressive) then the question arises whether we
should turn the whole thing around.

I know that Peter is going to go berserk on me, but if we are running
a paravirt guest then it's simple to provide a mechanism which allows
the host (aka hypervisor) to check that in the guest just by looking
at some global state.

So if a guest exits due to an external event it's easy to inspect the
state of that guest and avoid to schedule away when it was interrupted
in a spinlock held section. That guest/host shared state needs to be
modified to indicate the guest to invoke an exit when the last nested
lock has been released.

Of course this needs to be time bound, so a rogue guest cannot
monopolize the cpu forever, but that's the least to worry about
problem simply because a guest which does not get out of a spinlocked
region within a certain amount of time is borked and eligible for
killing anyway.
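
A minimal user-space model of that guest/host shared state (all names, such
as struct pv_lock_state, are hypothetical): the guest publishes its spinlock
nesting depth, the host defers preemption while it is non-zero, and the guest
exits voluntarily once the last lock is dropped:

#include <stdbool.h>
#include <stdio.h>

struct pv_lock_state {
        unsigned int lock_depth;        /* guest-written: nested spinlocks held */
        bool exit_requested;            /* host-written: exit when depth hits 0 */
};

/* guest side */
static void guest_lock(struct pv_lock_state *s)
{
        s->lock_depth++;
}

static void guest_unlock(struct pv_lock_state *s)
{
        if (--s->lock_depth == 0 && s->exit_requested) {
                s->exit_requested = false;
                printf("guest: last lock released, yielding to host\n");
        }
}

/* host side, on an exit caused by some external event */
static bool host_may_preempt(struct pv_lock_state *s)
{
        if (s->lock_depth == 0)
                return true;            /* safe to deschedule the vcpu */
        s->exit_requested = true;       /* let it leave the critical section first */
        return false;                   /* bounded by a timeout in practice */
}

int main(void)
{
        struct pv_lock_state s = { 0, false };

        guest_lock(&s);
        printf("host may preempt: %d\n", host_may_preempt(&s)); /* 0 */
        guest_unlock(&s);               /* triggers the voluntary exit */
        printf("host may preempt: %d\n", host_may_preempt(&s)); /* 1 */
        return 0;
}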

Thoughts ?

Thanks,

	tglx




* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 22:07   ` Thomas Gleixner
@ 2012-03-30 22:18     ` Andi Kleen
  2012-03-30 23:04       ` Thomas Gleixner
  2012-03-31  4:07     ` Srivatsa Vaddagiri
                       ` (4 subsequent siblings)
  5 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2012-03-30 22:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Raghavendra K T, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Avi Kivity,
	Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be

On a large system under high contention sleeping can perform surprisingly
well. pthread mutexes have a tendency to beat kernel spinlocks there.
So avoiding sleeping locks completely (that is what pv locks are
essentially) is not necessarily that good.

Your proposal is probably only a good idea on low contention
and relatively small systems.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 22:18     ` Andi Kleen
@ 2012-03-30 23:04       ` Thomas Gleixner
  2012-03-31  0:08         ` Andi Kleen
  0 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2012-03-30 23:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Raghavendra K T, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Avi Kivity,
	Marcelo Tosatti, KVM, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On Sat, 31 Mar 2012, Andi Kleen wrote:

> > So if a guest exits due to an external event it's easy to inspect the
> > state of that guest and avoid to schedule away when it was interrupted
> > in a spinlock held section. That guest/host shared state needs to be
> 
> On a large system under high contention sleeping can perform surprisingly
> well. pthread mutexes have a tendency to beat kernel spinlocks there.
> So avoiding sleeping locks completely (that is what pv locks are
> essentially) is not necessarily that good.

Care to back that up with numbers and proper trace evidence instead of
handwaving?

I've stared at RT traces and throughput problems on _LARGE_ machines
long enough to know what I'm talking about and I can provide evidence
in a split second.

> Your proposal is probably only a good idea on low contention
> and relatively small systems.

Sigh, you have really no fcking clue what you are talking about.

On RT we observed scalabilty problems way before hardware was
available to expose them. So what's your point?

Put up or shut up, really!

    tglx


* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 23:04       ` Thomas Gleixner
@ 2012-03-31  0:08         ` Andi Kleen
  2012-03-31  8:11           ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2012-03-31  0:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andi Kleen, H. Peter Anvin, Raghavendra K T, Ingo Molnar,
	Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao

On Sat, Mar 31, 2012 at 01:04:41AM +0200, Thomas "Kubys" Gleixner wrote:
> On Sat, 31 Mar 2012, Andi Kleen wrote:
> 
> > > So if a guest exits due to an external event it's easy to inspect the
> > > state of that guest and avoid to schedule away when it was interrupted
> > > in a spinlock held section. That guest/host shared state needs to be
> > 
> > On a large system under high contention sleeping can perform surprisingly
> > well. pthread mutexes have a tendency to beat kernel spinlocks there.
> > So avoiding sleeping locks completely (that is what pv locks are
> > essentially) is not necessarily that good.
> 
> Care to back that up with numbers and proper trace evidence instead of
> handwaving?

E.g. my Plumbers presentation on lock and mm scalability from last year has some
graphs that show this very clearly, plus some additional data on the mutexes.
This compares against the glibc futex locks, which perform much better than the
kernel mutex locks on larger systems under higher contention.

Given your tone I will not supply an URL. I'm sure you can find it if you
need it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 20:26 ` H. Peter Anvin
  2012-03-30 22:07   ` Thomas Gleixner
@ 2012-03-31  0:51   ` Raghavendra K T
  1 sibling, 0 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-03-31  0:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Avi Kivity, Marcelo Tosatti, KVM,
	Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

On 03/31/2012 01:56 AM, H. Peter Anvin wrote:
> What is the current status of this patchset?  I haven't looked at it too
> closely because I have been focused on 3.4 up until now...
>

Thanks Peter,

Currently targeting the patchset for the next merge window. IMO these
patches are in good shape now. I'll rebase them and send them
ASAP. The series needs the "jumplabel split patch" (from Andrew Jones) as a
dependency; I would fold that patch (got an ok from him for that)
into the series as well.

  - Raghu







* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 22:07   ` Thomas Gleixner
  2012-03-30 22:18     ` Andi Kleen
@ 2012-03-31  4:07     ` Srivatsa Vaddagiri
  2012-03-31  4:09       ` Srivatsa Vaddagiri
  2012-04-16 15:44       ` Konrad Rzeszutek Wilk
  2012-04-01 13:31     ` Avi Kivity
                       ` (3 subsequent siblings)
  5 siblings, 2 replies; 55+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-31  4:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Raghavendra K T, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Avi Kivity,
	Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Stefano Stabellini, Attilio Rao

* Thomas Gleixner <tglx@linutronix.de> [2012-03-31 00:07:58]:

> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
> 
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to indicate the guest to invoke an exit when the last nested
> lock has been released.

I had attempted something like that long back:

http://lkml.org/lkml/2010/6/3/4

The issue is with ticketlocks though. VCPUs could go into a spin w/o
a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
and releases the lock. VCPU1 is next eligible to take the lock. If 
that is not scheduled early enough by host, then remaining vcpus would keep 
spinning (even though lock is technically not held by anybody) w/o making 
forward progress.

In that situation, what we really need is for the guest to hint to host
scheduler to schedule VCPU1 early (via yield_to or something similar). 

The current pv-spinlock patches however do not track which vcpu is
spinning at the head of the ticketlock queue. I suppose we can consider
that optimization in the future and see how much benefit it provides (over
the plain yield/sleep done now).
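
A user-space sketch of that finer-grained tracking (structure and field names
are hypothetical, not from the patch series): each waiter records its cpu
against its ticket, so the unlocker knows exactly which vcpu should be hinted
to the host (kick/yield_to) next:

#include <stdio.h>

#define MAX_WAITERS 64

struct pv_ticketlock {
        unsigned int head, tail;
        int waiter[MAX_WAITERS];        /* ticket number -> waiting cpu, -1 if none */
};

/* take a ticket and remember which cpu waits on it (an xadd in real code) */
static unsigned int lock_register(struct pv_ticketlock *l, int cpu)
{
        unsigned int ticket = l->tail++;

        l->waiter[ticket % MAX_WAITERS] = cpu;
        return ticket;
}

/* release the lock and report which cpu holds the next ticket, or -1 */
static int unlock_next_waiter(struct pv_ticketlock *l)
{
        l->waiter[l->head % MAX_WAITERS] = -1;
        l->head++;
        return l->waiter[l->head % MAX_WAITERS];
}

int main(void)
{
        struct pv_ticketlock l = { .head = 0, .tail = 0 };

        for (int i = 0; i < MAX_WAITERS; i++)
                l.waiter[i] = -1;

        lock_register(&l, 7);   /* cpu 7 gets ticket 0 and holds the lock */
        lock_register(&l, 3);   /* cpu 3 spins on ticket 1 */
        lock_register(&l, 9);   /* cpu 9 spins on ticket 2 */

        /* cpu 7, then cpu 3, unlock: the host is told exactly who runs next */
        printf("yield_to hint: cpu %d\n", unlock_next_waiter(&l));
        printf("yield_to hint: cpu %d\n", unlock_next_waiter(&l));
        return 0;
}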

Do you see any issues if we take in what we have today and address the
finer-grained optimization as next step?

- vatsa 



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-31  4:07     ` Srivatsa Vaddagiri
@ 2012-03-31  4:09       ` Srivatsa Vaddagiri
  2012-04-16 15:44       ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 55+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-31  4:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Raghavendra K T, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Avi Kivity,
	Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Stefano Stabellini, Attilio Rao

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2012-03-31 09:37:45]:

> The issue is with ticketlocks though. VCPUs could go into a spin w/o
> a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
> that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
> and releases the lock. VCPU1 is next eligible to take the lock. If 

Sorry I meant to say "VCPU2 is next eligible ..."

> that is not scheduled early enough by host, then remaining vcpus would keep 
> spinning (even though lock is technically not held by anybody) w/o making 
> forward progress.
> 
> In that situation, what we really need is for the guest to hint to host
> scheduler to schedule VCPU1 early (via yield_to or something similar). 

s/VCPU1/VCPU2 ..

- vatsa



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-31  0:08         ` Andi Kleen
@ 2012-03-31  8:11           ` Ingo Molnar
  0 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2012-03-31  8:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Thomas Gleixner, H. Peter Anvin, Raghavendra K T, Ingo Molnar,
	Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Xen Devel,
	Konrad Rzeszutek Wilk, Virtualization, Jeremy Fitzhardinge,
	Stephan Diestelhorst, Srivatsa Vaddagiri, Stefano Stabellini,
	Attilio Rao


* Andi Kleen <andi@firstfloor.org> wrote:

> > Care to back that up with numbers and proper trace evidence 
> > instead of handwaving?
> 
> E.g. my plumbers presentations on lock and mm scalability from 
> last year has some graphs that show this very clearly, plus 
> some additional data on the mutexes. This compares to the 
> glibc futex locks, which perform much better than the kernel 
> mutex locks on larger systems under higher contention

If you mean these draft slides:

  http://www.halobates.de/plumbers-fork-locks_v2.pdf

it has very little verifiable information in it. It just 
cryptically says lock hold time "microbenchmark", which might or 
might not be a valid measurement.

You could have been honest and straightforward in your first 
mail:

 "I ran workload X on machine Y, and got results Z."

Instead you are *hindering* the discussion:

> Given your tone I will not supply an URL. [...]

If you meant the above URL then it's not the proper numbers 
Thomas asked for, just some vague slides. If you meant something 
else then put up or shut up.

Thanks,

	Ingo


* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 10:07             ` Raghavendra K T
@ 2012-04-01 13:18               ` Avi Kivity
  2012-04-01 13:48                 ` Raghavendra K T
  0 siblings, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2012-04-01 13:18 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Alan Meadows, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 03/30/2012 01:07 PM, Raghavendra K T wrote:
> On 03/29/2012 11:33 PM, Raghavendra K T wrote:
>> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>
>> I really like below ideas. Thanks for that!.
>>
>>> - from the PLE handler, don't wake up a vcpu that is sleeping
>>> because it
>>> is waiting for a kick
>>
>> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
>> to check if any vcpu is already kicked. This would almost result in
>> yield_to(kicked_vcpu). IMO this is also worth trying.
>>
>> will try above ideas soon.
>>
>
> I have patch something like below in mind to try:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d3b98b1..5127668 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>       * else and called schedule in __vcpu_run.  Hopefully that
>       * VCPU is holding the lock that we need and will release it.
>       * We approximate round-robin by starting at the last boosted VCPU.
> +     * Priority is given to vcpu that are unhalted.
>       */
> -    for (pass = 0; pass < 2 && !yielded; pass++) {
> +    for (pass = 0; pass < 3 && !yielded; pass++) {
>          kvm_for_each_vcpu(i, vcpu, kvm) {
>              struct task_struct *task = NULL;
>              struct pid *pid;
> -            if (!pass && i < last_boosted_vcpu) {
> +            if (!pass && !vcpu->pv_unhalted)
> +                continue;
> +            else if (pass == 1 && i < last_boosted_vcpu) {
>                  i = last_boosted_vcpu;
>                  continue;
> -            } else if (pass && i > last_boosted_vcpu)
> +            } else if (pass == 2 && i > last_boosted_vcpu)
>                  break;
>              if (vcpu == me)
>                  continue;
>

Actually I think this is unneeded.  The loop tries to find vcpus that
are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
don't match this condition.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 22:07   ` Thomas Gleixner
  2012-03-30 22:18     ` Andi Kleen
  2012-03-31  4:07     ` Srivatsa Vaddagiri
@ 2012-04-01 13:31     ` Avi Kivity
  2012-04-02  9:26       ` Thomas Gleixner
  2012-04-02  4:36     ` [Xen-devel] " Juergen Gross
                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2012-04-01 13:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Raghavendra K T, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 03/31/2012 01:07 AM, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, H. Peter Anvin wrote:
>
> > What is the current status of this patchset?  I haven't looked at it too
> > closely because I have been focused on 3.4 up until now...
>
> The real question is whether these heuristics are the correct approach
> or not.
>
> If I look at it from the non virtualized kernel side then this is ass
> backwards. We know already that we are holding a spinlock which might
> cause other (v)cpus going into eternal spin. The non virtualized
> kernel solves this by disabling preemption and therefor getting out of
> the critical section as fast as possible,
>
> The virtualization problem reminds me a lot of the problem which RT
> kernels are observing where non raw spinlocks are turned into
> "sleeping spinlocks" and therefor can cause throughput issues for non
> RT workloads.
>
> Though the virtualized situation is even worse. Any preempted guest
> section which holds a spinlock is prone to cause unbound delays.
>
> The paravirt ticketlock solution can only mitigate the problem, but
> not solve it. With massive overcommit there is always a way to trigger
> worst case scenarious unless you are educating the scheduler to cope
> with that.
>
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by this
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.
>
> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
>
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to indicate the guest to invoke an exit when the last nested
> lock has been released.

Interesting idea (I think it has been raised before btw, don't recall by
who).

One thing about it is that it can give many false positives.  Consider a
fine-grained spinlock that is being accessed by many threads.  That is,
the lock is taken and released with high frequency, but there is no
contention, because each vcpu is accessing a different instance.  So the
host scheduler will needlessly delay preemption of vcpus that happen to
be holding a lock, even though this gains nothing.

A second issue may happen with a lock that is taken and released with
high frequency, with a high hold percentage.  The host scheduler may
always sample the guest in a held state, leading it to conclude that
it's exceeding its timeout when in fact the lock is held for a short
time only.

> Of course this needs to be time bound, so a rogue guest cannot
> monopolize the cpu forever, but that's the least to worry about
> problem simply because a guest which does not get out of a spinlocked
> region within a certain amount of time is borked and elegible to
> killing anyway.

Hopefully not killing!  Just because a guest doesn't scale well, or even
if it's deadlocked, doesn't mean it should be killed.  Just preempt it.

> Thoughts ?

It's certainly interesting.  Maybe a combination is worthwhile - prevent
lockholder preemption for a short period of time AND put waiters to
sleep in case that period is insufficient to release the lock.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-01 13:18               ` Avi Kivity
@ 2012-04-01 13:48                 ` Raghavendra K T
  2012-04-01 13:53                   ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Raghavendra K T @ 2012-04-01 13:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alan Meadows, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 04/01/2012 06:48 PM, Avi Kivity wrote:
> On 03/30/2012 01:07 PM, Raghavendra K T wrote:
>> On 03/29/2012 11:33 PM, Raghavendra K T wrote:
>>> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>>>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>>
>>> I really like below ideas. Thanks for that!.
>>>
>>>> - from the PLE handler, don't wake up a vcpu that is sleeping
>>>> because it
>>>> is waiting for a kick
>>>
>>> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
>>> to check if any vcpu is already kicked. This would almost result in
>>> yield_to(kicked_vcpu). IMO this is also worth trying.
>>>
>>> will try above ideas soon.
>>>
>>
>> I have patch something like below in mind to try:
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index d3b98b1..5127668 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>        * else and called schedule in __vcpu_run.  Hopefully that
>>        * VCPU is holding the lock that we need and will release it.
>>        * We approximate round-robin by starting at the last boosted VCPU.
>> +     * Priority is given to vcpu that are unhalted.
>>        */
>> -    for (pass = 0; pass<  2&&  !yielded; pass++) {
>> +    for (pass = 0; pass<  3&&  !yielded; pass++) {
>>           kvm_for_each_vcpu(i, vcpu, kvm) {
>>               struct task_struct *task = NULL;
>>               struct pid *pid;
>> -            if (!pass&&  i<  last_boosted_vcpu) {
>> +            if (!pass&&  !vcpu->pv_unhalted)
>> +                continue;
>> +            else if (pass == 1&&  i<  last_boosted_vcpu) {
>>                   i = last_boosted_vcpu;
>>                   continue;
>> -            } else if (pass&&  i>  last_boosted_vcpu)
>> +            } else if (pass == 2&&  i>  last_boosted_vcpu)
>>                   break;
>>               if (vcpu == me)
>>                   continue;
>>
>
> Actually I think this is unneeded.  The loops tries to find vcpus that
> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
> don't match this condition.
>

I almost agree. But one thought at the back of my mind:

Suppose there are 8 runnable vcpus, of which 4 have been kicked but are
not running; wouldn't yield_to those 4 vcpus result in better
lock progress?

I still have a little trouble getting a PLE setup here (hence rebasing
patches instead).
Once I get PLE running, if the numbers prove no improvement, I
will drop this idea.



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-01 13:48                 ` Raghavendra K T
@ 2012-04-01 13:53                   ` Avi Kivity
  2012-04-01 13:56                     ` Raghavendra K T
                                       ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Avi Kivity @ 2012-04-01 13:53 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Alan Meadows, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>> I have patch something like below in mind to try:
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index d3b98b1..5127668 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>        * else and called schedule in __vcpu_run.  Hopefully that
>>>        * VCPU is holding the lock that we need and will release it.
>>>        * We approximate round-robin by starting at the last boosted
>>> VCPU.
>>> +     * Priority is given to vcpu that are unhalted.
>>>        */
>>> -    for (pass = 0; pass<  2&&  !yielded; pass++) {
>>> +    for (pass = 0; pass<  3&&  !yielded; pass++) {
>>>           kvm_for_each_vcpu(i, vcpu, kvm) {
>>>               struct task_struct *task = NULL;
>>>               struct pid *pid;
>>> -            if (!pass&&  i<  last_boosted_vcpu) {
>>> +            if (!pass&&  !vcpu->pv_unhalted)
>>> +                continue;
>>> +            else if (pass == 1&&  i<  last_boosted_vcpu) {
>>>                   i = last_boosted_vcpu;
>>>                   continue;
>>> -            } else if (pass&&  i>  last_boosted_vcpu)
>>> +            } else if (pass == 2&&  i>  last_boosted_vcpu)
>>>                   break;
>>>               if (vcpu == me)
>>>                   continue;
>>>
>>
>> Actually I think this is unneeded.  The loops tries to find vcpus that
>> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
>> don't match this condition.
>>
>
>
> I almost agree. But at corner of my thought,
>
> Suppose there are 8 vcpus runnable out of which 4 of them are kicked
> but not running, making yield_to those 4 vcpus would result in better
> lock progress. no?

That's what the code does.

>   I still have little problem getting PLE setup, here (instead
> rebasing patches).
> Once I get PLE to get that running, and numbers prove no improvement,
> I will drop this idea.
>

I'm interested in how PLE does vs. your patches, both with PLE enabled
and disabled.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-01 13:53                   ` Avi Kivity
@ 2012-04-01 13:56                     ` Raghavendra K T
  2012-04-02  9:51                     ` Raghavendra K T
  2012-04-05  8:43                     ` Raghavendra K T
  2 siblings, 0 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-04-01 13:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alan Meadows, H. Peter Anvin, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 04/01/2012 07:23 PM, Avi Kivity wrote:
> On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>>> I have patch something like below in mind to try:
>
> I'm interested in how PLE does vs. your patches, both with PLE enabled
> and disabled.
>

Sure. will update with the experimental results.



* Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 22:07   ` Thomas Gleixner
                       ` (2 preceding siblings ...)
  2012-04-01 13:31     ` Avi Kivity
@ 2012-04-02  4:36     ` Juergen Gross
  2012-04-02  9:42     ` Ian Campbell
  2012-04-11  1:29     ` Marcelo Tosatti
  5 siblings, 0 replies; 55+ messages in thread
From: Juergen Gross @ 2012-04-02  4:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, the arch/x86 maintainers, KVM,
	Konrad Rzeszutek Wilk, Peter Zijlstra, Stefano Stabellini,
	Raghavendra K T, LKML, Marcelo Tosatti, Andi Kleen, Avi Kivity,
	Jeremy Fitzhardinge, Srivatsa Vaddagiri, Attilio Rao,
	Ingo Molnar, Virtualization, Linus Torvalds, Xen Devel,
	Stephan Diestelhorst

On 03/31/2012 12:07 AM, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, H. Peter Anvin wrote:
>
>> What is the current status of this patchset?  I haven't looked at it too
>> closely because I have been focused on 3.4 up until now...
> The real question is whether these heuristics are the correct approach
> or not.
>
> If I look at it from the non virtualized kernel side then this is ass
> backwards. We know already that we are holding a spinlock which might
> cause other (v)cpus going into eternal spin. The non virtualized
> kernel solves this by disabling preemption and therefor getting out of
> the critical section as fast as possible,
>
> The virtualization problem reminds me a lot of the problem which RT
> kernels are observing where non raw spinlocks are turned into
> "sleeping spinlocks" and therefor can cause throughput issues for non
> RT workloads.
>
> Though the virtualized situation is even worse. Any preempted guest
> section which holds a spinlock is prone to cause unbound delays.
>
> The paravirt ticketlock solution can only mitigate the problem, but
> not solve it. With massive overcommit there is always a way to trigger
> worst case scenarious unless you are educating the scheduler to cope
> with that.
>
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by this
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.
>
> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
>
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to indicate the guest to invoke an exit when the last nested
> lock has been released.
>
> Of course this needs to be time bound, so a rogue guest cannot
> monopolize the cpu forever, but that's the least to worry about
> problem simply because a guest which does not get out of a spinlocked
> region within a certain amount of time is borked and elegible to
> killing anyway.
>
> Thoughts ?

I used this approach in 2008:

http://lists.xen.org/archives/html/xen-devel/2008-12/msg00740.html

It worked very well, but it was rejected at that time. I wouldn't mind trying
it again if there is some support from your side. :-)


Juergen

-- 

Juergen Gross                 Principal Developer Operating Systems
PDG ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-01 13:31     ` Avi Kivity
@ 2012-04-02  9:26       ` Thomas Gleixner
  2012-04-05  9:15         ` Avi Kivity
  0 siblings, 1 reply; 55+ messages in thread
From: Thomas Gleixner @ 2012-04-02  9:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Raghavendra K T, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On Sun, 1 Apr 2012, Avi Kivity wrote:
> On 03/31/2012 01:07 AM, Thomas Gleixner wrote:
> > On Fri, 30 Mar 2012, H. Peter Anvin wrote:
> >
> > > What is the current status of this patchset?  I haven't looked at it too
> > > closely because I have been focused on 3.4 up until now...
> >
> > The real question is whether these heuristics are the correct approach
> > or not.
> >
> > If I look at it from the non virtualized kernel side then this is ass
> > backwards. We know already that we are holding a spinlock which might
> > cause other (v)cpus going into eternal spin. The non virtualized
> > kernel solves this by disabling preemption and therefor getting out of
> > the critical section as fast as possible,
> >
> > The virtualization problem reminds me a lot of the problem which RT
> > kernels are observing where non raw spinlocks are turned into
> > "sleeping spinlocks" and therefor can cause throughput issues for non
> > RT workloads.
> >
> > Though the virtualized situation is even worse. Any preempted guest
> > section which holds a spinlock is prone to cause unbound delays.
> >
> > The paravirt ticketlock solution can only mitigate the problem, but
> > not solve it. With massive overcommit there is always a way to trigger
> > worst case scenarious unless you are educating the scheduler to cope
> > with that.
> >
> > So if we need to fiddle with the scheduler and frankly that's the only
> > way to get a real gain (the numbers, which are achieved by this
> > patches, are not that impressive) then the question arises whether we
> > should turn the whole thing around.
> >
> > I know that Peter is going to go berserk on me, but if we are running
> > a paravirt guest then it's simple to provide a mechanism which allows
> > the host (aka hypervisor) to check that in the guest just by looking
> > at some global state.
> >
> > So if a guest exits due to an external event it's easy to inspect the
> > state of that guest and avoid to schedule away when it was interrupted
> > in a spinlock held section. That guest/host shared state needs to be
> > modified to indicate the guest to invoke an exit when the last nested
> > lock has been released.
> 
> Interesting idea (I think it has been raised before btw, don't recall by
> who).

Someoen posted a pointer to that old thread.

> One thing about it is that it can give many false positives.  Consider a
> fine-grained spinlock that is being accessed by many threads.  That is,
> the lock is taken and released with high frequency, but there is no
> contention, because each vcpu is accessing a different instance.  So the
> host scheduler will needlessly delay preemption of vcpus that happen to
> be holding a lock, even though this gains nothing.

You're talking about per cpu locks, right? I can see the point there,
but per cpu stuff might be worth annotating to avoid this.
 
> A second issue may happen with a lock that is taken and released with
> high frequency, with a high hold percentage.  The host scheduler may
> always sample the guest in a held state, leading it to conclude that
> it's exceeding its timeout when in fact the lock is held for a short
> time only.

Well, no. You can avoid that.

host		VCPU
		spin_lock()
		 modify_global_state()
   	exit
check_global_state()
mark_global_state()
reschedule vcpu

		spin_unlock()
		 check_global_state()
		  trigger_exit()

So when an exit occurs in the locked section, the host can mark
the global state to tell the guest to issue a trap on unlock.
 
> > Of course this needs to be time bound, so a rogue guest cannot
> > monopolize the cpu forever, but that's the least to worry about
> > problem simply because a guest which does not get out of a spinlocked
> > region within a certain amount of time is borked and elegible to
> > killing anyway.
> 
> Hopefully not killing!  Just because a guest doesn't scale well, or even
> if it's deadlocked, doesn't mean it should be killed.  Just preempt it.

:)
 
> > Thoughts ?
> 
> It's certainly interesting.  Maybe a combination is worthwhile - prevent
> lockholder preemption for a short period of time AND put waiters to
> sleep in case that period is insufficient to release the lock.

Right, but as Srivatsa pointed out this still needs the ticket lock
ordering support to avoid guests being completely starved.

Thanks,

	tglx


* Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 22:07   ` Thomas Gleixner
                       ` (3 preceding siblings ...)
  2012-04-02  4:36     ` [Xen-devel] " Juergen Gross
@ 2012-04-02  9:42     ` Ian Campbell
  2012-04-11  1:29     ` Marcelo Tosatti
  5 siblings, 0 replies; 55+ messages in thread
From: Ian Campbell @ 2012-04-02  9:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, the arch/x86 maintainers, KVM,
	Konrad Rzeszutek Wilk, Peter Zijlstra, Stefano Stabellini,
	Raghavendra K T, LKML, Marcelo Tosatti, Andi Kleen, Avi Kivity,
	Jeremy Fitzhardinge, Srivatsa Vaddagiri, Attilio Rao,
	Ingo Molnar, Virtualization, Linus Torvalds, Xen Devel,
	Stephan Diestelhorst

On Fri, 2012-03-30 at 23:07 +0100, Thomas Gleixner wrote:
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by this
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.

It probably doesn't materially affect your core point (which seems valid
to me), but it's worth pointing out that the numbers presented in this
thread are AFAICT mostly focused on ensuring that the impact of
this infrastructure is acceptable on native, rather than showing the
benefits for virtualized workloads.

Ian.



* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-01 13:53                   ` Avi Kivity
  2012-04-01 13:56                     ` Raghavendra K T
@ 2012-04-02  9:51                     ` Raghavendra K T
  2012-04-02 12:15                       ` Raghavendra K T
  2012-04-05  9:01                       ` Avi Kivity
  2012-04-05  8:43                     ` Raghavendra K T
  2 siblings, 2 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-04-02  9:51 UTC (permalink / raw)
  To: Avi Kivity, H. Peter Anvin
  Cc: Alan Meadows, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Marcelo Tosatti, KVM, Andi Kleen,
	Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

On 04/01/2012 07:23 PM, Avi Kivity wrote:
 > On 04/01/2012 04:48 PM, Raghavendra K T wrote:
 >>>> I have patch something like below in mind to try:
 >>>>
 >>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 >>>> index d3b98b1..5127668 100644
 >>>> --- a/virt/kvm/kvm_main.c
 >>>> +++ b/virt/kvm/kvm_main.c
 >>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 >>>>         * else and called schedule in __vcpu_run.  Hopefully that
 >>>>         * VCPU is holding the lock that we need and will release it.
 >>>>         * We approximate round-robin by starting at the last boosted
 >>>> VCPU.
 >>>> +     * Priority is given to vcpu that are unhalted.
 >>>>         */
 >>>> -    for (pass = 0; pass<   2&&   !yielded; pass++) {
 >>>> +    for (pass = 0; pass<   3&&   !yielded; pass++) {
 >>>>            kvm_for_each_vcpu(i, vcpu, kvm) {
 >>>>                struct task_struct *task = NULL;
 >>>>                struct pid *pid;
 >>>> -            if (!pass&&   i<   last_boosted_vcpu) {
 >>>> +            if (!pass&&   !vcpu->pv_unhalted)
 >>>> +                continue;
 >>>> +            else if (pass == 1&&   i<   last_boosted_vcpu) {
 >>>>                    i = last_boosted_vcpu;
 >>>>                    continue;
 >>>> -            } else if (pass&&   i>   last_boosted_vcpu)
 >>>> +            } else if (pass == 2&&   i>   last_boosted_vcpu)
 >>>>                    break;
 >>>>                if (vcpu == me)
 >>>>                    continue;
 >>>>
 >>>
 >>> Actually I think this is unneeded.  The loops tries to find vcpus that
 >>> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
 >>> don't match this condition.
 >>>

Oh! I think I misinterpreted your statement. hmm I got it. you told to
remove if (vcpu == me) condition.

I got another interesting idea (not sure whether there is some flaw in
it too). Basically I tried a similar idea (to the PLE exit handler) in
vcpu_block.

Instead of blindly scheduling, we try to yield to a vcpu that has been
kicked. IMO it may solve some scalability problems and shrink the LHP
problem further.

I think Thomas would be happy to see the result.

Results:
Test setup
==========
Host: i5-2540M CPU @ 2.60GHz laptop with 4 cpus w/ hyperthreading, 8GB RAM.
Guest: a single guest with 16 vcpus and 2GB RAM.

Did a kernbench run under the guest:
x rc6 with ticketlock (current patchset) + kvm patches (CONFIG_PARAVIRT_SPINLOCK=y)
+ rc6 with ticketlock + kvm patches + try_yield_patch (below one) (YIELD_THRESHOLD=256) (CONFIG_PARAVIRT_SPINLOCK=y)
* rc6 with ticketlock + kvm patches + try_yield_patch (YIELD_THRESHOLD=2048) (CONFIG_PARAVIRT_SPINLOCK=y)

N           Min           Max        Median           Avg        Stddev
x   3        162.45        165.94       165.433     164.60767     1.8857111
+   3        114.02       117.243       115.953     115.73867     1.6221548
Difference at 95.0% confidence
         -29.6882% +/- 2.42192%
*   3       115.823       120.423       117.103       117.783     2.3741946
Difference at 95.0% confidence
         -28.4462% +/- 2.9521%


~29% improvement w.r.t. the current patches.

Note: vanilla rc6 (host and guest) with CONFIG_PARAVIRT_SPINLOCK=n
did not finish the kernbench run even after *1hr 45* minutes (the above
kernbench runs took 9 minutes and 6.5 minutes respectively). I did not
try to test it again.


Yes, I understand that I have to do some more testing. The immediate
TODOs for the patch are:

1) move the code to the arch/x86 directory and fill in static inlines
for other archs
2) tweak the YIELD_THRESHOLD value.

Ideas/suggestions welcome

Here is the try_yield_to patch.
---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5127668..3fa912a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1557,12 +1557,17 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
  	mark_page_dirty_in_slot(kvm, memslot, gfn);
  }

+#define YIELD_THRESHOLD 2048
+static void kvm_vcpu_try_yield_to(struct kvm_vcpu *me);
  /*
   * The vCPU has executed a HLT instruction with in-kernel mode enabled.
   */
  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
  {
  	DEFINE_WAIT(wait);
+	unsigned int loop_count;
+
+	loop_count = 0;

  	for (;;) {
  		prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
@@ -1579,7 +1584,10 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
  		if (signal_pending(current))
  			break;

-		schedule();
+		if (loop_count++ % YIELD_THRESHOLD)
+			schedule();
+		else
+			kvm_vcpu_try_yield_to(vcpu);
  	}

  	finish_wait(&vcpu->wq, &wait);
@@ -1593,6 +1601,39 @@ void kvm_resched(struct kvm_vcpu *vcpu)
  }
  EXPORT_SYMBOL_GPL(kvm_resched);

+static void kvm_vcpu_try_yield(struct kvm_vcpu *me)
+{
+
+	struct kvm *kvm = me->kvm;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct task_struct *task = NULL;
+		struct pid *pid;
+		if (!vcpu->pv_unhalted)
+			continue;
+		if (waitqueue_active(&vcpu->wq))
+			continue;
+		rcu_read_lock();
+		pid = rcu_dereference(vcpu->pid);
+		if (pid)
+			task = get_pid_task(vcpu->pid, PIDTYPE_PID);
+		rcu_read_unlock();
+		if (!task)
+			continue;
+		if (task->flags & PF_VCPU) {
+			put_task_struct(task);
+			continue;
+		}
+		if (yield_to(task, 1)) {
+			put_task_struct(task);
+			break;
+		}
+		put_task_struct(task);
+	}
+}
+
  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
  {
  	struct kvm *kvm = me->kvm;
---


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-02  9:51                     ` Raghavendra K T
@ 2012-04-02 12:15                       ` Raghavendra K T
  2012-04-05  9:01                       ` Avi Kivity
  1 sibling, 0 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-04-02 12:15 UTC (permalink / raw)
  To: Avi Kivity, H. Peter Anvin
  Cc: Alan Meadows, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Marcelo Tosatti, KVM, Andi Kleen,
	Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

On 04/02/2012 03:21 PM, Raghavendra K T wrote:
> On 04/01/2012 07:23 PM, Avi Kivity wrote:
>  > On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>  >>>> I have patch something like below in mind to try:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5127668..3fa912a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1557,12 +1557,17 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
> mark_page_dirty_in_slot(kvm, memslot, gfn);
> }
>
> +#define YIELD_THRESHOLD 2048
> +static void kvm_vcpu_try_yield_to(struct kvm_vcpu *me);
> /*
> * The vCPU has executed a HLT instruction with in-kernel mode enabled.
> */
> void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> {
[...]
> + if (loop_count++ % YIELD_THRESHOLD)
> + schedule();
> + else
> + kvm_vcpu_try_yield_to(vcpu);
> }
>
> +static void kvm_vcpu_try_yield(struct kvm_vcpu *me)

Yes, it is kvm_vcpu_try_yield_to; I had changed the name just before
sending. Sorry.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-01 13:53                   ` Avi Kivity
  2012-04-01 13:56                     ` Raghavendra K T
  2012-04-02  9:51                     ` Raghavendra K T
@ 2012-04-05  8:43                     ` Raghavendra K T
  2 siblings, 0 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-04-05  8:43 UTC (permalink / raw)
  To: Avi Kivity, H. Peter Anvin, Ingo Molnar
  Cc: Alan Meadows, Linus Torvalds, Peter Zijlstra,
	the arch/x86 maintainers, LKML, Marcelo Tosatti, KVM, Andi Kleen,
	Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

On 04/01/2012 07:23 PM, Avi Kivity wrote:
> On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>>> I have patch something like below in mind to try:
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index d3b98b1..5127668 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>>         * else and called schedule in __vcpu_run.  Hopefully that
>>>>         * VCPU is holding the lock that we need and will release it.
>>>>         * We approximate round-robin by starting at the last boosted
>>>> VCPU.
>>>> +     * Priority is given to vcpu that are unhalted.
>>>>         */
>>>> -    for (pass = 0; pass<   2&&   !yielded; pass++) {
>>>> +    for (pass = 0; pass<   3&&   !yielded; pass++) {
>>>>            kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>                struct task_struct *task = NULL;
>>>>                struct pid *pid;
>>>> -            if (!pass&&   i<   last_boosted_vcpu) {
>>>> +            if (!pass&&   !vcpu->pv_unhalted)
>>>> +                continue;
>>>> +            else if (pass == 1&&   i<   last_boosted_vcpu) {
>>>>                    i = last_boosted_vcpu;
>>>>                    continue;
>>>> -            } else if (pass&&   i>   last_boosted_vcpu)
>>>> +            } else if (pass == 2&&   i>   last_boosted_vcpu)
>>>>                    break;
>>>>                if (vcpu == me)
>>>>                    continue;
>>>>
>>>

[...]

> I'm interested in how PLE does vs. your patches, both with PLE enabled
> and disabled.
>

  Here are the results taken on a PLE machine. The results seem to
support all our assumptions.
  Following are the observations from the results:

  1) There is a huge benefit for the non-PLE based configuration
(base_nople vs pv_ple) (around 90%).

  2) The ticketlock + kvm patches go well along with PLE (more benefit is
seen, not degradation) (base_ple vs pv_ple).

  3) The ticketlock + kvm patches make the machine behave almost like a
PLE enabled machine (base_ple vs pv_nople).

  4) The ple handler modification patches seem to give an advantage
(pv_ple vs pv_ple_optimized). More study is needed, probably with the
higher M/N ratio Avi pointed out.

  configurations:

  base_nople       = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=n - PLE
  base_ple         = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=n + PLE
  pv_ple           = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y + PLE + ticketlock + kvm patches
  pv_nople         = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y - PLE + ticketlock + kvm patches
  pv_ple_optimized = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y + PLE + optimization patch
                     + ticketlock + kvm patches + posted ple_handler modification (yield to kicked vcpu)

  Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU, 32 cores
(8 online) and 4*64GB RAM.

  3 guests running with 2GB RAM and 8 vCPUs each.

  Results:
  -------
  case A)
  1x: 1 kernbench 2 idle
  2x: 1 kernbench 1 while1 hog 1 idle
  3x: 1 kernbench 2 while1 hog

  Average time taken in sec for kernbench run (std). [lower is better]

        base_nople            base_ple            pv_ple              pv_nople            pv_ple_optimized

  1x    72.8284 (89.8757)     70.475 (85.6979)    63.5033 (72.7041)   65.7634 (77.0504)   64.3284 (73.2688)
  2x    823.053 (1113.05)     110.971 (132.829)   105.099 (128.738)   139.058 (165.156)   106.268 (129.611)
  3x    3244.37 (4707.61)     150.265 (184.766)   138.341 (172.69)    139.106 (163.549)   133.238 (168.388)


    Percentage improvement calculation w.r.t base_nople
    -------------------------------------------------

       base_ple  pv_ple    pv_nople pv_ple_optimized

  1x    3.23143  12.8042   9.70089   11.6713
  2x    86.5172  87.2306   83.1046   87.0886
  3x    95.3684  95.736    95.7124   95.8933

-------------------
    Percentage improvement calculation w.r.t base_ple
    -------------------------------------------------

       base_nople  pv_ple    pv_nople  pv_ple_optimized

   1x   -3.3393    9.89244   6.68549   8.72167
   2x   -641.683   5.29147   -25.3102  4.23804
   3x   -2059.1    7.93531   7.42621   11.3313


  case B)
  all 3 guests running kernbench
  Average time taken in sec for kernbench run (std). [lower is better]
  Note that std is calculated over the 6*3 run averages from all 3
guests, as reported by kernbench.

  base_nople            base_ple                pv_ple                  pv_nople              pv_ple_opt
  2886.92 (18.289131)   204.80333 (7.1784039)   200.22517 (10.134804)   202.091 (12.249673)   201.60683 (7.881737)


    Percentage improvement calculation w.r.t base_nople
    -------------------------------------------------

       base_ple   pv_ple    pv_nople   pv_ple_optimized
       92.9058    93.0644   93        93.0166


    Percentage improvement calculation w.r.t base_ple
    -------------------------------------------------

       base_nople   pv_ple    pv_nople   pv_ple_optimized
       -1309.606    2.2354    1.324     1.5607

  I hope the experimental results will convey the same message if
somebody else does the benchmarking.

  Also, as Ian pointed out in the thread, the earlier results from
Attilio and me were meant to show that the framework is acceptable on
native.

  Does this convince you to consider it for the next merge window?

  Comments/suggestions please...


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-02  9:51                     ` Raghavendra K T
  2012-04-02 12:15                       ` Raghavendra K T
@ 2012-04-05  9:01                       ` Avi Kivity
  2012-04-05 10:40                         ` Raghavendra K T
  1 sibling, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2012-04-05  9:01 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: H. Peter Anvin, Alan Meadows, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 04/02/2012 12:51 PM, Raghavendra K T wrote:
> On 04/01/2012 07:23 PM, Avi Kivity wrote:
> > On 04/01/2012 04:48 PM, Raghavendra K T wrote:
> >>>> I have patch something like below in mind to try:
> >>>>
> >>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >>>> index d3b98b1..5127668 100644
> >>>> --- a/virt/kvm/kvm_main.c
> >>>> +++ b/virt/kvm/kvm_main.c
> >>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> >>>>         * else and called schedule in __vcpu_run.  Hopefully that
> >>>>         * VCPU is holding the lock that we need and will release it.
> >>>>         * We approximate round-robin by starting at the last boosted
> >>>> VCPU.
> >>>> +     * Priority is given to vcpu that are unhalted.
> >>>>         */
> >>>> -    for (pass = 0; pass<   2&&   !yielded; pass++) {
> >>>> +    for (pass = 0; pass<   3&&   !yielded; pass++) {
> >>>>            kvm_for_each_vcpu(i, vcpu, kvm) {
> >>>>                struct task_struct *task = NULL;
> >>>>                struct pid *pid;
> >>>> -            if (!pass&&   i<   last_boosted_vcpu) {
> >>>> +            if (!pass&&   !vcpu->pv_unhalted)
> >>>> +                continue;
> >>>> +            else if (pass == 1&&   i<   last_boosted_vcpu) {
> >>>>                    i = last_boosted_vcpu;
> >>>>                    continue;
> >>>> -            } else if (pass&&   i>   last_boosted_vcpu)
> >>>> +            } else if (pass == 2&&   i>   last_boosted_vcpu)
> >>>>                    break;
> >>>>                if (vcpu == me)
> >>>>                    continue;
> >>>>
> >>>
> >>> Actually I think this is unneeded.  The loops tries to find vcpus
> that
> >>> are runnable but not running (vcpu_active(vcpu->wq)), and halted
> vcpus
> >>> don't match this condition.
> >>>
>
> Oh! I think I misinterpreted your statement. hmm I got it. you told to
> remove if (vcpu == me) condition.

No, the entire patch is unneeded.  My original comment was:

> from the PLE handler, don't wake up a vcpu that is sleeping because it
is waiting for a kick

But the PLE handler never wakes up sleeping vcpus anyway.

There's still a conflict with PLE in that it may trigger during the spin
phase and send a random yield_to() somewhere.  Maybe it's sufficient to
tune the PLE timeout to be longer than the spinlock timeout.
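To make the tuning concrete, a rough, purely illustrative
back-of-the-envelope check (all values are placeholders, not the real
defaults; ple_window is the kvm_intel module parameter):

	#include <linux/types.h>

	#define GUEST_SPIN_THRESHOLD	(1U << 11)	/* guest spin iterations, placeholder */
	#define CYCLES_PER_SPIN_ITER	8U		/* assumed cost of one spin-loop pass */
	#define HOST_PLE_WINDOW		4096U		/* assumed kvm_intel ple_window value */

	/* True if PLE would fire before the guest reaches its halt slowpath. */
	static inline bool ple_preempts_pv_slowpath(void)
	{
		return HOST_PLE_WINDOW <
		       (u64)GUEST_SPIN_THRESHOLD * CYCLES_PER_SPIN_ITER;
	}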


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-02  9:26       ` Thomas Gleixner
@ 2012-04-05  9:15         ` Avi Kivity
  0 siblings, 0 replies; 55+ messages in thread
From: Avi Kivity @ 2012-04-05  9:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Raghavendra K T, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 04/02/2012 12:26 PM, Thomas Gleixner wrote:
> > One thing about it is that it can give many false positives.  Consider a
> > fine-grained spinlock that is being accessed by many threads.  That is,
> > the lock is taken and released with high frequency, but there is no
> > contention, because each vcpu is accessing a different instance.  So the
> > host scheduler will needlessly delay preemption of vcpus that happen to
> > be holding a lock, even though this gains nothing.
>
> You're talking about per cpu locks, right? I can see the point there,
> but per cpu stuff might be worth annotating to avoid this.

Or any lock which is simply uncontended.  Say a single process is
running, the rest of the system is idle.  It will take and release many
locks; but it can be preempted at any point by the hypervisor with no
performance loss.

The overhead is arming a timer twice and an extra exit per deferred
context switch.  Perhaps not much given that you don't see tons of
context switches on virt workloads, at least without threaded interrupts
(or maybe interrupt threads should override this mechanism, by being
realtime threads).

> > A second issue may happen with a lock that is taken and released with
> > high frequency, with a high hold percentage.  The host scheduler may
> > always sample the guest in a held state, leading it to conclude that
> > it's exceeding its timeout when in fact the lock is held for a short
> > time only.
>
> Well, no. You can avoid that.
>
> host		VCPU
> 		spin_lock()
> 		 modify_global_state()
>    	exit
> check_global_state()
> mark_global_state()
> reschedule vcpu
>
> 		spin_unlock()
> 		 check_global_state()
> 		  trigger_exit()
>
> So when an exit occurs in the locked section, the host can mark
> the global state to tell the guest to issue a trap on unlock.

Right.

How does this nest?  Do we trigger the exit on the outermost unlock?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-05  9:01                       ` Avi Kivity
@ 2012-04-05 10:40                         ` Raghavendra K T
  0 siblings, 0 replies; 55+ messages in thread
From: Raghavendra K T @ 2012-04-05 10:40 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Alan Meadows, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Marcelo Tosatti,
	KVM, Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Srivatsa Vaddagiri, Stefano Stabellini, Attilio Rao

On 04/05/2012 02:31 PM, Avi Kivity wrote:
> On 04/02/2012 12:51 PM, Raghavendra K T wrote:
>> On 04/01/2012 07:23 PM, Avi Kivity wrote:
>>> On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>>>>> I have patch something like below in mind to try:
>>>>>>
>>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>>> index d3b98b1..5127668 100644
>>>>>> --- a/virt/kvm/kvm_main.c
>>>>>> +++ b/virt/kvm/kvm_main.c
>>>>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>>>>          * else and called schedule in __vcpu_run.  Hopefully that
>>>>>>          * VCPU is holding the lock that we need and will release it.
>>>>>>          * We approximate round-robin by starting at the last boosted
>>>>>> VCPU.
>>>>>> +     * Priority is given to vcpu that are unhalted.
>>>>>>          */
>>>>>> -    for (pass = 0; pass<    2&&    !yielded; pass++) {
>>>>>> +    for (pass = 0; pass<    3&&    !yielded; pass++) {
>>>>>>             kvm_for_each_vcpu(i, vcpu, kvm) {
>>>>>>                 struct task_struct *task = NULL;
>>>>>>                 struct pid *pid;
>>>>>> -            if (!pass&&    i<    last_boosted_vcpu) {
>>>>>> +            if (!pass&&    !vcpu->pv_unhalted)
>>>>>> +                continue;
>>>>>> +            else if (pass == 1&&    i<    last_boosted_vcpu) {
>>>>>>                     i = last_boosted_vcpu;
>>>>>>                     continue;
>>>>>> -            } else if (pass&&    i>    last_boosted_vcpu)
>>>>>> +            } else if (pass == 2&&    i>    last_boosted_vcpu)
>>>>>>                     break;
>>>>>>                 if (vcpu == me)
>>>>>>                     continue;
>>>>>>
>>>>>
>>>>> Actually I think this is unneeded.  The loops tries to find vcpus
>> that
>>>>> are runnable but not running (vcpu_active(vcpu->wq)), and halted
>> vcpus
>>>>> don't match this condition.
>>>>>
>>
>> Oh! I think I misinterpreted your statement. hmm I got it. you told to
>> remove if (vcpu == me) condition.
>
> No, the entire patch is unneeded.  My original comment was:
>
>> from the PLE handler, don't wake up a vcpu that is sleeping because it
> is waiting for a kick
>
> But the PLE handler never wakes up sleeping vcpus anyway.

I agree with you. It is already doing that. But my approach here is
a little different.

We have 2 classes of vcpus (one is a subset of the other when we try to
do yield_to), viz:

1) runnable and kicked  <  (subset of)  2) just runnable

What we are trying to do here is target 1) first so that we get good
lock progress.

Here is the sequence I was talking about:

vcpu1 releases lock -> finds that vcpu2 is the next candidate ->
kick hypercall -> kvm_pv_kick_cpu_op -> set kicked flag ->
vcpu->kick(vcpu2)

At this point vcpu2 is waiting to get scheduled, but the above
yield call can wake *anybody*.

I agree this is not serious (rather, it is overhead) when there are
fewer vcpus, but when we have more vcpus/VMs it may not scale well. My
attempt was to fix that.

Let me know if I am completely missing something..

>
> There's still a conflict with PLE in that it may trigger during the spin
> phase and send a random yield_to() somewhere.  Maybe it's sufficient to
> tune the PLE timeout to be longer than the spinlock timeout.
>

Ok ... But we should also be cautious that we may end up halting more
often, even though we are about to get the spinlock.
This needs more study.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-30 22:07   ` Thomas Gleixner
                       ` (4 preceding siblings ...)
  2012-04-02  9:42     ` Ian Campbell
@ 2012-04-11  1:29     ` Marcelo Tosatti
  5 siblings, 0 replies; 55+ messages in thread
From: Marcelo Tosatti @ 2012-04-11  1:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Raghavendra K T, Ingo Molnar, Linus Torvalds,
	Peter Zijlstra, the arch/x86 maintainers, LKML, Avi Kivity, KVM,
	Andi Kleen, Xen Devel, Konrad Rzeszutek Wilk, Virtualization,
	Jeremy Fitzhardinge, Stephan Diestelhorst, Srivatsa Vaddagiri,
	Stefano Stabellini, Attilio Rao

On Sat, Mar 31, 2012 at 12:07:58AM +0200, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, H. Peter Anvin wrote:
> 
> > What is the current status of this patchset?  I haven't looked at it too
> > closely because I have been focused on 3.4 up until now...
> 
> The real question is whether these heuristics are the correct approach
> or not.
> 
> If I look at it from the non virtualized kernel side then this is ass
> backwards. We know already that we are holding a spinlock which might
> cause other (v)cpus going into eternal spin. The non virtualized
> kernel solves this by disabling preemption and therefor getting out of
> the critical section as fast as possible,
> 
> The virtualization problem reminds me a lot of the problem which RT
> kernels are observing where non raw spinlocks are turned into
> "sleeping spinlocks" and therefor can cause throughput issues for non
> RT workloads.
> 
> Though the virtualized situation is even worse. Any preempted guest
> section which holds a spinlock is prone to cause unbound delays.
> 
> The paravirt ticketlock solution can only mitigate the problem, but
> not solve it. With massive overcommit there is always a way to trigger
> worst case scenarious unless you are educating the scheduler to cope
> with that.
> 
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by this
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.
> 
> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
> 
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to indicate the guest to invoke an exit when the last nested
> lock has been released.

Remember that the host is scheduling other processes than the vcpus of
guests.

The case where a higher priority task (whatever that task is) interrupts
a vcpu which holds a spinlock should be frequent in an overcommit
scenario. Whenever that is the case, other vcpus _must_ be able to stop
spinning.

Now extrapolate that to guests with a large number of vcpus. There is no
replacement for sleep-in-hypervisor-instead-of-spin.

> Of course this needs to be time bound, so a rogue guest cannot
> monopolize the cpu forever, but that's the least to worry about
> problem simply because a guest which does not get out of a spinlocked
> region within a certain amount of time is borked and elegible to
> killing anyway.
> 
> Thoughts ?
> 
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-03-31  4:07     ` Srivatsa Vaddagiri
  2012-03-31  4:09       ` Srivatsa Vaddagiri
@ 2012-04-16 15:44       ` Konrad Rzeszutek Wilk
  2012-04-16 16:36         ` [Xen-devel] " Ian Campbell
  1 sibling, 1 reply; 55+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-04-16 15:44 UTC (permalink / raw)
  To: Srivatsa Vaddagiri, Jeremy Fitzhardinge
  Cc: Thomas Gleixner, H. Peter Anvin, Raghavendra K T, Ingo Molnar,
	Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers, LKML,
	Avi Kivity, Marcelo Tosatti, KVM, Andi Kleen, Xen Devel,
	Virtualization, Jeremy Fitzhardinge, Stephan Diestelhorst,
	Stefano Stabellini, Attilio Rao

On Sat, Mar 31, 2012 at 09:37:45AM +0530, Srivatsa Vaddagiri wrote:
> * Thomas Gleixner <tglx@linutronix.de> [2012-03-31 00:07:58]:
> 
> > I know that Peter is going to go berserk on me, but if we are running
> > a paravirt guest then it's simple to provide a mechanism which allows
> > the host (aka hypervisor) to check that in the guest just by looking
> > at some global state.
> > 
> > So if a guest exits due to an external event it's easy to inspect the
> > state of that guest and avoid to schedule away when it was interrupted
> > in a spinlock held section. That guest/host shared state needs to be
> > modified to indicate the guest to invoke an exit when the last nested
> > lock has been released.
> 
> I had attempted something like that long back:
> 
> http://lkml.org/lkml/2010/6/3/4
> 
> The issue is with ticketlocks though. VCPUs could go into a spin w/o
> a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
> that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
> and releases the lock. VCPU1 is next eligible to take the lock. If 
> that is not scheduled early enough by host, then remaining vcpus would keep 
> spinning (even though lock is technically not held by anybody) w/o making 
> forward progress.
> 
> In that situation, what we really need is for the guest to hint to host
> scheduler to schedule VCPU1 early (via yield_to or something similar). 
> 
> The current pv-spinlock patches however does not track which vcpu is
> spinning at what head of the ticketlock. I suppose we can consider 
> that optimization in future and see how much benefit it provides (over
> plain yield/sleep the way its done now).

Right. I think Jeremy played around with this some time?
> 
> Do you see any issues if we take in what we have today and address the
> finer-grained optimization as next step?

I think that is the proper course - these patches show
that on baremetal we don't incur performance regressions and in
virtualization case we benefit greatly. Since these are the basic
building blocks of a kernel - taking it slow and just adding
this set of patches for v3.5 is a good idea - and then building on top
of that for further refinement.

> 
> - vatsa 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-16 15:44       ` Konrad Rzeszutek Wilk
@ 2012-04-16 16:36         ` Ian Campbell
  2012-04-16 16:42           ` Jeremy Fitzhardinge
  2012-04-17  2:54           ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 55+ messages in thread
From: Ian Campbell @ 2012-04-16 16:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Srivatsa Vaddagiri, Jeremy Fitzhardinge, Xen Devel,
	the arch/x86 maintainers, KVM, Stefano Stabellini,
	Peter Zijlstra, Raghavendra K T, LKML, Marcelo Tosatti,
	Andi Kleen, Avi Kivity, Jeremy Fitzhardinge, H. Peter Anvin,
	Attilio Rao, Thomas Gleixner, Virtualization, Linus Torvalds,
	Ingo Molnar, Stephan Diestelhorst

On Mon, 2012-04-16 at 16:44 +0100, Konrad Rzeszutek Wilk wrote:
> On Sat, Mar 31, 2012 at 09:37:45AM +0530, Srivatsa Vaddagiri wrote:
> > * Thomas Gleixner <tglx@linutronix.de> [2012-03-31 00:07:58]:
> > 
> > > I know that Peter is going to go berserk on me, but if we are running
> > > a paravirt guest then it's simple to provide a mechanism which allows
> > > the host (aka hypervisor) to check that in the guest just by looking
> > > at some global state.
> > > 
> > > So if a guest exits due to an external event it's easy to inspect the
> > > state of that guest and avoid to schedule away when it was interrupted
> > > in a spinlock held section. That guest/host shared state needs to be
> > > modified to indicate the guest to invoke an exit when the last nested
> > > lock has been released.
> > 
> > I had attempted something like that long back:
> > 
> > http://lkml.org/lkml/2010/6/3/4
> > 
> > The issue is with ticketlocks though. VCPUs could go into a spin w/o
> > a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
> > that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
> > and releases the lock. VCPU1 is next eligible to take the lock. If 
> > that is not scheduled early enough by host, then remaining vcpus would keep 
> > spinning (even though lock is technically not held by anybody) w/o making 
> > forward progress.
> > 
> > In that situation, what we really need is for the guest to hint to host
> > scheduler to schedule VCPU1 early (via yield_to or something similar). 
> > 
> > The current pv-spinlock patches however does not track which vcpu is
> > spinning at what head of the ticketlock. I suppose we can consider 
> > that optimization in future and see how much benefit it provides (over
> > plain yield/sleep the way its done now).
> 
> Right. I think Jeremy played around with this some time?

5/11 "xen/pvticketlock: Xen implementation for PV ticket locks" tracks
which vcpus are waiting for a lock in "cpumask_t waiting_cpus" and
tracks which lock each is waiting for in per-cpu "lock_waiting". This is
used in xen_unlock_kick to kick the right CPU. There's a loop over only
the waiting cpus to figure out who to kick.
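Roughly, the structures and the kick loop look like this (a
reconstructed sketch, not copied verbatim from the patch; the helpers
xen_send_IPI_one and XEN_SPIN_UNLOCK_VECTOR come from the existing Xen
code, other details may differ):

	#include <linux/percpu.h>
	#include <linux/cpumask.h>
	#include <linux/compiler.h>
	#include <asm/spinlock_types.h>

	struct xen_lock_waiting {
		struct arch_spinlock *lock;	/* lock this cpu is spinning on    */
		__ticket_t want;		/* ticket number it is waiting for */
	};
	static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
	static cpumask_t waiting_cpus;

	static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
	{
		int cpu;

		for_each_cpu(cpu, &waiting_cpus) {
			const struct xen_lock_waiting *w = &per_cpu(lock_waiting, cpu);

			/* Kick only the cpu waiting on this lock for ticket 'next'. */
			if (ACCESS_ONCE(w->lock) == lock && w->want == next) {
				xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
				break;
			}
		}
	}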

> > 
> > Do you see any issues if we take in what we have today and address the
> > finer-grained optimization as next step?
> 
> I think that is the proper course - these patches show
> that on baremetal we don't incur performance regressions and in
> virtualization case we benefit greatly. Since these are the basic
> building blocks of a kernel - taking it slow and just adding
> this set of patches for v3.5 is a good idea - and then building on top
> of that for further refinement.
> 
> > 
> > - vatsa 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-16 16:36         ` [Xen-devel] " Ian Campbell
@ 2012-04-16 16:42           ` Jeremy Fitzhardinge
  2012-04-17  2:54           ` Srivatsa Vaddagiri
  1 sibling, 0 replies; 55+ messages in thread
From: Jeremy Fitzhardinge @ 2012-04-16 16:42 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Konrad Rzeszutek Wilk, Srivatsa Vaddagiri, Xen Devel,
	the arch/x86 maintainers, KVM, Stefano Stabellini,
	Peter Zijlstra, Raghavendra K T, LKML, Marcelo Tosatti,
	Andi Kleen, Avi Kivity, Jeremy Fitzhardinge, H. Peter Anvin,
	Attilio Rao, Thomas Gleixner, Virtualization, Linus Torvalds,
	Ingo Molnar, Stephan Diestelhorst

On 04/16/2012 09:36 AM, Ian Campbell wrote:
> On Mon, 2012-04-16 at 16:44 +0100, Konrad Rzeszutek Wilk wrote:
>> On Sat, Mar 31, 2012 at 09:37:45AM +0530, Srivatsa Vaddagiri wrote:
>>> * Thomas Gleixner <tglx@linutronix.de> [2012-03-31 00:07:58]:
>>>
>>>> I know that Peter is going to go berserk on me, but if we are running
>>>> a paravirt guest then it's simple to provide a mechanism which allows
>>>> the host (aka hypervisor) to check that in the guest just by looking
>>>> at some global state.
>>>>
>>>> So if a guest exits due to an external event it's easy to inspect the
>>>> state of that guest and avoid to schedule away when it was interrupted
>>>> in a spinlock held section. That guest/host shared state needs to be
>>>> modified to indicate the guest to invoke an exit when the last nested
>>>> lock has been released.
>>> I had attempted something like that long back:
>>>
>>> http://lkml.org/lkml/2010/6/3/4
>>>
>>> The issue is with ticketlocks though. VCPUs could go into a spin w/o
>>> a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
>>> that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
>>> and releases the lock. VCPU1 is next eligible to take the lock. If 
>>> that is not scheduled early enough by host, then remaining vcpus would keep 
>>> spinning (even though lock is technically not held by anybody) w/o making 
>>> forward progress.
>>>
>>> In that situation, what we really need is for the guest to hint to host
>>> scheduler to schedule VCPU1 early (via yield_to or something similar). 
>>>
>>> The current pv-spinlock patches however does not track which vcpu is
>>> spinning at what head of the ticketlock. I suppose we can consider 
>>> that optimization in future and see how much benefit it provides (over
>>> plain yield/sleep the way its done now).
>> Right. I think Jeremy played around with this some time?
> 5/11 "xen/pvticketlock: Xen implementation for PV ticket locks" tracks
> which vcpus are waiting for a lock in "cpumask_t waiting_cpus" and
> tracks which lock each is waiting for in per-cpu "lock_waiting". This is
> used in xen_unlock_kick to kick the right CPU. There's a loop over only
> the waiting cpus to figure out who to kick.

Yes, and AFAIK the KVM pv-ticketlock patches do the same thing.  If a
(V)CPU is asleep, then sending it a kick is pretty much equivalent to a
yield to (not precisely, but it should get scheduled soon enough, and it
won't be competing with a pile of VCPUs with no useful work to do).

    J

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks
  2012-04-16 16:36         ` [Xen-devel] " Ian Campbell
  2012-04-16 16:42           ` Jeremy Fitzhardinge
@ 2012-04-17  2:54           ` Srivatsa Vaddagiri
  1 sibling, 0 replies; 55+ messages in thread
From: Srivatsa Vaddagiri @ 2012-04-17  2:54 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, Xen Devel,
	the arch/x86 maintainers, KVM, Stefano Stabellini,
	Peter Zijlstra, Raghavendra K T, LKML, Marcelo Tosatti,
	Andi Kleen, Avi Kivity, Jeremy Fitzhardinge, H. Peter Anvin,
	Attilio Rao, Thomas Gleixner, Virtualization, Linus Torvalds,
	Ingo Molnar, Stephan Diestelhorst

* Ian Campbell <Ian.Campbell@citrix.com> [2012-04-16 17:36:35]:

> > > The current pv-spinlock patches however does not track which vcpu is
> > > spinning at what head of the ticketlock. I suppose we can consider 
> > > that optimization in future and see how much benefit it provides (over
> > > plain yield/sleep the way its done now).
> > 
> > Right. I think Jeremy played around with this some time?
> 
> 5/11 "xen/pvticketlock: Xen implementation for PV ticket locks" tracks
> which vcpus are waiting for a lock in "cpumask_t waiting_cpus" and
> tracks which lock each is waiting for in per-cpu "lock_waiting". This is
> used in xen_unlock_kick to kick the right CPU. There's a loop over only
> the waiting cpus to figure out who to kick.

Yes, sorry, that's right. We do track who is waiting on which lock at
what position. This can be used to pass directed yield hints to the host
scheduler (in a future optimization patch). What we don't track is the
vcpu owning a lock, which would have allowed other spinning vcpus to do
a directed yield to a vcpu preempted while holding a lock. OTOH that may
be unnecessary if we put in support for deferring preemption of vcpus
that are holding locks.
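As a rough sketch of what that extra owner tracking could look like
(hypothetical, not part of the posted series; hypervisor_yield_to_vcpu()
is a made-up hypercall standing in for a directed-yield interface):

	#include <linux/smp.h>
	#include <linux/compiler.h>
	#include <asm/spinlock.h>

	/* Hypothetical wrapper that records the owning (v)cpu of a lock. */
	struct pv_owned_lock {
		arch_spinlock_t lock;
		int owner_cpu;			/* -1 when unlocked */
	};

	static inline void pv_owned_lock_acquire(struct pv_owned_lock *l)
	{
		arch_spin_lock(&l->lock);
		l->owner_cpu = smp_processor_id();
	}

	static inline void pv_owned_lock_release(struct pv_owned_lock *l)
	{
		l->owner_cpu = -1;
		arch_spin_unlock(&l->lock);
	}

	/* Called from the spin slowpath: ask the host to run the owner. */
	static inline void pv_yield_to_owner(struct pv_owned_lock *l)
	{
		int owner = ACCESS_ONCE(l->owner_cpu);

		if (owner >= 0 && owner != smp_processor_id())
			hypervisor_yield_to_vcpu(owner);	/* hypothetical */
	}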

- vatsa


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2012-04-17  2:55 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-21 10:20 [PATCH RFC V6 0/11] Paravirtualized ticketlocks Raghavendra K T
2012-03-21 10:20 ` [PATCH RFC V6 1/11] x86/spinlock: replace pv spinlocks with pv ticketlocks Raghavendra K T
2012-03-21 13:04   ` Attilio Rao
2012-03-21 13:22     ` Stephan Diestelhorst
2012-03-21 13:49       ` Attilio Rao
2012-03-21 14:25         ` Stephan Diestelhorst
2012-03-21 14:33           ` Attilio Rao
2012-03-21 14:49             ` Raghavendra K T
2012-03-21 10:21 ` [PATCH RFC V6 2/11] x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks Raghavendra K T
2012-03-21 17:13   ` Linus Torvalds
2012-03-22 10:06     ` Raghavendra K T
2012-03-21 10:21 ` [PATCH RFC V6 3/11] x86/ticketlock: collapse a layer of functions Raghavendra K T
2012-03-21 10:21 ` [PATCH RFC V6 4/11] xen: defer spinlock setup until boot CPU setup Raghavendra K T
2012-03-21 10:21 ` [PATCH RFC V6 5/11] xen/pvticketlock: Xen implementation for PV ticket locks Raghavendra K T
2012-03-21 10:21 ` [PATCH RFC V6 6/11] xen/pvticketlocks: add xen_nopvspin parameter to disable xen pv ticketlocks Raghavendra K T
2012-03-21 10:21 ` [PATCH RFC V6 7/11] x86/pvticketlock: use callee-save for lock_spinning Raghavendra K T
2012-03-21 10:22 ` [PATCH RFC V6 8/11] x86/pvticketlock: when paravirtualizing ticket locks, increment by 2 Raghavendra K T
2012-03-21 10:22 ` [PATCH RFC V6 9/11] x86/ticketlock: add slowpath logic Raghavendra K T
2012-03-21 10:22 ` [PATCH RFC V6 10/11] xen/pvticketlock: allow interrupts to be enabled while blocking Raghavendra K T
2012-03-21 10:22 ` [PATCH RFC V6 11/11] xen: enable PV ticketlocks on HVM Xen Raghavendra K T
2012-03-26 14:25 ` [PATCH RFC V6 0/11] Paravirtualized ticketlocks Avi Kivity
2012-03-27  7:37   ` Raghavendra K T
2012-03-28 16:37     ` Alan Meadows
     [not found]     ` <CAMy5W3foop40+R1RLv_JPhnO5ZmV90uMmNERYY-e3QCeaJfqLw@mail.gmail.com>
2012-03-28 18:21       ` Raghavendra K T
2012-03-29  9:58         ` Avi Kivity
2012-03-29 18:03           ` Raghavendra K T
2012-03-30 10:07             ` Raghavendra K T
2012-04-01 13:18               ` Avi Kivity
2012-04-01 13:48                 ` Raghavendra K T
2012-04-01 13:53                   ` Avi Kivity
2012-04-01 13:56                     ` Raghavendra K T
2012-04-02  9:51                     ` Raghavendra K T
2012-04-02 12:15                       ` Raghavendra K T
2012-04-05  9:01                       ` Avi Kivity
2012-04-05 10:40                         ` Raghavendra K T
2012-04-05  8:43                     ` Raghavendra K T
2012-03-30 20:26 ` H. Peter Anvin
2012-03-30 22:07   ` Thomas Gleixner
2012-03-30 22:18     ` Andi Kleen
2012-03-30 23:04       ` Thomas Gleixner
2012-03-31  0:08         ` Andi Kleen
2012-03-31  8:11           ` Ingo Molnar
2012-03-31  4:07     ` Srivatsa Vaddagiri
2012-03-31  4:09       ` Srivatsa Vaddagiri
2012-04-16 15:44       ` Konrad Rzeszutek Wilk
2012-04-16 16:36         ` [Xen-devel] " Ian Campbell
2012-04-16 16:42           ` Jeremy Fitzhardinge
2012-04-17  2:54           ` Srivatsa Vaddagiri
2012-04-01 13:31     ` Avi Kivity
2012-04-02  9:26       ` Thomas Gleixner
2012-04-05  9:15         ` Avi Kivity
2012-04-02  4:36     ` [Xen-devel] " Juergen Gross
2012-04-02  9:42     ` Ian Campbell
2012-04-11  1:29     ` Marcelo Tosatti
2012-03-31  0:51   ` Raghavendra K T

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).