* [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support
@ 2014-03-12 18:54 Waiman Long
  2014-03-12 18:54 ` [PATCH v6 01/11] qspinlock: A generic 4-byte queue spinlock implementation Waiman Long
                   ` (21 more replies)
  0 siblings, 22 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Rik van Riel,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Boris Ostrovsky,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	linux-kernel, David Vrabel, Andrew

v5->v6:
 - Change the optimized 2-task contending code to make it fairer at the
   expense of a bit of performance.
 - Add a patch to support unfair queue spinlock for Xen.
 - Modify the PV qspinlock code to follow what was done in the PV
   ticketlock.
 - Add performance data for the unfair lock as well as the PV
   support code.

v4->v5:
 - Move the optimized 2-task contending code to the generic file to
   enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized
   execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring
   that the lock holder and queue head stay alive as much as possible.

v3->v4:
 - Remove debugging code and fix a configuration error
 - Simplify the qspinlock structure and streamline the code to make it
   perform a bit better
 - Add an x86 version of asm/qspinlock.h for holding x86 specific
   optimization.
 - Add an optimized x86 code path for 2 contending tasks to improve
   low contention performance.

v2->v3:
 - Simplify the code by using the numerous mode only, without an unfair option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user
   configuration.
 - Additional performance tuning.

v1->v2:
 - Add some more comments to document what the code does.
 - Add a numerous CPU mode to support >= 16K CPUs.
 - Add a configuration option to allow lock stealing which can further
   improve performance in many cases.
 - Enable wakeup of queue head CPU at unlock time for non-numerous
   CPU mode.

This patch set has 3 different sections:
 1) Patches 1-4: Introduce a queue-based spinlock implementation that
    can replace the default ticket spinlock without increasing the
    size of the spinlock data structure. As a result, critical kernel
    data structures that embed a spinlock won't increase in size or
    break data alignment.
 2) Patches 5-7: Enable the use of an unfair queue spinlock in a
    para-virtualized execution environment. This can resolve some of
    the locking-related performance issues caused by the next CPU in
    line having been scheduled out for a period of time.
 3) Patches 8-11: Enable qspinlock para-virtualization support by
    halting the waiting CPUs after spinning for a certain amount of
    time. The unlock code will detect a sleeping waiter and wake it
    up. This is essentially the same logic as the PV ticketlock code;
    a rough sketch of the waiting strategy is shown below.
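
A hypothetical sketch of that spin-then-halt waiting strategy (not the
actual code in patches 8-11; the halt_this_cpu() helper and the spin
threshold below are made-up placeholders for illustration):

	/*
	 * Illustration only. Assumes a kernel context and the struct
	 * qnode introduced in patch 1 (fields: wait, next).
	 */
	#define PV_SPIN_THRESHOLD	(1 << 12)	/* assumed spin count */

	static void pv_wait_node_sketch(struct qnode *node)
	{
		int loop;

		while (ACCESS_ONCE(node->wait)) {
			for (loop = 0; loop < PV_SPIN_THRESHOLD; loop++) {
				if (!ACCESS_ONCE(node->wait))
					return;	/* now the queue head */
				arch_mutex_cpu_relax();
			}
			/*
			 * Give up the CPU; the unlock path detects the
			 * sleeping waiter and kicks it when its turn comes.
			 */
			halt_this_cpu();	/* hypothetical halt/hypercall wrapper */
		}
	}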

Patches 1-8 are fully tested and ready for production. Patches 9-11
may still need further testing and tuning.

The queue spinlock has slightly better performance than the ticket
spinlock in the uncontended case. Its performance can be much better
with moderate to heavy contention.  This patch has the potential to
improve the performance of all workloads that have moderate to heavy
spinlock contention.

The queue spinlock is especially suitable for NUMA machines with at
least 2 sockets, though a noticeable performance benefit probably won't
show up on machines with fewer than 4 sockets.

The purpose of this patch set is not to solve any particular spinlock
contention problem. Those need to be solved by refactoring the code
to make more efficient use of the lock or to use finer-grained locks.
The main purpose is to make lock contention problems more tolerable
until someone can spend the time and effort to fix them.

Waiman Long (11):
  qspinlock: A generic 4-byte queue spinlock implementation
  qspinlock, x86: Enable x86-64 to use queue spinlock
  qspinlock: More optimized code for smaller NR_CPUS
  qspinlock: Optimized code path for 2 contending tasks
  pvqspinlock, x86: Allow unfair spinlock in a PV guest
  pvqspinlock, x86: Allow unfair queue spinlock in a KVM guest
  pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest
  pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
  pvqspinlock, x86: Add qspinlock para-virtualization support
  pvqspinlock, x86: Enable qspinlock PV support for KVM
  pvqspinlock, x86: Enable qspinlock PV support for XEN

 arch/x86/Kconfig                      |   12 +
 arch/x86/include/asm/paravirt.h       |   12 +-
 arch/x86/include/asm/paravirt_types.h |   12 +
 arch/x86/include/asm/pvqspinlock.h    |  232 +++++++++++
 arch/x86/include/asm/qspinlock.h      |  161 ++++++++
 arch/x86/include/asm/spinlock.h       |    9 +-
 arch/x86/include/asm/spinlock_types.h |    4 +
 arch/x86/kernel/Makefile              |    1 +
 arch/x86/kernel/kvm.c                 |  106 +++++-
 arch/x86/kernel/paravirt-spinlocks.c  |   16 +-
 arch/x86/xen/setup.c                  |   19 +
 arch/x86/xen/spinlock.c               |   97 +++++-
 include/asm-generic/qspinlock.h       |  122 ++++++
 include/asm-generic/qspinlock_types.h |   61 +++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  698 +++++++++++++++++++++++++++++++++
 17 files changed, 1558 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/pvqspinlock.h
 create mode 100644 arch/x86/include/asm/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH v6 01/11] qspinlock: A generic 4-byte queue spinlock implementation
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` Waiman Long
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Rik van Riel,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Boris Ostrovsky,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	linux-kernel, David Vrabel, Andrew

This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in the
single-thread case, and it can be much faster in high-contention
situations, especially when the spinlock is embedded within the data
structure to be protected.

Only under light to moderate contention, where the average queue depth
is around 1-3, may this queue spinlock be a bit slower due to the
higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores, as the chance of spinlock contention is much higher
on those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
on one spinlock at any one time. Of course, an interrupt handler can
try to acquire one spinlock while the interrupted user process is in
the process of getting another spinlock. By allocating a set of
per-cpu queue nodes and using them to form a waiting queue, we can
encode the queue node address into a much smaller 24-bit code.
Together with the 1-byte locked field, this queue spinlock
implementation only needs 4 bytes to hold all the information that
it needs.

The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7  : the locked byte
Bits 8-9  : queue node index in the per-cpu array (4 entries)
Bits 10-31: cpu number + 1 (max cpus = 4M -1)
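
A minimal sketch of this encoding (the helper names below are made up
for illustration; the actual helpers in this patch are
queue_encode_qcode() and xlate_qcode()):

	/* Illustration of the 32-bit word layout described above */
	#define SKETCH_LOCKED_MASK	0xffU	/* bits 0-7  : locked byte    */
	#define SKETCH_IDX_SHIFT	8	/* bits 8-9  : per-cpu index  */
	#define SKETCH_CPU_SHIFT	10	/* bits 10-31: cpu number + 1 */

	static inline u32 sketch_encode_qcode(u32 cpu_nr, u32 qn_idx)
	{
		return ((cpu_nr + 1) << SKETCH_CPU_SHIFT) |
		       (qn_idx << SKETCH_IDX_SHIFT);
	}

	static inline void sketch_decode_qcode(u32 qcode, u32 *cpu_nr,
					       u32 *qn_idx)
	{
		*cpu_nr = (qcode >> SKETCH_CPU_SHIFT) - 1;
		*qn_idx = (qcode >> SKETCH_IDX_SHIFT) & 3;
	}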

For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4GHz Westmere x86-64 CPU.  The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

  Lock Type			Time (ns)
  ---------			---------
  Ticket spinlock		  14.1
  Queue spinlock		   8.8
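
A rough sketch of such a timing loop (the actual test harness is not
part of this series; the loop count and sched_clock()-based timing are
assumptions for illustration):

	/* Illustration only: time 256K uncontended lock/unlock pairs */
	static void lock_bench_sketch(arch_spinlock_t *lock)
	{
		u64 start, delta;	/* assumes a kernel context */
		int i;

		start = sched_clock();
		for (i = 0; i < 256 * 1024; i++) {
			arch_spin_lock(lock);
			arch_spin_unlock(lock);
		}
		delta = sched_clock() - start;
		/* 256K iterations, so delta >> 18 is the per-pair average */
		pr_info("avg lock/unlock time: %llu ns\n",
			(unsigned long long)(delta >> 18));
	}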

So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly due
to the fact that the ticket spinlock does a read-modify-write (add)
instruction at unlock time, whereas the queue spinlock only does a
simple write, which can be much faster on a pipelined CPU.
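
As a rough illustration of that difference (simplified sketches, not
the actual arch code; struct ticket_sketch is hypothetical, and the
single-store unlock roughly corresponds to the x86-specific variant
added later in the series, while the generic code in this patch uses
an atomic subtraction instead):

	/* Ticket unlock: a read-modify-write (add) on the lock word */
	struct ticket_sketch {
		u16 head;			/* ticket being served  */
		u16 tail;			/* next ticket to issue */
	};

	static inline void ticket_unlock_sketch(struct ticket_sketch *lock)
	{
		barrier();
		lock->head++;			/* add instruction */
	}

	/*
	 * Queue unlock: a single store clearing the locked byte is enough
	 * (assumes the locked byte is the low byte, i.e. little-endian).
	 */
	static inline void queue_unlock_sketch(struct qspinlock *lock)
	{
		barrier();
		ACCESS_ONCE(*(u8 *)lock) = 0;
	}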

The AIM7 benchmark was run on an 8-socket 80-core DL980 with Westmere
x86-64 CPUs, with an XFS filesystem on a ramdisk and HT off, to
evaluate the performance impact of this patch on a 3.13 kernel.

  +------------+----------+-----------------+---------+
  | Kernel     | 3.13 JPM |    3.13 with    | %Change |
  |            |          | qspinlock patch |	      |
  +------------+----------+-----------------+---------+
  |		      10-100 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   357459 |      363109     |  +1.58% |
  |dbase       |   496847 |      498801	    |  +0.39% |
  |disk        |  2925312 |     2771387     |  -5.26% |
  |five_sec    |   166612 |      169215     |  +1.56% |
  |fserver     |   382129 |      383279     |  +0.30% |
  |high_systime|    16356 |       16380     |  +0.15% |
  |short       |  4521978 |     4257363     |  -5.85% |
  +------------+----------+-----------------+---------+
  |		     200-1000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   449070 |      447711     |  -0.30% |
  |dbase       |   845029 |      853362	    |  +0.99% |
  |disk        |  2725249 |     4892907     | +79.54% |
  |five_sec    |   169410 |      170638     |  +0.72% |
  |fserver     |   489662 |      491828     |  +0.44% |
  |high_systime|   142823 |      143790     |  +0.68% |
  |short       |  7435288 |     9016171     | +21.26% |
  +------------+----------+-----------------+---------+
  |		     1100-2000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   432470 |      432570     |  +0.02% |
  |dbase       |   889289 |      890026	    |  +0.08% |
  |disk        |  2565138 |     5008732     | +95.26% |
  |five_sec    |   169141 |      170034     |  +0.53% |
  |fserver     |   498569 |      500701     |  +0.43% |
  |high_systime|   229913 |      245866     |  +6.94% |
  |short       |  8496794 |     8281918     |  -2.53% |
  +------------+----------+-----------------+---------+

The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:

 26.19%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--47.28%-- evict
              |--46.87%-- inode_sb_list_add
              |--1.24%-- xlog_cil_insert_items
              |--0.68%-- __remove_inode_hash
              |--0.67%-- inode_wait_for_writeback
               --3.26%-- [...]
 22.96%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  5.56%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  4.87%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.04%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.30%    reaim  [kernel.kallsyms]  [k] memcpy
  1.08%    reaim  [unknown]          [.] 0x0000003c52009447

There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.

With the patch, the perf profile at 1500 users became:

 26.82%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  4.66%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  3.97%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.40%    reaim  [kernel.kallsyms]  [k] queue_spin_lock_slowpath
              |--88.31%-- _raw_spin_lock
              |          |--36.02%-- inode_sb_list_add
              |          |--35.09%-- evict
              |          |--16.89%-- xlog_cil_insert_items
              |          |--6.30%-- try_to_wake_up
              |          |--2.20%-- _xfs_buf_find
              |          |--0.75%-- __remove_inode_hash
              |          |--0.72%-- __mutex_lock_slowpath
              |          |--0.53%-- load_balance
              |--6.02%-- _raw_spin_lock_irqsave
              |          |--74.75%-- down_trylock
              |          |--9.69%-- rcu_check_quiescent_state
              |          |--7.47%-- down
              |          |--3.57%-- up
              |          |--1.67%-- rwsem_wake
              |          |--1.00%-- remove_wait_queue
              |          |--0.56%-- pagevec_lru_move_fn
              |--5.39%-- _raw_spin_lock_irq
              |          |--82.05%-- rwsem_down_read_failed
              |          |--10.48%-- rwsem_down_write_failed
              |          |--4.24%-- __down
              |          |--2.74%-- __schedule
               --0.28%-- [...]
  2.20%    reaim  [kernel.kallsyms]  [k] memcpy
  1.84%    reaim  [unknown]          [.] 0x000000000041517b
  1.77%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--21.08%-- xlog_cil_insert_items
              |--10.14%-- xfs_icsb_modify_counters
              |--7.20%-- xfs_iget_cache_hit
              |--6.56%-- inode_sb_list_add
              |--5.49%-- _xfs_buf_find
              |--5.25%-- evict
              |--5.03%-- __remove_inode_hash
              |--4.64%-- __mutex_lock_slowpath
              |--3.78%-- selinux_inode_free_security
              |--2.95%-- xfs_inode_is_filestream
              |--2.35%-- try_to_wake_up
              |--2.07%-- xfs_inode_set_reclaim_tag
              |--1.52%-- list_lru_add
              |--1.16%-- xfs_inode_clear_eofblocks_tag
		  :
  1.30%    reaim  [kernel.kallsyms]  [k] effective_load
  1.27%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.10%    reaim  [kernel.kallsyms]  [k] security_compute_sid

On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/asm-generic/qspinlock.h       |  122 +++++++++++
 include/asm-generic/qspinlock_types.h |   55 +++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  373 +++++++++++++++++++++++++++++++++
 5 files changed, 558 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..08da60f
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & _QSPINLOCK_LOCKED;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !(atomic_read(&lock.qlcode) & _QSPINLOCK_LOCKED);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & ~_QSPINLOCK_LOCK_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->qlcode) &&
+	   (atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	int qsval;
+
+	/*
+	 * To reduce memory access to only once for the cold cache case,
+	 * a direct cmpxchg() is performed in the fastpath to optimize the
+	 * uncontended case. The contended performance, however, may suffer
+	 * a bit because of that.
+	 */
+	qsval = atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED);
+	if (likely(qsval == 0))
+		return;
+	queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	/*
+	 * Use an atomic subtraction to clear the lock bit.
+	 */
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QSPINLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializer
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queue_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queue_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queue_spin_lock(l)
+#define arch_spin_trylock(l)		queue_spin_trylock(l)
+#define arch_spin_unlock(l)		queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)	queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..df981d0
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,55 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file inclusion via paravirt_types.h. A workaround is
+ * to include paravirt_types.h here in this case.
+ */
+#ifdef CONFIG_PARAVIRT
+# include <asm/paravirt_types.h>
+#else
+# include <linux/types.h>
+# include <linux/atomic.h>
+#endif
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * For NR_CPUS >= 16K, the bit assignment is:
+ *   Bit  0   : Set if locked
+ *   Bits 1-7 : Not used
+ *   Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bit assignment is:
+ *   Bit   0   : Set if locked
+ *   Bits  1-7 : Not used
+ *   Bits  8-15: Reserved for architecture specific optimization
+ *   Bits 16-31: Queue code
+ */
+typedef struct qspinlock {
+	atomic_t	qlcode;	/* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET		8
+#define _QSPINLOCK_LOCKED	1U
+#define	_QSPINLOCK_LOCK_MASK	0xff
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..f185584 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+	bool
+
+config QUEUE_SPINLOCK
+	def_bool y if ARCH_USE_QUEUE_SPINLOCK
+	depends on SMP && !PARAVIRT_SPINLOCKS
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..e3b3293 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
 obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
 obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..f1a8102
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,373 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 cannot be used as a valid CPU number, we need to add
+ * 1 to it before putting it into the queue code.
+ */
+#define MAX_QNODES		4
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET	_QCODE_OFFSET
+#endif
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. It is retained for future extension where new
+ * fields may be added.
+ */
+struct qnode {
+	u32		 wait;		/* Waiting flag		*/
+	struct qnode	*next;		/* Next queue node addr */
+};
+
+struct qnode_set {
+	struct qnode	nodes[MAX_QNODES];
+	int		node_idx;	/* Current node to use */
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
+
+/**
+ * queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+			return 1;
+	return 0;
+}
+
+/*
+ ************************************************************************
+ * Inline functions used by the queue_spin_lock_slowpath() function	*
+ * that may get superseded by a more optimized version.			*
+ ************************************************************************
+ */
+
+#ifndef queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value (not used)
+ * Return : != 0 if lock is not available, = 0 if lock is free
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	*qcode = qlcode;
+	return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* queue_get_lock_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+#ifndef queue_encode_qcode
+/**
+ * queue_encode_qcode - Encode the CPU number & node index into a qnode code
+ * @cpu_nr: CPU number
+ * @qn_idx: Queue node index
+ * Return : A qnode code that can be saved into the qspinlock structure
+ */
+static inline u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx)
+{
+	return ((cpu_nr + 1) << (_QCODE_VAL_OFFSET + 2)) |
+		(qn_idx << _QCODE_VAL_OFFSET);
+}
+#endif /* queue_encode_qcode */
+
+#ifndef queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @ocode: Old queue code in the lock [OUT]
+ * @ncode: New queue code to be exchanged
+ * Return: 1 if lock is taken and so can release the queue node, 0 otherwise.
+ */
+static inline int queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
+{
+	ncode |= _QSPINLOCK_LOCKED;	/* Set lock bit */
+
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	*ocode = atomic_xchg(&lock->qlcode, ncode);
+
+	if (likely(*ocode & _QSPINLOCK_LOCKED)) {
+		*ocode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
+		return 0;
+	}
+	/*
+	 * It is possible that we may accidentally steal the lock during
+	 * the unlock-lock transition. If this is the case, we need to either
+	 * release it if not the head of the queue or get the lock and be
+	 * done with it.
+	 */
+	if (*ocode == 0) {
+		u32 qcode;
+
+		/*
+		 * Got the lock since it is at the head of the queue
+		 * Now try to atomically clear the queue code.
+		 */
+		qcode = atomic_cmpxchg(&lock->qlcode, ncode, _QSPINLOCK_LOCKED);
+		/*
+		 * The cmpxchg fails only if one or more tasks are added to
+		 * the queue. In this case, we set the *ocode to -1 to
+		 * indicate that more tasks are on queue.
+		 */
+		if (qcode != ncode)
+			*ocode = -1;
+		return 1;
+	}
+	/*
+	 * Accidentally stole the lock; release the lock and
+	 * let the queue head get it.
+	 */
+	queue_spin_unlock(lock);
+	return 0;
+}
+#endif /* queue_code_xchg */
+
+/*
+ ************************************************************************
+ * Other inline functions needed by the queue_spin_lock_slowpath()	*
+ * function.								*
+ ************************************************************************
+ */
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+	u32 cpu_nr = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+	u8  qn_idx = (qcode >> _QCODE_VAL_OFFSET) & 3;
+
+	return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+/**
+ * get_qnode - Get a queue node address
+ * @qn_idx: Pointer to queue node index [out]
+ * Return : queue node address & queue node index in qn_idx, or NULL if
+ *	    no free queue node available.
+ */
+static inline struct qnode *get_qnode(unsigned int *qn_idx)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+	int i;
+
+	if (unlikely(qset->node_idx >= MAX_QNODES))
+		return NULL;
+	i = qset->node_idx++;
+	*qn_idx = i;
+	return &qset->nodes[i];
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static inline void put_qnode(void)
+{
+	this_cpu_dec(qnset.node_idx);
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+	unsigned int cpu_nr, qn_idx;
+	struct qnode *node, *next;
+	u32 prev_qcode, my_qcode;
+
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = get_qnode(&qn_idx);
+
+	/*
+	 * It should never happen that all the queue nodes are being used.
+	 */
+	BUG_ON(!node);
+
+	/*
+	 * Set up the new cpu code to be exchanged
+	 */
+	my_qcode = queue_encode_qcode(cpu_nr, qn_idx);
+
+	/*
+	 * Initialize the queue node
+	 */
+	node->wait = true;
+	node->next = NULL;
+
+	/*
+	 * The lock may be available at this point, try again if no task was
+	 * waiting in the queue.
+	 */
+	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock))
+		goto release_node;
+
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	if (queue_code_xchg(lock, &prev_qcode, my_qcode)) {
+		/*
+		 * Lock acquired
+		 * A non-zero prev_qcode indicates that there are
+		 * additional CPUs queuing up in the queue.
+		 */
+		if (prev_qcode)
+			goto notify_next;
+		else
+			goto release_node;
+	}
+
+	if (prev_qcode) {
+		/*
+		 * Not at the queue head, get the address of the previous node
+		 * and set up the "next" field of that node.
+		 */
+		struct qnode *prev = xlate_qcode(prev_qcode);
+
+		ACCESS_ONCE(prev->next) = node;
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (smp_load_acquire(&node->wait))
+			arch_mutex_cpu_relax();
+	}
+
+	/*
+	 * At the head of the wait queue now
+	 */
+	while (true) {
+		u32 qcode;
+
+		if (queue_get_lock_qcode(lock, &qcode, my_qcode))
+			;	/* Lock not available yet */
+		else if (qcode != my_qcode) {
+			/*
+			 * Just get the lock with other spinners waiting
+			 * in the queue.
+			 */
+			if (queue_spin_setlock(lock))
+				goto notify_next;
+		} else {
+			/*
+			 * Get the lock & clear the queue code simultaneously
+			 */
+			if (queue_spin_trylock_and_clr_qcode(lock, qcode))
+				/* No need to notify the next one */
+				goto release_node;
+		}
+		arch_mutex_cpu_relax();
+	}
+
+notify_next:
+	/*
+	 * Wait, if needed, until the next one in queue sets up the next field
+	 */
+	while (!(next = ACCESS_ONCE(node->next)))
+		arch_mutex_cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+	smp_store_release(&next->wait, false);
+
+release_node:
+	put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 01/11] qspinlock: A generic 4-byte queue spinlock implementation
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
  2014-03-12 18:54 ` [PATCH v6 01/11] qspinlock: A generic 4-byte queue spinlock implementation Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` [PATCH v6 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock Waiman Long
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, linux-kernel, David Vrabel, Andrew Morton, Linu

This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suit to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
in one spinlock at any one time. Of course, interrupt handler can try
to acquire one spinlock while the interrupted user process is in the
process of getting another spinlock. By allocating a set of per-cpu
queue nodes and used them to form a waiting queue, we can encode the
queue node address into a much smaller 16-bit size. Together with
the 1-byte lock bit, this queue spinlock implementation will only
need 4 bytes to hold all the information that it needs.

The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7  : the locked byte
Bits 8-9  : queue node index in the per-cpu array (4 entries)
Bits 10-31: cpu number + 1 (max cpus = 4M -1)

For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4Ghz Westmere x86-64 CPU.  The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

  Lock Type			Time (ns)
  ---------			---------
  Ticket spinlock		  14.1
  Queue spinlock		   8.8

So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly
due to the fact that ticket spinlock does a read-modify-write (add)
instruction in unlock whereas queue spinlock only does a simple write
in unlock which can be much faster in a pipelined CPU.

The AIM7 benchmark was run on a 8-socket 80-core DL980 with Westmere
x86-64 CPUs with XFS filesystem on a ramdisk and HT off to evaluate
the performance impact of this patch on a 3.13 kernel.

  +------------+----------+-----------------+---------+
  | Kernel     | 3.13 JPM |    3.13 with    | %Change |
  |            |          | qspinlock patch |	      |
  +------------+----------+-----------------+---------+
  |		      10-100 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   357459 |      363109     |  +1.58% |
  |dbase       |   496847 |      498801	    |  +0.39% |
  |disk        |  2925312 |     2771387     |  -5.26% |
  |five_sec    |   166612 |      169215     |  +1.56% |
  |fserver     |   382129 |      383279     |  +0.30% |
  |high_systime|    16356 |       16380     |  +0.15% |
  |short       |  4521978 |     4257363     |  -5.85% |
  +------------+----------+-----------------+---------+
  |		     200-1000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   449070 |      447711     |  -0.30% |
  |dbase       |   845029 |      853362	    |  +0.99% |
  |disk        |  2725249 |     4892907     | +79.54% |
  |five_sec    |   169410 |      170638     |  +0.72% |
  |fserver     |   489662 |      491828     |  +0.44% |
  |high_systime|   142823 |      143790     |  +0.68% |
  |short       |  7435288 |     9016171     | +21.26% |
  +------------+----------+-----------------+---------+
  |		     1100-2000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   432470 |      432570     |  +0.02% |
  |dbase       |   889289 |      890026	    |  +0.08% |
  |disk        |  2565138 |     5008732     | +95.26% |
  |five_sec    |   169141 |      170034     |  +0.53% |
  |fserver     |   498569 |      500701     |  +0.43% |
  |high_systime|   229913 |      245866     |  +6.94% |
  |short       |  8496794 |     8281918     |  -2.53% |
  +------------+----------+-----------------+---------+

The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:

 26.19%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--47.28%-- evict
              |--46.87%-- inode_sb_list_add
              |--1.24%-- xlog_cil_insert_items
              |--0.68%-- __remove_inode_hash
              |--0.67%-- inode_wait_for_writeback
               --3.26%-- [...]
 22.96%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  5.56%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  4.87%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.04%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.30%    reaim  [kernel.kallsyms]  [k] memcpy
  1.08%    reaim  [unknown]          [.] 0x0000003c52009447

There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.

With the patch, the perf profile at 1500 users became:

 26.82%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  4.66%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  3.97%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.40%    reaim  [kernel.kallsyms]  [k] queue_spin_lock_slowpath
              |--88.31%-- _raw_spin_lock
              |          |--36.02%-- inode_sb_list_add
              |          |--35.09%-- evict
              |          |--16.89%-- xlog_cil_insert_items
              |          |--6.30%-- try_to_wake_up
              |          |--2.20%-- _xfs_buf_find
              |          |--0.75%-- __remove_inode_hash
              |          |--0.72%-- __mutex_lock_slowpath
              |          |--0.53%-- load_balance
              |--6.02%-- _raw_spin_lock_irqsave
              |          |--74.75%-- down_trylock
              |          |--9.69%-- rcu_check_quiescent_state
              |          |--7.47%-- down
              |          |--3.57%-- up
              |          |--1.67%-- rwsem_wake
              |          |--1.00%-- remove_wait_queue
              |          |--0.56%-- pagevec_lru_move_fn
              |--5.39%-- _raw_spin_lock_irq
              |          |--82.05%-- rwsem_down_read_failed
              |          |--10.48%-- rwsem_down_write_failed
              |          |--4.24%-- __down
              |          |--2.74%-- __schedule
               --0.28%-- [...]
  2.20%    reaim  [kernel.kallsyms]  [k] memcpy
  1.84%    reaim  [unknown]          [.] 0x000000000041517b
  1.77%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--21.08%-- xlog_cil_insert_items
              |--10.14%-- xfs_icsb_modify_counters
              |--7.20%-- xfs_iget_cache_hit
              |--6.56%-- inode_sb_list_add
              |--5.49%-- _xfs_buf_find
              |--5.25%-- evict
              |--5.03%-- __remove_inode_hash
              |--4.64%-- __mutex_lock_slowpath
              |--3.78%-- selinux_inode_free_security
              |--2.95%-- xfs_inode_is_filestream
              |--2.35%-- try_to_wake_up
              |--2.07%-- xfs_inode_set_reclaim_tag
              |--1.52%-- list_lru_add
              |--1.16%-- xfs_inode_clear_eofblocks_tag
		  :
  1.30%    reaim  [kernel.kallsyms]  [k] effective_load
  1.27%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.10%    reaim  [kernel.kallsyms]  [k] security_compute_sid

On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/asm-generic/qspinlock.h       |  122 +++++++++++
 include/asm-generic/qspinlock_types.h |   55 +++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  373 +++++++++++++++++++++++++++++++++
 5 files changed, 558 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..08da60f
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & _QSPINLOCK_LOCKED;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !(atomic_read(&lock.qlcode) & _QSPINLOCK_LOCKED);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & ~_QSPINLOCK_LOCK_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->qlcode) &&
+	   (atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	int qsval;
+
+	/*
+	 * To reduce memory access to only once for the cold cache case,
+	 * a direct cmpxchg() is performed in the fastpath to optimize the
+	 * uncontended case. The contended performance, however, may suffer
+	 * a bit because of that.
+	 */
+	qsval = atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED);
+	if (likely(qsval == 0))
+		return;
+	queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	/*
+	 * Use an atomic subtraction to clear the lock bit.
+	 */
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QSPINLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializier
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queue_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queue_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queue_spin_lock(l)
+#define arch_spin_trylock(l)		queue_spin_trylock(l)
+#define arch_spin_unlock(l)		queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)	queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..df981d0
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,55 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file incluson via paravirt_types.h. A workaround is
+ * to include paravirt_types.h here in this case.
+ */
+#ifdef CONFIG_PARAVIRT
+# include <asm/paravirt_types.h>
+#else
+# include <linux/types.h>
+# include <linux/atomic.h>
+#endif
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * For NR_CPUS >= 16K, the bits assignment are:
+ *   Bit  0   : Set if locked
+ *   Bits 1-7 : Not used
+ *   Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bits assignment are:
+ *   Bit   0   : Set if locked
+ *   Bits  1-7 : Not used
+ *   Bits  8-15: Reserved for architecture specific optimization
+ *   Bits 16-31: Queue code
+ */
+typedef struct qspinlock {
+	atomic_t	qlcode;	/* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET		8
+#define _QSPINLOCK_LOCKED	1U
+#define	_QSPINLOCK_LOCK_MASK	0xff
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..f185584 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+	bool
+
+config QUEUE_SPINLOCK
+	def_bool y if ARCH_USE_QUEUE_SPINLOCK
+	depends on SMP && !PARAVIRT_SPINLOCKS
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..e3b3293 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
 obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
 obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..f1a8102
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,373 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 cannot be used as a valid CPU number. We need to add
+ * 1 to it before putting it into the queue code.
+ */
+#define MAX_QNODES		4
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET	_QCODE_OFFSET
+#endif
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. It is retained for future extension where new
+ * fields may be added.
+ */
+struct qnode {
+	u32		 wait;		/* Waiting flag		*/
+	struct qnode	*next;		/* Next queue node addr */
+};
+
+struct qnode_set {
+	struct qnode	nodes[MAX_QNODES];
+	int		node_idx;	/* Current node to use */
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
+
+/**
+ *_queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+			return 1;
+	return 0;
+}
+
+/*
+ ************************************************************************
+ * Inline functions used by the queue_spin_lock_slowpath() function	*
+ * that may get superseded by a more optimized version.			*
+ ************************************************************************
+ */
+
+#ifndef queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value (not used)
+ * Return : != 0 if lock is not available, = 0 if lock is free
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	*qcode = qlcode;
+	return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* queue_get_lock_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+#ifndef queue_encode_qcode
+/**
+ * queue_encode_qcode - Encode the CPU number & node index into a qnode code
+ * @cpu_nr: CPU number
+ * @qn_idx: Queue node index
+ * Return : A qnode code that can be saved into the qspinlock structure
+ */
+static inline u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx)
+{
+	return ((cpu_nr + 1) << (_QCODE_VAL_OFFSET + 2)) |
+		(qn_idx << _QCODE_VAL_OFFSET);
+}
+#endif /* queue_encode_qcode */
+
+#ifndef queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @ocode: Old queue code in the lock [OUT]
+ * @ncode: New queue code to be exchanged
+ * Return: 1 if lock is taken and so can release the queue node, 0 otherwise.
+ */
+static inline int queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
+{
+	ncode |= _QSPINLOCK_LOCKED;	/* Set lock bit */
+
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	*ocode = atomic_xchg(&lock->qlcode, ncode);
+
+	if (likely(*ocode & _QSPINLOCK_LOCKED)) {
+		*ocode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
+		return 0;
+	}
+	/*
+	 * It is possible that we may accidentally steal the lock during
+	 * the unlock-lock transition. If this is the case, we need to either
+	 * release it if not the head of the queue or get the lock and be
+	 * done with it.
+	 */
+	if (*ocode == 0) {
+		u32 qcode;
+
+		/*
+		 * Got the lock since it is at the head of the queue
+		 * Now try to atomically clear the queue code.
+		 */
+		qcode = atomic_cmpxchg(&lock->qlcode, ncode, _QSPINLOCK_LOCKED);
+		/*
+		 * The cmpxchg fails only if one or more tasks are added to
+		 * the queue. In this case, we set the *ocode to -1 to
+		 * indicate that more tasks are on queue.
+		 */
+		if (qcode != ncode)
+			*ocode = -1;
+		return 1;
+	}
+	/*
+	 * Accidentally steal the lock, release the lock and
+	 * let the queue head get it.
+	 */
+	queue_spin_unlock(lock);
+	return 0;
+}
+#endif /* queue_code_xchg */
+
+/*
+ ************************************************************************
+ * Other inline functions needed by the queue_spin_lock_slowpath()	*
+ * function.								*
+ ************************************************************************
+ */
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+	u32 cpu_nr = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+	u8  qn_idx = (qcode >> _QCODE_VAL_OFFSET) & 3;
+
+	return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+/**
+ * get_qnode - Get a queue node address
+ * @qn_idx: Pointer to queue node index [out]
+ * Return : queue node address & queue node index in qn_idx, or NULL if
+ *	    no free queue node available.
+ */
+static inline struct qnode *get_qnode(unsigned int *qn_idx)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+	int i;
+
+	if (unlikely(qset->node_idx >= MAX_QNODES))
+		return NULL;
+	i = qset->node_idx++;
+	*qn_idx = i;
+	return &qset->nodes[i];
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static inline void put_qnode(void)
+{
+	this_cpu_dec(qnset.node_idx);
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+	unsigned int cpu_nr, qn_idx;
+	struct qnode *node, *next;
+	u32 prev_qcode, my_qcode;
+
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = get_qnode(&qn_idx);
+
+	/*
+	 * It should never happen that all the queue nodes are being used.
+	 */
+	BUG_ON(!node);
+
+	/*
+	 * Set up the new cpu code to be exchanged
+	 */
+	my_qcode = queue_encode_qcode(cpu_nr, qn_idx);
+
+	/*
+	 * Initialize the queue node
+	 */
+	node->wait = true;
+	node->next = NULL;
+
+	/*
+	 * The lock may be available at this point, try again if no task was
+	 * waiting in the queue.
+	 */
+	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock))
+		goto release_node;
+
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	if (queue_code_xchg(lock, &prev_qcode, my_qcode)) {
+		/*
+		 * Lock acquired
+		 * A non-zero prev_qcode indicates that there are
+		 * additional CPUs queuing up in the queue.
+		 */
+		if (prev_qcode)
+			goto notify_next;
+		else
+			goto release_node;
+	}
+
+	if (prev_qcode) {
+		/*
+		 * Not at the queue head, get the address of the previous node
+		 * and set up the "next" field of that node.
+		 */
+		struct qnode *prev = xlate_qcode(prev_qcode);
+
+		ACCESS_ONCE(prev->next) = node;
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (smp_load_acquire(&node->wait))
+			arch_mutex_cpu_relax();
+	}
+
+	/*
+	 * At the head of the wait queue now
+	 */
+	while (true) {
+		u32 qcode;
+
+		if (queue_get_lock_qcode(lock, &qcode, my_qcode))
+			;	/* Lock not available yet */
+		else if (qcode != my_qcode) {
+			/*
+			 * Just get the lock with other spinners waiting
+			 * in the queue.
+			 */
+			if (queue_spin_setlock(lock))
+				goto notify_next;
+		} else {
+			/*
+			 * Get the lock & clear the queue code simultaneously
+			 */
+			if (queue_spin_trylock_and_clr_qcode(lock, qcode))
+				/* No need to notify the next one */
+				goto release_node;
+		}
+		arch_mutex_cpu_relax();
+	}
+
+notify_next:
+	/*
+	 * Wait, if needed, until the next one in the queue sets up the next field
+	 */
+	while (!(next = ACCESS_ONCE(node->next)))
+		arch_mutex_cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+	smp_store_release(&next->wait, false);
+
+release_node:
+	put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
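
For readers unfamiliar with MCS-style queuing, here is a standalone sketch
(C11 atomics in place of the kernel's smp_load_acquire()/smp_store_release();
the demo_* names are invented for illustration and are not part of the patch)
of the per-node handoff that the slowpath above performs on the wait flag:

#include <stdatomic.h>
#include <stdbool.h>

struct demo_qnode {
	atomic_bool wait;		/* spun on by the owner of this node */
	struct demo_qnode *next;	/* set by the next waiter in line */
};

/* Waiter side: mirrors "while (smp_load_acquire(&node->wait)) ..." above */
static void demo_wait_for_turn(struct demo_qnode *node)
{
	while (atomic_load_explicit(&node->wait, memory_order_acquire))
		;	/* spin until the previous owner hands over */
}

/* Previous owner: mirrors "smp_store_release(&next->wait, false);" above */
static void demo_pass_turn(struct demo_qnode *next)
{
	atomic_store_explicit(&next->wait, false, memory_order_release);
}

int main(void)
{
	struct demo_qnode a = { .next = NULL }, b = { .next = NULL };

	atomic_init(&a.wait, false);
	atomic_init(&b.wait, true);	/* b is queued behind a */
	a.next = &b;

	demo_pass_turn(a.next);		/* a is done, hand over to b */
	demo_wait_for_turn(&b);		/* returns immediately now */
	return 0;
}
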
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (2 preceding siblings ...)
  2014-03-12 18:54 ` [PATCH v6 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` [PATCH v6 03/11] qspinlock: More optimized code for smaller NR_CPUS Waiman Long
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Rik van Riel,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Boris Ostrovsky,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	linux-kernel, David Vrabel, Andrew

This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket, the benefit of queue
spinlock may not be apparent, so queue spinlock is not enabled there.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of the ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some
x86-specific optimizations which make the queue spinlock code perform
better than the generic implementation.
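
One of those optimizations is the byte-wide store in queue_spin_unlock()
below. As a rough standalone illustration (userspace C, not part of the
patch), the union layout means the lock byte aliases the least significant
byte of the 32-bit lock word on little-endian x86, so clearing that byte
releases the lock without touching the rest of the word:

#include <stdio.h>
#include <stdint.h>

union demo_qspinlock {
	uint32_t qlcode;	/* complete 32-bit lock word */
	uint8_t  lock;		/* lock byte (bit 0 = locked) */
};

int main(void)
{
	union demo_qspinlock l = { .qlcode = 0 };

	l.qlcode |= 1;		/* acquire: set the lock bit */
	printf("locked:   qlcode=%#x lock=%u\n", (unsigned)l.qlcode, (unsigned)l.lock);

	l.lock = 0;		/* release: plain byte store, as queue_spin_unlock() does */
	printf("unlocked: qlcode=%#x\n", (unsigned)l.qlcode);
	return 0;
}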

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/qspinlock.h      |   41 +++++++++++++++++++++++++++++++++
 arch/x86/include/asm/spinlock.h       |    5 ++++
 arch/x86/include/asm/spinlock_types.h |    4 +++
 4 files changed, 51 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..de573f9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
 	depends on 64BIT
 	select X86_DEV_DMA_OPS
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_USE_QUEUE_SPINLOCK
 
 ### Arch settings
 config X86
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 0000000..44cefee
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,41 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+
+#define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
+
+/*
+ * x86-64 specific queue spinlock union structure
+ */
+union arch_qspinlock {
+	struct qspinlock slock;
+	u8		 lock;	/* Lock bit	*/
+};
+
+#define	queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * No special memory barrier other than a compiler one is needed for the
+ * x86 architecture. A compiler barrier is added at the end to make sure
+ * that the clearing the lock bit is done ASAP without artificial delay
+ * due to compiler optimization.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	barrier();
+	ACCESS_ONCE(qlock->lock) = 0;
+	barrier();
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..6e6de1f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -43,6 +43,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -181,6 +185,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 {
 	arch_spin_lock(lock);
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..7960268 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT	(sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 typedef struct arch_spinlock {
 	union {
 		__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include <asm/rwlock.h>
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 03/11] qspinlock: More optimized code for smaller NR_CPUS
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (4 preceding siblings ...)
  2014-03-12 18:54 ` [PATCH v6 03/11] qspinlock: More optimized code for smaller NR_CPUS Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` [PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks Waiman Long
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, linux-kernel, David Vrabel, Andrew Morton, Linu

For architectures that support atomic operations on smaller 8-bit or
16-bit data types, it is possible to simplify the code and produce
slightly better optimized code at the expense of a smaller number of
supported CPUs.

The qspinlock code can support up to a maximum of 4M-1 CPUs. With
less than 16K CPUs, it is possible to squeeze the queue code into a
2-byte short word which can be accessed directly as a 16-bit short
data type. This enables the simplification of the queue code exchange
portion of the slowpath code.

This patch introduces a new macro _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
which can now be defined in an architecture specific qspinlock.h header
file to indicate its support for smaller atomic operation data types.
This macro triggers the replacement of some of the generic functions
by more optimized versions.
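
As a rough sketch of the 16-bit queue code layout described above
(standalone userspace C, not kernel code; the demo_* names are invented):
bits 0-1 hold the per-CPU node index, bits 2-15 hold the CPU number plus 1,
so a value of 0 always means that nobody is queued.

#include <assert.h>
#include <stdint.h>

static uint16_t demo_encode_qcode(uint32_t cpu, uint8_t idx)
{
	return (uint16_t)(((cpu + 1) << 2) | idx);	/* same shape as queue_encode_qcode() */
}

static void demo_decode_qcode(uint16_t qcode, uint32_t *cpu, uint8_t *idx)
{
	*idx = qcode & 3;
	*cpu = (qcode >> 2) - 1;
}

int main(void)
{
	uint32_t cpu;
	uint8_t idx;

	demo_decode_qcode(demo_encode_qcode(5, 2), &cpu, &idx);
	assert(cpu == 5 && idx == 2);
	assert(demo_encode_qcode(0, 0) != 0);	/* even CPU 0 yields a non-zero code */
	return 0;
}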

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/qspinlock.h      |   14 ++++-
 include/asm-generic/qspinlock_types.h |    8 ++-
 kernel/locking/qspinlock.c            |  100 +++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 44cefee..acbe155 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -8,11 +8,23 @@
 #define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
 
 /*
+ * As the qcode will be accessed as a 16-bit word, no offset is needed
+ */
+#define _QCODE_VAL_OFFSET	0
+
+/*
  * x86-64 specific queue spinlock union structure
+ * Besides the slock and lock fields, the other fields are only
+ * valid with less than 16K CPUs.
  */
 union arch_qspinlock {
 	struct qspinlock slock;
-	u8		 lock;	/* Lock bit	*/
+	struct {
+		u8  lock;	/* Lock bit	*/
+		u8  reserved;
+		u16 qcode;	/* Queue code	*/
+	};
+	u32 qlcode;		/* Complete lock word */
 };
 
 #define	queue_spin_unlock queue_spin_unlock
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index df981d0..3a02a9e 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -48,7 +48,13 @@ typedef struct qspinlock {
 	atomic_t	qlcode;	/* Lock + queue code */
 } arch_spinlock_t;
 
-#define _QCODE_OFFSET		8
+#if CONFIG_NR_CPUS >= (1 << 14)
+# define _Q_MANY_CPUS
+# define _QCODE_OFFSET	8
+#else
+# define _QCODE_OFFSET	16
+#endif
+
 #define _QSPINLOCK_LOCKED	1U
 #define	_QSPINLOCK_LOCK_MASK	0xff
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index f1a8102..52d3580 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -62,6 +62,10 @@
  * Bits 0-1 : queue node index (4 nodes)
  * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
  *
+ * The 16-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-15: CPU number + 1   (16K - 1 CPUs)
+ *
  * A queue node code of 0 indicates that no one is waiting for the lock.
  * As the value 0 cannot be used as a valid CPU number. We need to add
  * 1 to it before putting it into the queue code.
@@ -93,6 +97,101 @@ struct qnode_set {
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
 
+/*
+ ************************************************************************
+ * The following optimized codes are for architectures that support:	*
+ *  1) Atomic byte and short data write					*
+ *  2) Byte and short data exchange and compare-exchange instructions	*
+ *									*
+ * For those architectures, their asm/qspinlock.h header file should	*
+ * define the followings in order to use the optimized codes.		*
+ *  1) The _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS macro			*
+ *  2) A "union arch_qspinlock" structure that include the individual	*
+ *     fields of the qspinlock structure, including:			*
+ *      o slock     - the qspinlock structure				*
+ *      o lock      - the lock byte					*
+ *      o qcode     - the queue node code				*
+ *      o qlcode    - the 32-bit qspinlock word				*
+ *									*
+ ************************************************************************
+ */
+#ifdef _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
+#ifndef _Q_MANY_CPUS
+/*
+ * With less than 16K CPUs, the following optimizations are possible with
+ * architectures that allows atomic 8/16 bit operations:
+ *  1) The 16-bit queue code can be accessed or modified directly as a
+ *     16-bit short value without disturbing the first 2 bytes.
+ */
+#define queue_encode_qcode(cpu, idx)	(((cpu) + 1) << 2 | (idx))
+
+#define queue_code_xchg queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @ocode: Old queue code in the lock [OUT]
+ * @ncode: New queue code to be exchanged
+ * Return: 0 is always returned
+ */
+static inline int queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	*ocode = xchg(&qlock->qcode, (u16)ncode);
+	return 0;
+}
+
+#define queue_spin_trylock_and_clr_qcode queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	qcode <<= _QCODE_OFFSET;
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+
+#define queue_get_lock_qcode queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value
+ * Return : != 0 if lock is not available
+ *	     = 0 if lock is free
+ *
+ * It is considered locked when either the lock bit or the wait bit is set.
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	u32 qlcode = (u32)atomic_read(&lock->qlcode);
+
+	*qcode = qlcode >> _QCODE_OFFSET;
+	return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* _Q_MANY_CPUS */
+
+/**
+ * queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	return cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0;
+}
+#else /*  _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS  */
+/*
+ * Generic functions for architectures that do not support atomic
+ * byte or short data types.
+ */
 /**
  *_queue_spin_setlock - try to acquire the lock by setting the lock bit
  * @lock: Pointer to queue spinlock structure
@@ -107,6 +206,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
 			return 1;
 	return 0;
 }
+#endif /* _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS */
 
 /*
  ************************************************************************
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (5 preceding siblings ...)
  2014-03-12 18:54 ` Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 19:08     ` Waiman Long
  2014-03-12 19:08   ` Waiman Long
  2014-03-12 18:54 ` Waiman Long
                   ` (14 subsequent siblings)
  21 siblings, 2 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Rik van Riel,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Boris Ostrovsky,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	linux-kernel, David Vrabel, Andrew

A major problem with the queue spinlock patch is its performance at
low contention levels (2-4 contending tasks), where it is slower than
the corresponding ticket spinlock code. The following table shows the
execution time (in ms) of a micro-benchmark where 5M iterations of
the lock/unlock cycle were run on a 10-core Westmere-EX x86-64 CPU
with 2 different types of loads - standalone (lock and protected data
in different cachelines) and embedded (lock and protected data in
the same cacheline).

		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	  135/111	 135/102	  0%/-8%
       2	 1045/950	1943/2022	+86%/+113%
       3	 1827/1783	2372/2428	+30%/+36%
       4	 2689/2725	2934/2934	 +9%/+8%
       5	 3736/3748	3658/3652	 -2%/-3%
       6	 4942/4984	4434/4428	-10%/-11%
       7	 6304/6319	5176/5163	-18%/-18%
       8	 7736/7629	5955/5944	-23%/-22%

It can be seen that the performance degradation is particularly bad
with 2 and 3 contending tasks. To reduce that performance deficit
at low contention levels, a specific optimized code path
for 2 contending tasks was added. This special code path can only be
activated with fewer than 16K configured CPUs because it uses a byte
in the 32-bit lock word to hold a waiting bit for the 2nd contending
task instead of queuing the waiting task in the queue.

With the change, the performance data became:

		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       2	 1045/950	1120/1045	 +7%/+10%

In a multi-socket server, the optimized code path also seems to
produce a pretty good performance improvement in cross-node contention
traffic at low contention levels. The table below shows the performance
with 1 contending task per node:

		[Standalone]
  # of nodes	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	   135		  135		  0%
       2	  4452		 1736		-61%
       3	 10767		13432		+25%
       4	 20835		10796		-48%

Except for some drop in performance at the 3 contending tasks level,
the queue spinlock performs much better than the ticket spinlock at
the 2 and 4 contending tasks levels.
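
A rough standalone sketch of the 2-task quick path described above
(C11 atomics instead of the kernel primitives; the demo_* names and the
exact steps are simplified for illustration - the real code uses xchg()
on the wait byte and an atomic swap of the combined lock/wait halfword,
and bails out to the MCS queue instead of spinning forever):

#include <stdatomic.h>

#define DEMO_LOCKED	0x001u	/* lock byte */
#define DEMO_WAITING	0x100u	/* waiting byte */

/* Second contender: claim the waiting byte, then convert WAITING into
 * LOCKED in one atomic step once the holder clears the lock byte. */
static void demo_quick_lock(atomic_uint *qlcode)
{
	while (atomic_fetch_or(qlcode, DEMO_WAITING) & DEMO_WAITING)
		;	/* someone else is already the designated waiter */

	for (;;) {
		unsigned int old = DEMO_WAITING;	/* waiting set, lock clear */

		if (atomic_compare_exchange_weak(qlcode, &old, DEMO_LOCKED))
			return;		/* waiting bit cleared, lock taken */
	}
}

static void demo_unlock(atomic_uint *qlcode)
{
	atomic_fetch_and(qlcode, ~DEMO_LOCKED);	/* the patch uses a byte store */
}

int main(void)
{
	atomic_uint qlcode = 0;

	demo_quick_lock(&qlcode);	/* uncontended here: waiting -> locked */
	demo_unlock(&qlcode);
	return 0;
}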

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/qspinlock.h |    3 +-
 kernel/locking/qspinlock.c       |  137 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 136 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index acbe155..7f3129c 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -21,9 +21,10 @@ union arch_qspinlock {
 	struct qspinlock slock;
 	struct {
 		u8  lock;	/* Lock bit	*/
-		u8  reserved;
+		u8  wait;	/* Waiting bit	*/
 		u16 qcode;	/* Queue code	*/
 	};
+	u16 lock_wait;		/* Lock and wait bits */
 	u32 qlcode;		/* Complete lock word */
 };
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 52d3580..0030fad 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -112,6 +112,8 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
  *      o lock      - the lock byte					*
  *      o qcode     - the queue node code				*
  *      o qlcode    - the 32-bit qspinlock word				*
+ *      o wait      - the waiting byte					*
+ *      o lock_wait - the combined lock and waiting bytes		*
  *									*
  ************************************************************************
  */
@@ -122,8 +124,101 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
  * architectures that allows atomic 8/16 bit operations:
  *  1) The 16-bit queue code can be accessed or modified directly as a
  *     16-bit short value without disturbing the first 2 bytes.
+ *  2) The 2nd byte of the 32-bit lock word can be used as a pending bit
+ *     for waiting lock acquirer so that it won't need to go through the
+ *     MCS style locking queuing which has a higher overhead.
  */
+#define _QSPINLOCK_WAIT_SHIFT	8	/* Waiting bit position */
+#define _QSPINLOCK_WAITING	(1 << _QSPINLOCK_WAIT_SHIFT)
+/* Masks for lock & wait bits   */
+#define _QSPINLOCK_LWMASK	(_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED)
+
 #define queue_encode_qcode(cpu, idx)	(((cpu) + 1) << 2 | (idx))
+#define queue_get_qcode(lock)	(atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET)
+
+#define queue_spin_trylock_quick queue_spin_trylock_quick
+/**
+ * queue_spin_trylock_quick - quick spinning on the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Old queue spinlock value
+ * Return: 1 if lock acquired, 0 if failed
+ *
+ * This is an optimized contention path for 2 contending tasks. It
+ * should only be entered if no task is waiting in the queue.
+ */
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	/*
+	 * Fall into the quick spinning code path only if no task is waiting
+	 * in the queue.
+	 */
+	while (likely(!(qsval >> _QCODE_OFFSET))) {
+		if ((qsval & _QSPINLOCK_LWMASK) == _QSPINLOCK_LWMASK) {
+			/*
+			 * Both the lock and wait bits are set, wait a while
+			 * to see if that changes. If not, quit the quick path.
+			 */
+			cpu_relax();
+			cpu_relax();
+			qsval = atomic_read(&lock->qlcode);
+			if ((qsval >> _QCODE_OFFSET) ||
+			   ((qsval & _QSPINLOCK_LWMASK) == _QSPINLOCK_LWMASK))
+				return 0;
+		}
+
+		/*
+		 * Try to set the corresponding waiting bit
+		 */
+		if (xchg(&qlock->wait, _QSPINLOCK_WAITING >> 8)) {
+			/*
+			 * Wait bit was set already, try again after some delay
+			 * as the waiter will probably get the lock & clear
+			 * the wait bit.
+			 *
+			 * There are 2 cpu_relax() calls to make sure that
+			 * the wait is longer than that of the
+			 * smp_load_acquire() loop below.
+			 */
+			arch_mutex_cpu_relax();
+			arch_mutex_cpu_relax();
+			qsval = atomic_read(&lock->qlcode);
+			continue;
+		}
+
+		/*
+		 * Now wait until the lock bit is cleared
+		 */
+		while (smp_load_acquire(&qlock->qlcode) & _QSPINLOCK_LOCKED)
+			arch_mutex_cpu_relax();
+
+		/*
+		 * Set the lock bit & clear the waiting bit simultaneously
+		 * It is assumed that there is no lock stealing with this
+		 * quick path active.
+		 *
+		 * A direct memory store of _QSPINLOCK_LOCKED into the
+		 * lock_wait field causes a problem with the lockref code, e.g.
+		 *   ACCESS_ONCE(qlock->lock_wait) = _QSPINLOCK_LOCKED;
+		 *
+		 * It is not currently clear why this happens. A workaround
+		 * is to use atomic instruction to store the new value.
+		 */
+		{
+			u16 lw = xchg(&qlock->lock_wait, _QSPINLOCK_LOCKED);
+			BUG_ON(lw != _QSPINLOCK_WAITING);
+		}
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * With the qspinlock quickpath logic activated, disable the trylock logic
+ * in the slowpath as it will be redundant.
+ */
+#define queue_spin_trylock(lock)	(0)
 
 #define queue_code_xchg queue_code_xchg
 /**
@@ -131,13 +226,40 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
  * @lock : Pointer to queue spinlock structure
  * @ocode: Old queue code in the lock [OUT]
  * @ncode: New queue code to be exchanged
- * Return: 0 is always returned
+ * Return: 1 if lock is taken and so can release the queue node, 0 otherwise.
  */
 static inline int queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
 {
 	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 
 	*ocode = xchg(&qlock->qcode, (u16)ncode);
+	if (*ocode == 0) {
+		/*
+		 * When no one was waiting in the queue before, try to fall
+		 * back into the optimized 2-task contending code path.
+		 */
+		u32 qlcode = ACCESS_ONCE(qlock->qlcode);
+
+		if ((qlcode != ((ncode << _QCODE_OFFSET)|_QSPINLOCK_LOCKED)) ||
+		    (cmpxchg(&qlock->qlcode, qlcode,
+			     _QSPINLOCK_LOCKED|_QSPINLOCK_WAITING) != qlcode))
+			return 0;
+retry_lock:
+		/*
+		 * Successfully fall back to the optimized code path.
+		 * Now wait until the lock byte is cleared
+		 */
+		while (smp_load_acquire(&qlock->qlcode) & _QSPINLOCK_LOCKED)
+			arch_mutex_cpu_relax();
+		/*
+		 * Use cmpxchg to set the lock bit & clear the waiting bit
+		 */
+		if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
+			    _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+			return 1;	/* Got the lock */
+		arch_mutex_cpu_relax();
+		goto retry_lock;
+	}
 	return 0;
 }
 
@@ -172,7 +294,7 @@ queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
 	u32 qlcode = (u32)atomic_read(&lock->qlcode);
 
 	*qcode = qlcode >> _QCODE_OFFSET;
-	return qlcode & _QSPINLOCK_LOCKED;
+	return qlcode & _QSPINLOCK_LWMASK;
 }
 #endif /* _Q_MANY_CPUS */
 
@@ -185,7 +307,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
 {
 	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 
-	return cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0;
+	return cmpxchg(&qlock->lock_wait, 0, _QSPINLOCK_LOCKED) == 0;
 }
 #else /*  _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS  */
 /*
@@ -214,6 +336,10 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
  * that may get superseded by a more optimized version.			*
  ************************************************************************
  */
+#ifndef queue_spin_trylock_quick
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{ return 0; }
+#endif
 
 #ifndef queue_get_lock_qcode
 /**
@@ -372,6 +498,11 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 	u32 prev_qcode, my_qcode;
 
 	/*
+	 * Try the quick spinning code path
+	 */
+	if (queue_spin_trylock_quick(lock, qsval))
+		return;
+	/*
 	 * Get the queue node
 	 */
 	cpu_nr = smp_processor_id();
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (7 preceding siblings ...)
  2014-03-12 18:54 ` Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-13 10:54   ` David Vrabel
                     ` (3 more replies)
  2014-03-12 18:54 ` Waiman Long
                   ` (12 subsequent siblings)
  21 siblings, 4 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Rik van Riel,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Boris Ostrovsky,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	linux-kernel, David Vrabel, Andrew

Locking is always an issue in a virtualized environment as the virtual
CPU that is waiting on a lock may get scheduled out and hence block
any progress in lock acquisition even when the lock has been freed.

One solution to this problem is to allow unfair locks in a
para-virtualized environment. In this case, a new lock acquirer can
come and steal the lock if the next-in-line CPU to get the lock is
scheduled out. An unfair lock in a native environment is generally not a
good idea as there is a possibility of lock starvation for a heavily
contended lock.

This patch adds a new configuration option for the x86
architecture to enable the use of unfair queue spinlock
(PARAVIRT_UNFAIR_LOCKS) in a real para-virtualized guest. A jump label
(paravirt_unfairlocks_enabled) is used to switch between a fair and
an unfair version of the spinlock code. This jump label will only be
enabled in a real PV guest.

Enabling this configuration feature causes a slight decrease in the
performance of an uncontended lock-unlock operation of about 1-2%,
mainly due to the use of a static key. However, uncontended lock-unlock
operations are really just a tiny percentage of a real workload, so
there should be no noticeable change in application performance.

With the unfair locking activated on a bare-metal 4-socket Westmere-EX
box, the execution times (in ms) of a spinlock micro-benchmark were
as follows:

  # of    Ticket       Fair	    Unfair
  tasks    lock     queue lock    queue lock
  ------  -------   ----------    ----------
    1       135        135	     137
    2      1045       1120	     747
    3      1827       2345     	    1084
    4      2689       2934	    1438
    5      3736       3658	    1722
    6      4942       4434	    2092
    7      6304       5176          2245
    8      7736       5955          2388

Executing one task per node, the performance data were:

  # of    Ticket       Fair	    Unfair
  nodes    lock     queue lock    queue lock
  ------  -------   ----------    ----------
    1        135        135          137
    2       4452       1736         1178
    3      10767      13432         1933
    4      20835      10796         2596

Of course there are pretty big variations in the execution times
of the individual tasks. For the 4-node case above, the standard
deviation was 209ms.

In general, the shorter the critical section, the greater the
performance benefit of an unfair lock. For large critical sections,
however, there may not be much benefit.
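
A minimal standalone sketch of the lock-stealing idea described above
(userspace C11 atomics; the demo_* names are illustrative, not the kernel
API): any new acquirer may take the lock with a compare-and-swap on the
lock byte whenever it happens to be free, regardless of who is queued
behind it, and only falls back to the queue when the steal fails.

#include <stdatomic.h>
#include <stdbool.h>

static bool demo_trylock_unfair(atomic_uchar *lock_byte)
{
	unsigned char unlocked = 0;

	/* Steal attempt: succeeds whenever the lock byte is 0, even if
	 * other CPUs are already waiting in the queue. */
	return atomic_compare_exchange_strong(lock_byte, &unlocked, 1);
}

static void demo_lock_unfair(atomic_uchar *lock_byte)
{
	while (!demo_trylock_unfair(lock_byte))
		;	/* the real code falls back to the queue slowpath here */
}

static void demo_unlock(atomic_uchar *lock_byte)
{
	atomic_store_explicit(lock_byte, 0, memory_order_release);
}

int main(void)
{
	atomic_uchar lock_byte = 0;

	demo_lock_unfair(&lock_byte);
	demo_unlock(&lock_byte);
	return 0;
}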

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/Kconfig                     |   11 +++++
 arch/x86/include/asm/qspinlock.h     |   72 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/Makefile             |    1 +
 arch/x86/kernel/paravirt-spinlocks.c |    7 +++
 4 files changed, 91 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index de573f9..010abc4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -629,6 +629,17 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer Y.
 
+config PARAVIRT_UNFAIR_LOCKS
+	bool "Enable unfair locks in a para-virtualized guest"
+	depends on PARAVIRT && SMP && QUEUE_SPINLOCK
+	depends on !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE
+	---help---
+	  This changes the kernel to use unfair locks in a
+	  para-virtualized guest. This will help performance in most
+	  cases. However, there is a possibility of lock starvation
+	  on a heavily contended lock especially in a large guest
+	  with many virtual CPUs.
+
 source "arch/x86/xen/Kconfig"
 
 config KVM_GUEST
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 7f3129c..0e6740a 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -51,4 +51,76 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 
 #include <asm-generic/qspinlock.h>
 
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+/**
+ * queue_spin_lock_unfair - acquire a queue spinlock unfairly
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	if (likely(cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return;
+	/*
+	 * Since the lock is now unfair, we should not activate the 2-task
+	 * quick spinning code path which disallows lock stealing.
+	 */
+	queue_spin_lock_slowpath(lock, -1);
+}
+
+/**
+ * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	if (!qlock->lock && (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/*
+ * Redefine arch_spin_lock and arch_spin_trylock as inline functions that will
+ * jump to the unfair versions if the static key paravirt_unfairlocks_enabled
+ * is true.
+ */
+#undef arch_spin_lock
+#undef arch_spin_trylock
+#undef arch_spin_lock_flags
+
+extern struct static_key paravirt_unfairlocks_enabled;
+
+/**
+ * arch_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static inline void arch_spin_lock(struct qspinlock *lock)
+{
+	if (static_key_false(&paravirt_unfairlocks_enabled))
+		queue_spin_lock_unfair(lock);
+	else
+		queue_spin_lock(lock);
+}
+
+/**
+ * arch_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int arch_spin_trylock(struct qspinlock *lock)
+{
+	if (static_key_false(&paravirt_unfairlocks_enabled))
+		return queue_spin_trylock_unfair(lock);
+	else
+		return queue_spin_trylock(lock);
+}
+
+#define arch_spin_lock_flags(l, f)	arch_spin_lock(l)
+
+#endif /* CONFIG_PARAVIRT_UNFAIR_LOCKS */
+
 #endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cb648c8..1107a20 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -88,6 +88,7 @@ obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
 obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
+obj-$(CONFIG_PARAVIRT_UNFAIR_LOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index bbb6c73..a50032a 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -8,6 +8,7 @@
 
 #include <asm/paravirt.h>
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
 	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
@@ -18,3 +19,9 @@ EXPORT_SYMBOL(pv_lock_ops);
 
 struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
 EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+#endif
+
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+struct static_key paravirt_unfairlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_unfairlocks_enabled);
+#endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (8 preceding siblings ...)
  2014-03-12 18:54 ` [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` [PATCH v6 06/11] pvqspinlock, x86: Allow unfair queue spinlock in a KVM guest Waiman Long
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, linux-kernel, David Vrabel, Andrew Morton, Linu

Locking is always an issue in a virtualized environment as the virtual
CPU that is waiting on a lock may get scheduled out and hence block
any progress in lock acquisition even when the lock has been freed.

One solution to this problem is to allow unfair locks in a
para-virtualized environment. In this case, a new lock acquirer can
come in and steal the lock if the next-in-line CPU to get the lock has
been scheduled out. An unfair lock in a native environment is
generally not a good idea, as there is a possibility of lock
starvation for a heavily contended lock.
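
As an illustration only (a condensed sketch of the x86 code added
below, not an exact copy), the unfair path boils down to an
opportunistic cmpxchg on the lock byte before falling back to the
normal queuing slowpath:

	static inline void arch_spin_lock(struct qspinlock *lock)
	{
		union arch_qspinlock *qlock = (union arch_qspinlock *)lock;

		if (static_key_false(&paravirt_unfairlocks_enabled)) {
			/* PV guest: try to steal the lock first */
			if (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0)
				return;
			/* -1 disables the 2-task quick spinning path */
			queue_spin_lock_slowpath(lock, -1);
		} else {
			/* native: strictly FIFO queue spinlock */
			queue_spin_lock(lock);
		}
	}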

This patch adds a new configuration option for the x86
architecture to enable the use of unfair queue spinlocks
(PARAVIRT_UNFAIR_LOCKS) in a real para-virtualized guest. A jump label
(paravirt_unfairlocks_enabled) is used to switch between the fair and
the unfair version of the spinlock code. This jump label will only be
enabled in a real PV guest.

Enabling this configuration option causes a slight decrease in the
performance of an uncontended lock-unlock operation of about 1-2%,
mainly due to the use of a static key. However, uncontended
lock-unlock operations are really just a tiny percentage of a real
workload, so there should be no noticeable change in application
performance.

With unfair locking activated on a bare-metal 4-socket Westmere-EX
box, the execution times (in ms) of a spinlock micro-benchmark were
as follows:

  # of    Ticket       Fair        Unfair
  tasks    lock     queue lock    queue lock
  ------  -------   ----------    ----------
    1       135        135           137
    2      1045       1120           747
    3      1827       2345          1084
    4      2689       2934          1438
    5      3736       3658          1722
    6      4942       4434          2092
    7      6304       5176          2245
    8      7736       5955          2388

Executing one task per node, the performance data were:

  # of    Ticket       Fair        Unfair
  nodes    lock     queue lock    queue lock
  ------  -------   ----------    ----------
    1        135        135          137
    2       4452       1736         1178
    3      10767      13432         1933
    4      20835      10796         2596

Of course, there is a pretty big variation in the execution times
of the individual tasks. For the 4-node case above, the standard
deviation was 209ms.

In general, the shorter the critical section, the greater the
performance benefit of an unfair lock. For large critical sections,
however, there may not be much benefit.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/Kconfig                     |   11 +++++
 arch/x86/include/asm/qspinlock.h     |   72 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/Makefile             |    1 +
 arch/x86/kernel/paravirt-spinlocks.c |    7 +++
 4 files changed, 91 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index de573f9..010abc4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -629,6 +629,17 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer Y.
 
+config PARAVIRT_UNFAIR_LOCKS
+	bool "Enable unfair locks in a para-virtualized guest"
+	depends on PARAVIRT && SMP && QUEUE_SPINLOCK
+	depends on !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE
+	---help---
+	  This changes the kernel to use unfair locks in a
+	  para-virtualized guest. This will help performance in most
+	  cases. However, there is a possibility of lock starvation
+	  on a heavily contended lock especially in a large guest
+	  with many virtual CPUs.
+
 source "arch/x86/xen/Kconfig"
 
 config KVM_GUEST
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 7f3129c..0e6740a 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -51,4 +51,76 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 
 #include <asm-generic/qspinlock.h>
 
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+/**
+ * queue_spin_lock_unfair - acquire a queue spinlock unfairly
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	if (likely(cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return;
+	/*
+	 * Since the lock is now unfair, we should not activate the 2-task
+	 * quick spinning code path which disallows lock stealing.
+	 */
+	queue_spin_lock_slowpath(lock, -1);
+}
+
+/**
+ * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	if (!qlock->lock && (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/*
+ * Redefine arch_spin_lock and arch_spin_trylock as inline functions that will
+ * jump to the unfair versions if the static key paravirt_unfairlocks_enabled
+ * is true.
+ */
+#undef arch_spin_lock
+#undef arch_spin_trylock
+#undef arch_spin_lock_flags
+
+extern struct static_key paravirt_unfairlocks_enabled;
+
+/**
+ * arch_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static inline void arch_spin_lock(struct qspinlock *lock)
+{
+	if (static_key_false(&paravirt_unfairlocks_enabled))
+		queue_spin_lock_unfair(lock);
+	else
+		queue_spin_lock(lock);
+}
+
+/**
+ * arch_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int arch_spin_trylock(struct qspinlock *lock)
+{
+	if (static_key_false(&paravirt_unfairlocks_enabled))
+		return queue_spin_trylock_unfair(lock);
+	else
+		return queue_spin_trylock(lock);
+}
+
+#define arch_spin_lock_flags(l, f)	arch_spin_lock(l)
+
+#endif /* CONFIG_PARAVIRT_UNFAIR_LOCKS */
+
 #endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cb648c8..1107a20 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -88,6 +88,7 @@ obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
 obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
+obj-$(CONFIG_PARAVIRT_UNFAIR_LOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index bbb6c73..a50032a 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -8,6 +8,7 @@
 
 #include <asm/paravirt.h>
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
 	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
@@ -18,3 +19,9 @@ EXPORT_SYMBOL(pv_lock_ops);
 
 struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
 EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+#endif
+
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+struct static_key paravirt_unfairlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_unfairlocks_enabled);
+#endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 06/11] pvqspinlock, x86: Allow unfair queue spinlock in a KVM guest
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (10 preceding siblings ...)
  2014-03-12 18:54 ` [PATCH v6 06/11] pvqspinlock, x86: Allow unfair queue spinlock in a KVM guest Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` [PATCH v6 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest Waiman Long
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Rik van Riel,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Boris Ostrovsky,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	linux-kernel, David Vrabel, Andrew

This patch adds a KVM init function to activate the unfair queue
spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config
option is selected.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/kernel/kvm.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 713f1b3..a489140 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void)
 early_initcall(kvm_spinlock_init_jump);
 
 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
+
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+/*
+ * Enable unfair lock if running in a real para-virtualized environment
+ */
+static __init int kvm_unfair_locks_init_jump(void)
+{
+	if (!kvm_para_available())
+		return 0;
+
+	static_key_slow_inc(&paravirt_unfairlocks_enabled);
+	printk(KERN_INFO "KVM setup unfair spinlock\n");
+
+	return 0;
+}
+early_initcall(kvm_unfair_locks_init_jump);
+#endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 06/11] pvqspinlock, x86: Allow unfair queue spinlock in a KVM guest
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (9 preceding siblings ...)
  2014-03-12 18:54 ` Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` Waiman Long
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, linux-kernel, David Vrabel, Andrew Morton, Linu

This patch adds a KVM init function to activate the unfair queue
spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config
option is selected.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/kernel/kvm.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 713f1b3..a489140 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void)
 early_initcall(kvm_spinlock_init_jump);
 
 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
+
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+/*
+ * Enable unfair lock if running in a real para-virtualized environment
+ */
+static __init int kvm_unfair_locks_init_jump(void)
+{
+	if (!kvm_para_available())
+		return 0;
+
+	static_key_slow_inc(&paravirt_unfairlocks_enabled);
+	printk(KERN_INFO "KVM setup unfair spinlock\n");
+
+	return 0;
+}
+early_initcall(kvm_unfair_locks_init_jump);
+#endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (12 preceding siblings ...)
  2014-03-12 18:54 ` [PATCH v6 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` [PATCH v6 08/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled Waiman Long
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Rik van Riel,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Boris Ostrovsky,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	linux-kernel, David Vrabel, Andrew

This patch adds a XEN init function to activate the unfair queue
spinlock in a XEN guest when the PARAVIRT_UNFAIR_LOCKS kernel config
option is selected.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/xen/setup.c |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 0982233..66bb6f5 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -625,3 +625,22 @@ void __init xen_arch_setup(void)
 	numa_off = 1;
 #endif
 }
+
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+/*
+ * Enable unfair lock if running in a Xen guest
+ */
+static __init int xen_unfair_locks_init_jump(void)
+{
+	/*
+	 * Disable unfair lock if not running in a PV domain
+	 */
+	if (!xen_pv_domain())
+		return 0;
+
+	static_key_slow_inc(&paravirt_unfairlocks_enabled);
+
+	return 0;
+}
+early_initcall(xen_unfair_locks_init_jump);
+#endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (11 preceding siblings ...)
  2014-03-12 18:54 ` Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` Waiman Long
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, linux-kernel, David Vrabel, Andrew Morton, Linu

This patch adds a XEN init function to activate the unfair queue
spinlock in a XEN guest when the PARAVIRT_UNFAIR_LOCKS kernel config
option is selected.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/xen/setup.c |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 0982233..66bb6f5 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -625,3 +625,22 @@ void __init xen_arch_setup(void)
 	numa_off = 1;
 #endif
 }
+
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+/*
+ * Enable unfair lock if running in a Xen guest
+ */
+static __init int xen_unfair_locks_init_jump(void)
+{
+	/*
+	 * Disable unfair lock if not running in a PV domain
+	 */
+	if (!xen_pv_domain())
+		return 0;
+
+	static_key_slow_inc(&paravirt_unfairlocks_enabled);
+
+	return 0;
+}
+early_initcall(xen_unfair_locks_init_jump);
+#endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 08/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (13 preceding siblings ...)
  2014-03-12 18:54 ` Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` Waiman Long
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, linux-kernel, David Vrabel, Andrew Morton, Linu

This patch renames the paravirt_ticketlocks_enabled static key to a
more generic paravirt_spinlocks_enabled name.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/spinlock.h      |    4 ++--
 arch/x86/kernel/kvm.c                |    2 +-
 arch/x86/kernel/paravirt-spinlocks.c |    4 ++--
 arch/x86/xen/spinlock.c              |    2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 6e6de1f..283f2cf 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -40,7 +40,7 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD	(1 << 15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
+extern struct static_key paravirt_spinlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
 #ifdef CONFIG_QUEUE_SPINLOCK
@@ -151,7 +151,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
 	if (TICKET_SLOWPATH_FLAG &&
-	    static_key_false(&paravirt_ticketlocks_enabled)) {
+	    static_key_false(&paravirt_spinlocks_enabled)) {
 		arch_spinlock_t prev;
 
 		prev = *lock;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a489140..f318e78 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -818,7 +818,7 @@ static __init int kvm_spinlock_init_jump(void)
 	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
 		return 0;
 
-	static_key_slow_inc(&paravirt_ticketlocks_enabled);
+	static_key_slow_inc(&paravirt_spinlocks_enabled);
 	printk(KERN_INFO "KVM setup paravirtual spinlock\n");
 
 	return 0;
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index a50032a..8c67cbe 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -17,8 +17,8 @@ struct pv_lock_ops pv_lock_ops = {
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_spinlocks_enabled);
 #endif
 
 #ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 581521c..06f4a64 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -290,7 +290,7 @@ static __init int xen_init_spinlocks_jump(void)
 	if (!xen_pvspin)
 		return 0;
 
-	static_key_slow_inc(&paravirt_ticketlocks_enabled);
+	static_key_slow_inc(&paravirt_spinlocks_enabled);
 	return 0;
 }
 early_initcall(xen_init_spinlocks_jump);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH v6 08/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (14 preceding siblings ...)
  2014-03-12 18:54 ` [PATCH v6 08/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support Waiman Long
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Rik van Riel,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Boris Ostrovsky,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	linux-kernel, David Vrabel, Andrew

This patch renames the paravirt_ticketlocks_enabled static key to a
more generic paravirt_spinlocks_enabled name.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/spinlock.h      |    4 ++--
 arch/x86/kernel/kvm.c                |    2 +-
 arch/x86/kernel/paravirt-spinlocks.c |    4 ++--
 arch/x86/xen/spinlock.c              |    2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 6e6de1f..283f2cf 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -40,7 +40,7 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD	(1 << 15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
+extern struct static_key paravirt_spinlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
 #ifdef CONFIG_QUEUE_SPINLOCK
@@ -151,7 +151,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
 	if (TICKET_SLOWPATH_FLAG &&
-	    static_key_false(&paravirt_ticketlocks_enabled)) {
+	    static_key_false(&paravirt_spinlocks_enabled)) {
 		arch_spinlock_t prev;
 
 		prev = *lock;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a489140..f318e78 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -818,7 +818,7 @@ static __init int kvm_spinlock_init_jump(void)
 	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
 		return 0;
 
-	static_key_slow_inc(&paravirt_ticketlocks_enabled);
+	static_key_slow_inc(&paravirt_spinlocks_enabled);
 	printk(KERN_INFO "KVM setup paravirtual spinlock\n");
 
 	return 0;
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index a50032a..8c67cbe 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -17,8 +17,8 @@ struct pv_lock_ops pv_lock_ops = {
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_spinlocks_enabled);
 #endif
 
 #ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 581521c..06f4a64 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -290,7 +290,7 @@ static __init int xen_init_spinlocks_jump(void)
 	if (!xen_pvspin)
 		return 0;
 
-	static_key_slow_inc(&paravirt_ticketlocks_enabled);
+	static_key_slow_inc(&paravirt_spinlocks_enabled);
 	return 0;
 }
 early_initcall(xen_init_spinlocks_jump);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (16 preceding siblings ...)
  2014-03-12 18:54 ` [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-13 11:21     ` David Vrabel
  2014-03-13 11:21   ` David Vrabel
  2014-03-12 18:54 ` [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM Waiman Long
                   ` (3 subsequent siblings)
  21 siblings, 2 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, linux-kernel, David Vrabel, Andrew Morton, Linu

This patch adds para-virtualization support to the queue spinlock in
the same way as was done in the PV ticket lock code. In essence, the
lock waiters will spin for a specified number of times (QSPIN_THRESHOLD
= 2^14) and then halt themselves. The queue head waiter will spin
2*QSPIN_THRESHOLD times before halting itself. When it has spun
QSPIN_THRESHOLD times, the queue head will assume that the lock
holder may have been scheduled out and attempt to kick the lock holder
CPU if it has the CPU number on hand. Before halting, the queue head
waiter will set a flag (_QSPINLOCK_LOCKED_SLOWPATH) in the lock byte
to indicate that the unlock slowpath has to be invoked.

In the unlock slowpath, the current lock holder will find the queue
head by following the previous-node pointer links stored in the
queue node structure until it finds one that has the wait flag turned
off. It then attempts to kick the CPU of the queue head.

After the queue head has acquired the lock, it will also check the
status of the next node and set _QSPINLOCK_LOCKED_SLOWPATH if that
node has been halted.
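
The following pseudo-code (condensed from the real code further down;
lock_is_free() and holder_cpu stand in for the checks done there, and
the CPU state tracking is left out) summarizes the waiting and
unlocking protocol:

	/* Queue head waiter, see pv_head_spin_check() */
	while (!lock_is_free(lock)) {
		if (++count == QSPIN_THRESHOLD && holder_cpu_known)
			__queue_kick_cpu(holder_cpu, PV_KICK_LOCK_HOLDER);
		if (count >= 2*QSPIN_THRESHOLD) {
			/* Ask the lock holder to take the unlock slowpath */
			cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED,
				_QSPINLOCK_LOCKED_SLOWPATH);
			__queue_hibernate();	/* halt until kicked */
			count = 0;
		}
	}

	/* Lock holder unlock slowpath, see queue_spin_unlock_slowpath() */
	node = xlate_qcode(queue_get_qcode(lock));	/* queue tail */
	while (ACCESS_ONCE(node->wait)) {		/* walk back to head */
		prev = pv_get_prev(&node->pv);
		if (prev)
			node = prev;
	}
	__queue_spin_unlock(lock);
	pv_kick_node(&node->pv);	/* wake up the halted queue head */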

Enabling the PV code does have a performance impact on spinlock
acquisitions and releases. The following table shows the execution
time (in ms) of a spinlock micro-benchmark that does lock/unlock
operations 5M times for each task versus the number of contending
tasks on a Westmere-EX system.

  # of        Ticket lock              Queue lock
  tasks   PV off/PV on/%Change   PV off/PV on/%Change
  ------  --------------------   ---------------------
    1        135/  179/+33%         137/  169/+23%
    2       1045/ 1103/ +6%        1120/ 1536/+37%
    3       1827/ 2683/+47%        2313/ 2425/ +5%
    4       2689/ 4191/+56%        2914/ 3128/ +7%
    5       3736/ 5830/+56%        3715/ 3762/ +1%
    6       4942/ 7609/+54%        4504/ 4558/ +2%
    7       6304/ 9570/+52%        5292/ 5351/ +1%
    8       7736/11323/+46%        6037/ 6097/ +1%

It can be seen that the ticket lock PV code has a fairly big decrease
in performance when there are 3 or more contending tasks. The
queue spinlock PV code, on the other hand, only has a minor drop in
performance for 3 or more contending tasks.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/paravirt.h       |   12 ++-
 arch/x86/include/asm/paravirt_types.h |   12 ++
 arch/x86/include/asm/pvqspinlock.h    |  232 +++++++++++++++++++++++++++++++++
 arch/x86/include/asm/qspinlock.h      |   35 +++++
 arch/x86/kernel/paravirt-spinlocks.c  |    5 +
 kernel/locking/qspinlock.c            |   96 ++++++++++++++-
 6 files changed, 390 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/pvqspinlock.h

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cd6e161..cabc37a 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -711,7 +711,17 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 }
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#ifdef CONFIG_QUEUE_SPINLOCK
+static __always_inline void __queue_kick_cpu(int cpu, enum pv_kick_type type)
+{
+	PVOP_VCALL2(pv_lock_ops.kick_cpu, cpu, type);
+}
 
+static __always_inline void __queue_hibernate(void)
+{
+	PVOP_VCALL0(pv_lock_ops.hibernate);
+}
+#else
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
 							__ticket_t ticket)
 {
@@ -723,7 +733,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
 {
 	PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
-
+#endif
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..fa16aa6 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,21 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+enum pv_kick_type {
+	PV_KICK_LOCK_HOLDER,
+	PV_KICK_QUEUE_HEAD
+};
+#endif
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+	void (*kick_cpu)(int cpu, enum pv_kick_type);
+	void (*hibernate)(void);
+#else
 	struct paravirt_callee_save lock_spinning;
 	void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h
new file mode 100644
index 0000000..13cbc4f
--- /dev/null
+++ b/arch/x86/include/asm/pvqspinlock.h
@@ -0,0 +1,232 @@
+#ifndef _ASM_X86_PVQSPINLOCK_H
+#define _ASM_X86_PVQSPINLOCK_H
+
+/*
+ *	Queue Spinlock Para-Virtualization (PV) Support
+ *
+ *	+------+	    +-----+ nxtcpu_p1  +----+
+ *	| Lock |	    |Queue|----------->|Next|
+ *	|Holder|<-----------|Head |<-----------|Node|
+ *	+------+ prev_qcode +-----+ prev_qcode +----+
+ *
+ * As long as the current lock holder passes through the slowpath, the queue
+ * head CPU will have its CPU number stored in prev_qcode. The situation is
+ * the same for the node next to the queue head.
+ *
+ * The next node, while setting up the next pointer in the queue head, can
+ * also store its CPU number in that node. With that change, the queue head
+ * will have the CPU numbers of both its upstream and downstream neighbors.
+ *
+ * The PV support code for queue spinlock is roughly the same as that
+ * of the ticket spinlock. Each CPU waiting for the lock will spin until it
+ * reaches a threshold. When that happens, it will put itself to halt so
+ * that the hypervisor can reuse the CPU in some other guests.
+ *
+ * The differences between the two versions of PV support are:
+ * 1) The queue head will spin twice as long as the other nodes before it
+ *    puts itself to halt.
+ * 2) The queue head will also attempt to kick the lock holder, if it has
+ *    the CPU number, in the half way point.
+ */
+
+/*
+ * Spin threshold for queue spinlock
+ * This is half of the ticket lock's SPIN_THRESHOLD. The queue head will
+ * be halted after 2*QSPIN_THRESHOLD whereas the other nodes will be
+ * halted after QSPIN_THRESHOLD.
+ */
+#define	QSPIN_THRESHOLD	(1U<<14)
+
+/*
+ * PV macros
+ */
+#define PV_SET_VAR(type, var, val)	type var = val
+#define PV_VAR(var)			var
+
+/*
+ * CPU state flags
+ */
+#define	PV_CPU_ACTIVE	1	/* This CPU is active		 */
+#define	PV_CPU_KICKING	2	/* This CPU is kicking other CPU */
+#define PV_CPU_KICKED   3	/* This CPU is being kicked	 */
+#define PV_CPU_HALTED	-1	/* This CPU is halted		 */
+
+/*
+ * Additional fields to be added to the qnode structure
+ */
+#if CONFIG_NR_CPUS >= (1 << 16)
+#define _cpuid_t	u32
+#else
+#define _cpuid_t	u16
+#endif
+
+struct qnode;
+
+struct pv_qvars {
+	s16	      cpustate;		/* CPU status flag		*/
+	_cpuid_t      nxtcpu_p1;	/* CPU number of next node + 1	*/
+	_cpuid_t      mycpu;		/* CPU number of this node	*/
+	struct qnode *prev;		/* Pointer to previous node	*/
+};
+
+/**
+ * pv_init_vars - initialize fields in struct pv_qvars
+ * @pv : pointer to struct pv_qvars
+ * @cpu: current CPU number
+ */
+static __always_inline void pv_init_vars(struct pv_qvars *pv, int cpu)
+{
+	pv->cpustate  = PV_CPU_ACTIVE;
+	pv->prev      = NULL;
+	pv->nxtcpu_p1 = 0;
+	pv->mycpu     = cpu;
+}
+
+/**
+ * pv_head_spin_check - perform para-virtualization checks for queue head
+ * @pv    : pointer to struct pv_qvars
+ * @count : loop count
+ * @qcode : queue code of the supposed lock holder
+ * @lock  : pointer to the qspinlock structure
+ *
+ * The following checks will be done:
+ * 1) Attempt to kick the lock holder, if known, after QSPIN_THRESHOLD
+ * 2) Halt itself if lock is still not available after 2*QSPIN_THRESHOLD
+ */
+static __always_inline void pv_head_spin_check(struct pv_qvars *pv, int *count,
+				u32 qcode, struct qspinlock *lock)
+{
+	if (!static_key_false(&paravirt_spinlocks_enabled))
+		return;
+	if (unlikely((++(*count) == QSPIN_THRESHOLD) && qcode)) {
+		/*
+		 * Get the CPU number of the lock holder & kick it
+		 * The lock may have been stealed by another CPU
+		 * if PARAVIRT_UNFAIR_LOCKS is set, so the computed
+		 * CPU number may not be the actual lock holder.
+		 */
+		int cpu = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKING;
+		__queue_kick_cpu(cpu, PV_KICK_LOCK_HOLDER);
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+	}
+	if (unlikely(*count >= 2*QSPIN_THRESHOLD)) {
+		u8 lockval;
+
+		/*
+		 * Set the lock byte to _QSPINLOCK_LOCKED_SLOWPATH before
+		 * trying to hibernate itself. It is possible that the
+		 * lock byte had been set to _QSPINLOCK_LOCKED_SLOWPATH
+		 * already. In this case, just proceeds to sleeping.
+		 */
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
+		lockval = cmpxchg(&((union arch_qspinlock *)lock)->lock,
+			  _QSPINLOCK_LOCKED, _QSPINLOCK_LOCKED_SLOWPATH);
+		if (lockval == 0) {
+			/*
+			 * Can exit now as the lock is free
+			 */
+			ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+			*count = 0;
+			return;
+		}
+		__queue_hibernate();
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+		*count = 0;	/* Reset count */
+	}
+}
+
+/**
+ * pv_queue_spin_check - perform para-virtualization checks for queue member
+ * @pv   : pointer to struct pv_qvars
+ * @count: loop count
+ */
+static __always_inline void pv_queue_spin_check(struct pv_qvars *pv, int *count)
+{
+	if (!static_key_false(&paravirt_spinlocks_enabled))
+		return;
+	/*
+	 * Attempt to halt oneself after QSPIN_THRESHOLD spins
+	 */
+	if (unlikely(++(*count) >= QSPIN_THRESHOLD)) {
+		/*
+		 * Time to hibernate itself
+		 */
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
+		__queue_hibernate();
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+		*count = 0;	/* Reset count */
+	}
+}
+
+/**
+ * pv_next_node_check - set _QSPINLOCK_LOCKED_SLOWPATH flag if the next node
+ *			is halted
+ * @pv   : pointer to struct pv_qvars
+ * @count: loop count
+ *
+ * The current CPU should have gotten the lock before calling this function.
+ */
+static __always_inline void
+pv_next_node_check(struct pv_qvars *pv, struct qspinlock *lock)
+{
+	if (!static_key_false(&paravirt_spinlocks_enabled))
+		return;
+	if (ACCESS_ONCE(pv->cpustate) == PV_CPU_HALTED)
+		ACCESS_ONCE(((union arch_qspinlock *)lock)->lock)
+			= _QSPINLOCK_LOCKED_SLOWPATH;
+}
+
+/**
+ * pv_set_vars - set nxtcpu_p1 in previous PV and prev in current PV
+ * @pv  : pointer to struct pv_qvars
+ * @ppv : pointer to struct pv_qvars of previous node
+ * @cpu : cpu number
+ * @prev: pointer to the previous queue node
+ */
+static __always_inline void pv_set_vars(struct pv_qvars *pv,
+			struct pv_qvars *ppv, int cpu, struct qnode *prev)
+{
+	ppv->nxtcpu_p1 = cpu + 1;
+	pv->prev       = prev;
+}
+
+/**
+ * pv_set_prev - set previous queue node pointer
+ * @pv  : pointer to struct pv_qvars to be set
+ * @prev: pointer to the previous node
+ */
+static __always_inline void pv_set_prev(struct pv_qvars *pv, struct qnode *prev)
+{
+	ACCESS_ONCE(pv->prev) = prev;
+}
+
+/*
+ * The following inlined functions are being used by the
+ * queue_spin_unlock_slowpath() function.
+ */
+
+/**
+ * pv_get_prev - get previous queue node pointer
+ * @pv   : pointer to struct pv_qvars to be set
+ * Return: the previous queue node pointer
+ */
+static __always_inline struct qnode *pv_get_prev(struct pv_qvars *pv)
+{
+	return ACCESS_ONCE(pv->prev);
+}
+
+/**
+ * pv_kick_node - kick up the CPU of the given node
+ * @pv  : pointer to struct pv_qvars of the node to be kicked
+ */
+static __always_inline void pv_kick_node(struct pv_qvars *pv)
+{
+	if (pv->cpustate != PV_CPU_HALTED)
+		return;
+	ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
+	__queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);
+}
+
+#endif /* _ASM_X86_PVQSPINLOCK_H */
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 0e6740a..4f85c33 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -38,7 +38,11 @@ union arch_qspinlock {
  * that the clearing the lock bit is done ASAP without artificial delay
  * due to compiler optimization.
  */
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+static __always_inline void __queue_spin_unlock(struct qspinlock *lock)
+#else
 static inline void queue_spin_unlock(struct qspinlock *lock)
+#endif
 {
 	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 
@@ -47,6 +51,37 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 	barrier();
 }
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/*
+ * The lock byte can have a value of _QSPINLOCK_LOCKED_SLOWPATH to indicate
+ * that it needs to go through the slowpath to do the unlocking.
+ */
+#define _QSPINLOCK_LOCKED_SLOWPATH	3	/* Set both bits 0 & 1 */
+
+extern void queue_spin_unlock_slowpath(struct qspinlock *lock);
+
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	barrier();
+	if (static_key_false(&paravirt_spinlocks_enabled)) {
+		/*
+		 * Need to atomically clear the lock byte to avoid racing with
+		 * queue head waiter trying to set _QSPINLOCK_LOCKED_SLOWPATH.
+		 */
+		if (likely(cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED, 0)
+				== _QSPINLOCK_LOCKED))
+			return;
+		else
+			queue_spin_unlock_slowpath(lock);
+
+	} else {
+		__queue_spin_unlock(lock);
+	}
+}
+#endif
+
 #endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
 
 #include <asm-generic/qspinlock.h>
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 8c67cbe..d98547f 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -11,9 +11,14 @@
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
+#ifdef CONFIG_QUEUE_SPINLOCK
+	.kick_cpu = paravirt_nop,
+	.hibernate = paravirt_nop,
+#else
 	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
 	.unlock_kick = paravirt_nop,
 #endif
+#endif
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 0030fad..a07cf8c 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -58,6 +58,31 @@
  */
 
 /*
+ * Para-virtualized queue spinlock support
+ */
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#include <asm/pvqspinlock.h>
+#else
+
+#define PV_SET_VAR(type, var, val)
+#define PV_VAR(var)			0
+
+struct qnode;
+struct pv_qvars {};
+static inline void pv_init_vars(struct pv_qvars *pv, int cpu_nr)	{}
+static inline void pv_set_vars(struct pv_qvars *pv, struct pv_qvars *ppv,
+			int cpu, struct qnode *prev)			{}
+static inline void pv_head_spin_check(struct pv_qvars *pv, int *count,
+			u32 qcode, struct qspinlock *lock)		{}
+static inline void pv_queue_spin_check(struct pv_qvars *pv, int *count)	{}
+static inline void pv_next_node_check(struct pv_qvars *pv, void *lock)	{}
+static inline void pv_kick_node(struct pv_qvars *pv)			{}
+static inline void pv_set_prev(struct pv_qvars *pv, struct qnode *prev)	{}
+static inline struct qnode *pv_get_prev(struct pv_qvars *pv)
+{ return NULL; }
+#endif
+
+/*
  * The 24-bit queue node code is divided into the following 2 fields:
  * Bits 0-1 : queue node index (4 nodes)
  * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
@@ -84,6 +109,7 @@
  */
 struct qnode {
 	u32		 wait;		/* Waiting flag		*/
+	struct pv_qvars	 pv;		/* Para-virtualization	*/
 	struct qnode	*next;		/* Next queue node addr */
 };
 
@@ -341,6 +367,11 @@ static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
 { return 0; }
 #endif
 
+#ifndef queue_get_qcode
+#define queue_get_qcode(lock)	(atomic_read(&(lock)->qlcode) &\
+				 ~_QSPINLOCK_LOCKED)
+#endif
+
 #ifndef queue_get_lock_qcode
 /**
  * queue_get_lock_qcode - get the lock & qcode values
@@ -496,6 +527,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 	unsigned int cpu_nr, qn_idx;
 	struct qnode *node, *next;
 	u32 prev_qcode, my_qcode;
+	PV_SET_VAR(int, hcnt, 0);
 
 	/*
 	 * Try the quick spinning code path
@@ -523,6 +555,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 	 */
 	node->wait = true;
 	node->next = NULL;
+	pv_init_vars(&node->pv, cpu_nr);
 
 	/*
 	 * The lock may be available at this point, try again if no task was
@@ -552,13 +585,25 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 		 * and set up the "next" fields of the that node.
 		 */
 		struct qnode *prev = xlate_qcode(prev_qcode);
+		PV_SET_VAR(int, qcnt, 0);
 
 		ACCESS_ONCE(prev->next) = node;
+
+		/*
+		 * Set current CPU number into the previous node and the
+		 * previous node address into the current node.
+		 */
+		pv_set_vars(&node->pv, &prev->pv, cpu_nr, prev);
+
 		/*
 		 * Wait until the waiting flag is off
 		 */
-		while (smp_load_acquire(&node->wait))
+		while (smp_load_acquire(&node->wait)) {
 			arch_mutex_cpu_relax();
+			pv_queue_spin_check(&node->pv, PV_VAR(&qcnt));
+		}
+	} else {
+		ACCESS_ONCE(node->wait) = false;	/* At queue head */
 	}
 
 	/*
@@ -585,6 +630,11 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 				goto release_node;
 		}
 		arch_mutex_cpu_relax();
+
+		/*
+		 * Perform para-virtualization checks
+		 */
+		pv_head_spin_check(&node->pv, PV_VAR(&hcnt), prev_qcode, lock);
 	}
 
 notify_next:
@@ -596,9 +646,53 @@ notify_next:
 	/*
 	 * The next one in queue is now at the head
 	 */
+	pv_next_node_check(&next->pv, lock);
 	smp_store_release(&next->wait, false);
 
 release_node:
 	put_qnode();
 }
 EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/**
+ * queue_spin_unlock_slowpath - kick up the CPU of the queue head
+ * @lock : Pointer to queue spinlock structure
+ *
+ * The lock is released after finding the queue head to avoid racing
+ * condition between the queue head and the lock holder.
+ */
+void queue_spin_unlock_slowpath(struct qspinlock *lock)
+{
+	struct qnode *node, *prev;
+	u32 qcode = (u32)queue_get_qcode(lock);
+
+	/*
+	 * Get the queue tail node
+	 */
+	node = xlate_qcode(qcode);
+
+	/*
+	 * Locate the queue head node by following the prev pointer from
+	 * tail to head.
+	 * It is assumed that the PV guests won't have that many CPUs so
+	 * that it won't take a long time to follow the pointers.
+	 */
+	while (ACCESS_ONCE(node->wait)) {
+		prev = pv_get_prev(&node->pv);
+		if (prev)
+			node = prev;
+		else
+			/*
+			 * Delay a bit to allow the prev pointer to be set up
+			 */
+			arch_mutex_cpu_relax();
+	}
+	/*
+	 * Found the queue head, now release the lock before waking it up
+	 */
+	__queue_spin_unlock(lock);
+	pv_kick_node(&node->pv);
+}
+EXPORT_SYMBOL(queue_spin_unlock_slowpath);
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (15 preceding siblings ...)
  2014-03-12 18:54 ` Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` Waiman Long
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Rik van Riel,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Boris Ostrovsky,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	linux-kernel, David Vrabel, Andrew

This patch adds para-virtualization support to the queue spinlock in
the same way as was done in the PV ticket lock code. In essence, the
lock waiters will spin for a specified number of times (QSPIN_THRESHOLD
= 2^14) and then halt themselves. The queue head waiter will spin
2*QSPIN_THRESHOLD times before halting itself. When it has spun
QSPIN_THRESHOLD times, the queue head will assume that the lock
holder may have been scheduled out and attempt to kick the lock holder
CPU if it has the CPU number on hand. Before halting, the queue head
waiter will set a flag (_QSPINLOCK_LOCKED_SLOWPATH) in the lock byte
to indicate that the unlock slowpath has to be invoked.

In the unlock slowpath, the current lock holder will find the queue
head by following the previous-node pointer links stored in the
queue node structure until it finds one that has the wait flag turned
off. It then attempts to kick the CPU of the queue head.

After the queue head has acquired the lock, it will also check the
status of the next node and set _QSPINLOCK_LOCKED_SLOWPATH if that
node has been halted.
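
The following pseudo-code (condensed from the real code further down;
lock_is_free() and holder_cpu stand in for the checks done there, and
the CPU state tracking is left out) summarizes the waiting and
unlocking protocol:

	/* Queue head waiter, see pv_head_spin_check() */
	while (!lock_is_free(lock)) {
		if (++count == QSPIN_THRESHOLD && holder_cpu_known)
			__queue_kick_cpu(holder_cpu, PV_KICK_LOCK_HOLDER);
		if (count >= 2*QSPIN_THRESHOLD) {
			/* Ask the lock holder to take the unlock slowpath */
			cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED,
				_QSPINLOCK_LOCKED_SLOWPATH);
			__queue_hibernate();	/* halt until kicked */
			count = 0;
		}
	}

	/* Lock holder unlock slowpath, see queue_spin_unlock_slowpath() */
	node = xlate_qcode(queue_get_qcode(lock));	/* queue tail */
	while (ACCESS_ONCE(node->wait)) {		/* walk back to head */
		prev = pv_get_prev(&node->pv);
		if (prev)
			node = prev;
	}
	__queue_spin_unlock(lock);
	pv_kick_node(&node->pv);	/* wake up the halted queue head */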

Enabling the PV code does have a performance impact on spinlock
acquisitions and releases. The following table shows the execution
time (in ms) of a spinlock micro-benchmark that does lock/unlock
operations 5M times for each task versus the number of contending
tasks on a Westmere-EX system.

  # of        Ticket lock              Queue lock
  tasks   PV off/PV on/%Change   PV off/PV on/%Change
  ------  --------------------   ---------------------
    1        135/  179/+33%         137/  169/+23%
    2       1045/ 1103/ +6%        1120/ 1536/+37%
    3       1827/ 2683/+47%        2313/ 2425/ +5%
    4       2689/ 4191/+56%        2914/ 3128/ +7%
    5       3736/ 5830/+56%        3715/ 3762/ +1%
    6       4942/ 7609/+54%        4504/ 4558/ +2%
    7       6304/ 9570/+52%        5292/ 5351/ +1%
    8       7736/11323/+46%        6037/ 6097/ +1%

It can be seen that the ticket lock PV code has a fairly big decrease
in performance when there are 3 or more contending tasks. The
queue spinlock PV code, on the other hand, only has a minor drop in
performance for 3 or more contending tasks.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/paravirt.h       |   12 ++-
 arch/x86/include/asm/paravirt_types.h |   12 ++
 arch/x86/include/asm/pvqspinlock.h    |  232 +++++++++++++++++++++++++++++++++
 arch/x86/include/asm/qspinlock.h      |   35 +++++
 arch/x86/kernel/paravirt-spinlocks.c  |    5 +
 kernel/locking/qspinlock.c            |   96 ++++++++++++++-
 6 files changed, 390 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/pvqspinlock.h

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cd6e161..cabc37a 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -711,7 +711,17 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 }
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#ifdef CONFIG_QUEUE_SPINLOCK
+static __always_inline void __queue_kick_cpu(int cpu, enum pv_kick_type type)
+{
+	PVOP_VCALL2(pv_lock_ops.kick_cpu, cpu, type);
+}
 
+static __always_inline void __queue_hibernate(void)
+{
+	PVOP_VCALL0(pv_lock_ops.hibernate);
+}
+#else
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
 							__ticket_t ticket)
 {
@@ -723,7 +733,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
 {
 	PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
-
+#endif
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..fa16aa6 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,21 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+enum pv_kick_type {
+	PV_KICK_LOCK_HOLDER,
+	PV_KICK_QUEUE_HEAD
+};
+#endif
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+	void (*kick_cpu)(int cpu, enum pv_kick_type);
+	void (*hibernate)(void);
+#else
 	struct paravirt_callee_save lock_spinning;
 	void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h
new file mode 100644
index 0000000..13cbc4f
--- /dev/null
+++ b/arch/x86/include/asm/pvqspinlock.h
@@ -0,0 +1,232 @@
+#ifndef _ASM_X86_PVQSPINLOCK_H
+#define _ASM_X86_PVQSPINLOCK_H
+
+/*
+ *	Queue Spinlock Para-Virtualization (PV) Support
+ *
+ *	+------+	    +-----+ nxtcpu_p1  +----+
+ *	| Lock |	    |Queue|----------->|Next|
+ *	|Holder|<-----------|Head |<-----------|Node|
+ *	+------+ prev_qcode +-----+ prev_qcode +----+
+ *
+ * As long as the current lock holder passes through the slowpath, the queue
+ * head CPU will have its CPU number stored in prev_qcode. The situation is
+ * the same for the node next to the queue head.
+ *
+ * The next node, while setting up the next pointer in the queue head, can
+ * also store its CPU number in that node. With that change, the queue head
+ * will have the CPU numbers of both its upstream and downstream neighbors.
+ *
+ * The PV support code for queue spinlock is roughly the same as that
+ * of the ticket spinlock. Each CPU waiting for the lock will spin until it
+ * reaches a threshold. When that happens, it will put itself to halt so
+ * that the hypervisor can reuse the CPU in some other guests.
+ *
+ * The differences between the two versions of PV support are:
+ * 1) The queue head will spin twice as long as the other nodes before it
+ *    puts itself to halt.
+ * 2) The queue head will also attempt to kick the lock holder, if it has
+ *    the CPU number, in the half way point.
+ */
+
+/*
+ * Spin threshold for queue spinlock
+ * This is half of the ticket lock's SPIN_THRESHOLD. The queue head will
+ * be halted after 2*QSPIN_THRESHOLD whereas the other nodes will be
+ * halted after QSPIN_THRESHOLD.
+ */
+#define	QSPIN_THRESHOLD	(1U<<14)
+
+/*
+ * PV macros
+ */
+#define PV_SET_VAR(type, var, val)	type var = val
+#define PV_VAR(var)			var
+
+/*
+ * CPU state flags
+ */
+#define	PV_CPU_ACTIVE	1	/* This CPU is active		 */
+#define	PV_CPU_KICKING	2	/* This CPU is kicking other CPU */
+#define PV_CPU_KICKED   3	/* This CPU is being kicked	 */
+#define PV_CPU_HALTED	-1	/* This CPU is halted		 */
+
+/*
+ * Additional fields to be added to the qnode structure
+ */
+#if CONFIG_NR_CPUS >= (1 << 16)
+#define _cpuid_t	u32
+#else
+#define _cpuid_t	u16
+#endif
+
+struct qnode;
+
+struct pv_qvars {
+	s16	      cpustate;		/* CPU status flag		*/
+	_cpuid_t      nxtcpu_p1;	/* CPU number of next node + 1	*/
+	_cpuid_t      mycpu;		/* CPU number of this node	*/
+	struct qnode *prev;		/* Pointer to previous node	*/
+};
+
+/**
+ * pv_init_vars - initialize fields in struct pv_qvars
+ * @pv : pointer to struct pv_qvars
+ * @cpu: current CPU number
+ */
+static __always_inline void pv_init_vars(struct pv_qvars *pv, int cpu)
+{
+	pv->cpustate  = PV_CPU_ACTIVE;
+	pv->prev      = NULL;
+	pv->nxtcpu_p1 = 0;
+	pv->mycpu     = cpu;
+}
+
+/**
+ * pv_head_spin_check - perform para-virtualization checks for queue head
+ * @pv    : pointer to struct pv_qvars
+ * @count : loop count
+ * @qcode : queue code of the supposed lock holder
+ * @lock  : pointer to the qspinlock structure
+ *
+ * The following checks will be done:
+ * 1) Attempt to kick the lock holder, if known, after QSPIN_THRESHOLD
+ * 2) Halt itself if lock is still not available after 2*QSPIN_THRESHOLD
+ */
+static __always_inline void pv_head_spin_check(struct pv_qvars *pv, int *count,
+				u32 qcode, struct qspinlock *lock)
+{
+	if (!static_key_false(&paravirt_spinlocks_enabled))
+		return;
+	if (unlikely((++(*count) == QSPIN_THRESHOLD) && qcode)) {
+		/*
+		 * Get the CPU number of the lock holder & kick it
+		 * The lock may have been stolen by another CPU
+		 * if PARAVIRT_UNFAIR_LOCKS is set, so the computed
+		 * CPU number may not be the actual lock holder.
+		 */
+		int cpu = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKING;
+		__queue_kick_cpu(cpu, PV_KICK_LOCK_HOLDER);
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+	}
+	if (unlikely(*count >= 2*QSPIN_THRESHOLD)) {
+		u8 lockval;
+
+		/*
+		 * Set the lock byte to _QSPINLOCK_LOCKED_SLOWPATH before
+		 * trying to hibernate itself. It is possible that the
+		 * lock byte had been set to _QSPINLOCK_LOCKED_SLOWPATH
+		 * already. In this case, just proceed to sleep.
+		 */
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
+		lockval = cmpxchg(&((union arch_qspinlock *)lock)->lock,
+			  _QSPINLOCK_LOCKED, _QSPINLOCK_LOCKED_SLOWPATH);
+		if (lockval == 0) {
+			/*
+			 * Can exit now as the lock is free
+			 */
+			ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+			*count = 0;
+			return;
+		}
+		__queue_hibernate();
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+		*count = 0;	/* Reset count */
+	}
+}
+
+/**
+ * pv_queue_spin_check - perform para-virtualization checks for queue member
+ * @pv   : pointer to struct pv_qvars
+ * @count: loop count
+ */
+static __always_inline void pv_queue_spin_check(struct pv_qvars *pv, int *count)
+{
+	if (!static_key_false(&paravirt_spinlocks_enabled))
+		return;
+	/*
+	 * Attempt to halt oneself after QSPIN_THRESHOLD spins
+	 */
+	if (unlikely(++(*count) >= QSPIN_THRESHOLD)) {
+		/*
+		 * Time to hibernate itself
+		 */
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
+		__queue_hibernate();
+		ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
+		*count = 0;	/* Reset count */
+	}
+}
+
+/**
+ * pv_next_node_check - set _QSPINLOCK_LOCKED_SLOWPATH flag if the next node
+ *			is halted
+ * @pv   : pointer to struct pv_qvars
+ * @lock : pointer to the qspinlock structure
+ *
+ * The current CPU should have gotten the lock before calling this function.
+ */
+static __always_inline void
+pv_next_node_check(struct pv_qvars *pv, struct qspinlock *lock)
+{
+	if (!static_key_false(&paravirt_spinlocks_enabled))
+		return;
+	if (ACCESS_ONCE(pv->cpustate) == PV_CPU_HALTED)
+		ACCESS_ONCE(((union arch_qspinlock *)lock)->lock)
+			= _QSPINLOCK_LOCKED_SLOWPATH;
+}
+
+/**
+ * pv_set_vars - set nxtcpu_p1 in previous PV and prev in current PV
+ * @pv  : pointer to struct pv_qvars
+ * @ppv : pointer to struct pv_qvars of previous node
+ * @cpu : cpu number
+ * @prev: pointer to the previous queue node
+ */
+static __always_inline void pv_set_vars(struct pv_qvars *pv,
+			struct pv_qvars *ppv, int cpu, struct qnode *prev)
+{
+	ppv->nxtcpu_p1 = cpu + 1;
+	pv->prev       = prev;
+}
+
+/**
+ * pv_set_prev - set previous queue node pointer
+ * @pv  : pointer to struct pv_qvars to be set
+ * @prev: pointer to the previous node
+ */
+static __always_inline void pv_set_prev(struct pv_qvars *pv, struct qnode *prev)
+{
+	ACCESS_ONCE(pv->prev) = prev;
+}
+
+/*
+ * The following inlined functions are being used by the
+ * queue_spin_unlock_slowpath() function.
+ */
+
+/**
+ * pv_get_prev - get previous queue node pointer
+ * @pv   : pointer to struct pv_qvars
+ * Return: the previous queue node pointer
+ */
+static __always_inline struct qnode *pv_get_prev(struct pv_qvars *pv)
+{
+	return ACCESS_ONCE(pv->prev);
+}
+
+/**
+ * pv_kick_node - kick the CPU of the given node
+ * @pv  : pointer to struct pv_qvars of the node to be kicked
+ */
+static __always_inline void pv_kick_node(struct pv_qvars *pv)
+{
+	if (pv->cpustate != PV_CPU_HALTED)
+		return;
+	ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
+	__queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);
+}
+
+#endif /* _ASM_X86_PVQSPINLOCK_H */
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 0e6740a..4f85c33 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -38,7 +38,11 @@ union arch_qspinlock {
  * that clearing the lock bit is done ASAP without artificial delay
  * due to compiler optimization.
  */
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+static __always_inline void __queue_spin_unlock(struct qspinlock *lock)
+#else
 static inline void queue_spin_unlock(struct qspinlock *lock)
+#endif
 {
 	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 
@@ -47,6 +51,37 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 	barrier();
 }
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/*
+ * The lock byte can have a value of _QSPINLOCK_LOCKED_SLOWPATH to indicate
+ * that it needs to go through the slowpath to do the unlocking.
+ */
+#define _QSPINLOCK_LOCKED_SLOWPATH	3	/* Set both bits 0 & 1 */
+
+extern void queue_spin_unlock_slowpath(struct qspinlock *lock);
+
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	barrier();
+	if (static_key_false(&paravirt_spinlocks_enabled)) {
+		/*
+		 * Need to atomically clear the lock byte to avoid racing with
+		 * queue head waiter trying to set _QSPINLOCK_LOCKED_SLOWPATH.
+		 */
+		if (likely(cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED, 0)
+				== _QSPINLOCK_LOCKED))
+			return;
+		else
+			queue_spin_unlock_slowpath(lock);
+
+	} else {
+		__queue_spin_unlock(lock);
+	}
+}
+#endif
+
 #endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
 
 #include <asm-generic/qspinlock.h>
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 8c67cbe..d98547f 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -11,9 +11,14 @@
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
+#ifdef CONFIG_QUEUE_SPINLOCK
+	.kick_cpu = paravirt_nop,
+	.hibernate = paravirt_nop,
+#else
 	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
 	.unlock_kick = paravirt_nop,
 #endif
+#endif
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 0030fad..a07cf8c 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -58,6 +58,31 @@
  */
 
 /*
+ * Para-virtualized queue spinlock support
+ */
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#include <asm/pvqspinlock.h>
+#else
+
+#define PV_SET_VAR(type, var, val)
+#define PV_VAR(var)			0
+
+struct qnode;
+struct pv_qvars {};
+static inline void pv_init_vars(struct pv_qvars *pv, int cpu_nr)	{}
+static inline void pv_set_vars(struct pv_qvars *pv, struct pv_qvars *ppv,
+			int cpu, struct qnode *prev)			{}
+static inline void pv_head_spin_check(struct pv_qvars *pv, int *count,
+			u32 qcode, struct qspinlock *lock)		{}
+static inline void pv_queue_spin_check(struct pv_qvars *pv, int *count)	{}
+static inline void pv_next_node_check(struct pv_qvars *pv, void *lock)	{}
+static inline void pv_kick_node(struct pv_qvars *pv)			{}
+static inline void pv_set_prev(struct pv_qvars *pv, struct qnode *prev)	{}
+static inline struct qnode *pv_get_prev(struct pv_qvars *pv)
+{ return NULL; }
+#endif
+
+/*
  * The 24-bit queue node code is divided into the following 2 fields:
  * Bits 0-1 : queue node index (4 nodes)
  * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
@@ -84,6 +109,7 @@
  */
 struct qnode {
 	u32		 wait;		/* Waiting flag		*/
+	struct pv_qvars	 pv;		/* Para-virtualization	*/
 	struct qnode	*next;		/* Next queue node addr */
 };
 
@@ -341,6 +367,11 @@ static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
 { return 0; }
 #endif
 
+#ifndef queue_get_qcode
+#define queue_get_qcode(lock)	(atomic_read(&(lock)->qlcode) &\
+				 ~_QSPINLOCK_LOCKED)
+#endif
+
 #ifndef queue_get_lock_qcode
 /**
  * queue_get_lock_qcode - get the lock & qcode values
@@ -496,6 +527,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 	unsigned int cpu_nr, qn_idx;
 	struct qnode *node, *next;
 	u32 prev_qcode, my_qcode;
+	PV_SET_VAR(int, hcnt, 0);
 
 	/*
 	 * Try the quick spinning code path
@@ -523,6 +555,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 	 */
 	node->wait = true;
 	node->next = NULL;
+	pv_init_vars(&node->pv, cpu_nr);
 
 	/*
 	 * The lock may be available at this point, try again if no task was
@@ -552,13 +585,25 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 		 * and set up the "next" fields of that node.
 		 */
 		struct qnode *prev = xlate_qcode(prev_qcode);
+		PV_SET_VAR(int, qcnt, 0);
 
 		ACCESS_ONCE(prev->next) = node;
+
+		/*
+		 * Set current CPU number into the previous node and the
+		 * previous node address into the current node.
+		 */
+		pv_set_vars(&node->pv, &prev->pv, cpu_nr, prev);
+
 		/*
 		 * Wait until the waiting flag is off
 		 */
-		while (smp_load_acquire(&node->wait))
+		while (smp_load_acquire(&node->wait)) {
 			arch_mutex_cpu_relax();
+			pv_queue_spin_check(&node->pv, PV_VAR(&qcnt));
+		}
+	} else {
+		ACCESS_ONCE(node->wait) = false;	/* At queue head */
 	}
 
 	/*
@@ -585,6 +630,11 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 				goto release_node;
 		}
 		arch_mutex_cpu_relax();
+
+		/*
+		 * Perform para-virtualization checks
+		 */
+		pv_head_spin_check(&node->pv, PV_VAR(&hcnt), prev_qcode, lock);
 	}
 
 notify_next:
@@ -596,9 +646,53 @@ notify_next:
 	/*
 	 * The next one in queue is now at the head
 	 */
+	pv_next_node_check(&next->pv, lock);
 	smp_store_release(&next->wait, false);
 
 release_node:
 	put_qnode();
 }
 EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/**
+ * queue_spin_unlock_slowpath - kick the CPU of the queue head
+ * @lock : Pointer to queue spinlock structure
+ *
+ * The lock is released after finding the queue head to avoid a race
+ * condition between the queue head and the lock holder.
+ */
+void queue_spin_unlock_slowpath(struct qspinlock *lock)
+{
+	struct qnode *node, *prev;
+	u32 qcode = (u32)queue_get_qcode(lock);
+
+	/*
+	 * Get the queue tail node
+	 */
+	node = xlate_qcode(qcode);
+
+	/*
+	 * Locate the queue head node by following the prev pointer from
+	 * tail to head.
+	 * It is assumed that PV guests won't have so many CPUs that
+	 * following the pointers takes a long time.
+	 */
+	while (ACCESS_ONCE(node->wait)) {
+		prev = pv_get_prev(&node->pv);
+		if (prev)
+			node = prev;
+		else
+			/*
+			 * Delay a bit to allow the prev pointer to be set up
+			 */
+			arch_mutex_cpu_relax();
+	}
+	/*
+	 * Found the queue head, now release the lock before waking it up
+	 */
+	__queue_spin_unlock(lock);
+	pv_kick_node(&node->pv);
+}
+EXPORT_SYMBOL(queue_spin_unlock_slowpath);
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (18 preceding siblings ...)
  2014-03-12 18:54 ` [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-13 13:59     ` Paolo Bonzini
                     ` (3 more replies)
  2014-03-12 18:54 ` [PATCH RFC v6 11/11] pvqspinlock, x86: Enable qspinlock PV support for XEN Waiman Long
  2014-03-12 18:54 ` Waiman Long
  21 siblings, 4 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, linux-kernel, David Vrabel, Andrew Morton, Linu

This patch adds the necessary KVM specific code to allow KVM to support
the sleeping and CPU kicking operations needed by the queue spinlock PV
code.

A KVM guest of 20 CPU cores was created to run the disk workload of
the AIM7 benchmark on both ext4 and xfs RAM disks at 3000 users on a
3.14-rc6 based kernel. The JPM (jobs/minute) data of the test run were:

  kernel                        XFS FS  %change ext4 FS %change
  ------                        ------  ------- ------- -------
  PV ticketlock (baseline)      2409639    -    1289398    -
  qspinlock                     2396804  -0.5%  1285714  -0.3%
  PV qspinlock                  2380952  -1.2%  1266714  -1.8%
  unfair qspinlock              2403204  -0.3%  1503759   +17%
  unfair + PV qspinlock         2425876  +0.8%  1530612   +19%

The XFS test had moderate spinlock contention of 1.6% whereas the
ext4 test had heavy spinlock contention of 15.4% as reported by perf.

The PV code doesn't seem to help performance much because the
sleeping/kicking logic wasn't activated during the test run, as shown
by the statistics data in debugfs. The unfair lock, on the other hand,
did help to improve performance, especially in the ext4 filesystem test.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/kernel/kvm.c |   87 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/Kconfig.locks  |    2 +-
 2 files changed, 88 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index f318e78..aaf704e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -568,6 +568,7 @@ static void kvm_kick_cpu(int cpu)
 	kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 enum kvm_contention_stat {
 	TAKEN_SLOW,
 	TAKEN_SLOW_PICKUP,
@@ -795,6 +796,87 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 		}
 	}
 }
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_KVM_DEBUG_FS
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+static u32 lh_kick_stats;	/* Lock holder kick count */
+static u32 qh_kick_stats;	/* Queue head kick count  */
+static u32 hibernate_stats;	/* Hibernation count	  */
+
+static int __init kvm_spinlock_debugfs(void)
+{
+	d_kvm_debug = debugfs_create_dir("kvm-guest", NULL);
+	if (!d_kvm_debug) {
+		printk(KERN_WARNING
+		       "Could not create 'kvm-guest' debugfs directory\n");
+		return -ENOMEM;
+	}
+	d_spin_debug = debugfs_create_dir("spinlocks", d_kvm_debug);
+
+	debugfs_create_u32("lh_kick_stats", 0644, d_spin_debug, &lh_kick_stats);
+	debugfs_create_u32("qh_kick_stats", 0644, d_spin_debug, &qh_kick_stats);
+	debugfs_create_u32("hibernate_stats",
+			   0644, d_spin_debug, &hibernate_stats);
+	return 0;
+}
+
+static inline void inc_kick_stats(enum pv_kick_type type)
+{
+	if (type == PV_KICK_LOCK_HOLDER)
+		add_smp(&lh_kick_stats, 1);
+	else /* type == PV_KICK_QUEUE_HEAD */
+		add_smp(&qh_kick_stats, 1);
+}
+
+static inline void inc_hib_stats(void)
+{
+	add_smp(&hibernate_stats, 1);
+}
+
+fs_initcall(kvm_spinlock_debugfs);
+
+#else /* CONFIG_KVM_DEBUG_FS */
+static inline void inc_kick_stats(enum pv_kick_type type)
+{
+}
+
+static inline void inc_hib_stats(void)
+{
+
+}
+#endif /* CONFIG_KVM_DEBUG_FS */
+
+static void kvm_kick_cpu_type(int cpu, enum pv_kick_type type)
+{
+	kvm_kick_cpu(cpu);
+	inc_kick_stats(type);
+}
+
+/*
+ * Halt the current CPU & release it back to the host
+ */
+static void kvm_hibernate(void)
+{
+	unsigned long flags;
+
+	if (in_nmi())
+		return;
+
+	inc_hib_stats();
+	/*
+	 * Make sure an interrupt handler can't upset things in a
+	 * partially setup state.
+	 */
+	local_irq_save(flags);
+	if (arch_irqs_disabled_flags(flags))
+		halt();
+	else
+		safe_halt();
+	local_irq_restore(flags);
+}
+#endif /* !CONFIG_QUEUE_SPINLOCK */
 
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
@@ -807,8 +889,13 @@ void __init kvm_spinlock_init(void)
 	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
 		return;
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+	pv_lock_ops.kick_cpu = kvm_kick_cpu_type;
+	pv_lock_ops.hibernate = kvm_hibernate;
+#else
 	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
 	pv_lock_ops.unlock_kick = kvm_unlock_kick;
+#endif
 }
 
 static __init int kvm_spinlock_init_jump(void)
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index f185584..a70fdeb 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK
 
 config QUEUE_SPINLOCK
 	def_bool y if ARCH_USE_QUEUE_SPINLOCK
-	depends on SMP && !PARAVIRT_SPINLOCKS
+	depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH RFC v6 11/11] pvqspinlock, x86: Enable qspinlock PV support for XEN
  2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (19 preceding siblings ...)
  2014-03-12 18:54 ` Waiman Long
@ 2014-03-12 18:54 ` Waiman Long
  2014-03-12 18:54 ` Waiman Long
  21 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 18:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, linux-kernel, David Vrabel, Andrew Morton, Linu

This patch adds the necessary XEN specific code to allow XEN to support
the sleeping and CPU kicking operations needed by the queue spinlock PV
code.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/xen/spinlock.c |   95 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/Kconfig.locks    |    2 +-
 2 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 06f4a64..ae97c57 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -17,6 +17,12 @@
 #include "xen-ops.h"
 #include "debugfs.h"
 
+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(char *, irq_name);
+static bool xen_pvspin = true;
+
+#ifndef CONFIG_QUEUE_SPINLOCK
+
 enum xen_contention_stat {
 	TAKEN_SLOW,
 	TAKEN_SLOW_PICKUP,
@@ -100,12 +106,9 @@ struct xen_lock_waiting {
 	__ticket_t want;
 };
 
-static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
-static DEFINE_PER_CPU(char *, irq_name);
 static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
 static cpumask_t waiting_cpus;
 
-static bool xen_pvspin = true;
 __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 {
 	int irq = __this_cpu_read(lock_kicker_irq);
@@ -213,6 +216,78 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
 	}
 }
 
+#else /* CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_XEN_DEBUG_FS
+static u32 lh_kick_stats;	/* Lock holder kick count */
+static u32 qh_kick_stats;	/* Queue head kick count  */
+static u32 hibernate_stats;	/* Hibernation count	  */
+
+static inline void inc_kick_stats(enum pv_kick_type type)
+{
+	if (type == PV_KICK_LOCK_HOLDER)
+		add_smp(&lh_kick_stats, 1);
+	else /* type == PV_KICK_QUEUE_HEAD */
+		add_smp(&qh_kick_stats, 1);
+}
+
+static inline void inc_hib_stats(void)
+{
+	add_smp(&hibernate_stats, 1);
+}
+#else /* CONFIG_XEN_DEBUG_FS */
+static inline void inc_kick_stats(enum pv_kick_type type)
+{
+}
+
+static inline void inc_hib_stats(void)
+{
+
+}
+#endif /* CONFIG_XEN_DEBUG_FS */
+
+static void xen_kick_cpu_type(int cpu, enum pv_kick_type type)
+{
+	xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+	inc_kick_stats(type);
+}
+
+/*
+ * Halt the current CPU & release it back to the host
+ */
+static void xen_hibernate(void)
+{
+	int irq = __this_cpu_read(lock_kicker_irq);
+	unsigned long flags;
+
+	/* If kicker interrupts not initialized yet, just spin */
+	if (irq == -1)
+		return;
+
+	/*
+	 * Make sure an interrupt handler can't upset things in a
+	 * partially setup state.
+	 */
+	local_irq_save(flags);
+
+	inc_hib_stats();
+	/* clear pending */
+	xen_clear_irq_pending(irq);
+
+	/* Allow interrupts while blocked */
+	local_irq_restore(flags);
+
+	/*
+	 * If an interrupt happens here, it will leave the wakeup irq
+	 * pending, which will cause xen_poll_irq() to return
+	 * immediately.
+	 */
+
+	/* Block until irq becomes pending (or perhaps a spurious wakeup) */
+	xen_poll_irq(irq);
+}
+#endif /* CONFIG_QUEUE_SPINLOCK */
+
 static irqreturn_t dummy_handler(int irq, void *dev_id)
 {
 	BUG();
@@ -258,7 +333,6 @@ void xen_uninit_lock_cpu(int cpu)
 	per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -275,8 +349,13 @@ void __init xen_init_spinlocks(void)
 		return;
 	}
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+	pv_lock_ops.kick_cpu = xen_kick_cpu_type;
+	pv_lock_ops.hibernate = xen_hibernate;
+#else
 	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
 	pv_lock_ops.unlock_kick = xen_unlock_kick;
+#endif
 }
 
 /*
@@ -318,6 +397,7 @@ static int __init xen_spinlock_debugfs(void)
 
 	d_spin_debug = debugfs_create_dir("spinlocks", d_xen);
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 	debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
 
 	debugfs_create_u32("taken_slow", 0444, d_spin_debug,
@@ -337,7 +417,12 @@ static int __init xen_spinlock_debugfs(void)
 
 	debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
 				spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
-
+#else /* CONFIG_QUEUE_SPINLOCK */
+	debugfs_create_u32("lh_kick_stats", 0644, d_spin_debug, &lh_kick_stats);
+	debugfs_create_u32("qh_kick_stats", 0644, d_spin_debug, &qh_kick_stats);
+	debugfs_create_u32("hibernate_stats",
+			   0644, d_spin_debug, &hibernate_stats);
+#endif /* CONFIG_QUEUE_SPINLOCK */
 	return 0;
 }
 fs_initcall(xen_spinlock_debugfs);
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index a70fdeb..451e392 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK
 
 config QUEUE_SPINLOCK
 	def_bool y if ARCH_USE_QUEUE_SPINLOCK
-	depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
+	depends on SMP
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks
  2014-03-12 18:54 ` [PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks Waiman Long
@ 2014-03-12 19:08     ` Waiman Long
  2014-03-12 19:08   ` Waiman Long
  1 sibling, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-12 19:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod

On 03/12/2014 02:54 PM, Waiman Long wrote:
> +
> +		/*
> +		 * Now wait until the lock bit is cleared
> +		 */
> +		while (smp_load_acquire(&qlock->qlcode) & _QSPINLOCK_LOCKED)
> +			arch_mutex_cpu_relax();
> +
> +		/*
> +		 * Set the lock bit & clear the waiting bit simultaneously
> +		 * It is assumed that there is no lock stealing with this
> +		 * quick path active.
> +		 *
> +		 * A direct memory store of _QSPINLOCK_LOCKED into the
> +		 * lock_wait field causes problem with the lockref code, e.g.
> +		 *   ACCESS_ONCE(qlock->lock_wait) = _QSPINLOCK_LOCKED;
> +		 *
> +		 * It is not currently clear why this happens. A workaround
> +		 * is to use atomic instruction to store the new value.
> +		 */
> +		{
> +			u16 lw = xchg(&qlock->lock_wait, _QSPINLOCK_LOCKED);
> +			BUG_ON(lw != _QSPINLOCK_WAITING);
> +		}
> +		return 1;
>

It was found that when I used a direct memory store instead of an atomic 
op, the following kernel crash might happen at filesystem unmount time:

Red Hat Enterprise Linux Server 7.0 (Maipo)
Kernel 3.14.0-rc6-qlock on an x86_64

h11-kvm20 login: [ 1529.934047] BUG: Dentry 
ffff883f4c048480{i=30181e9e,n=libopc
odes-2.23.52.0.1-15.el7.so} still in use (-1) [unmount of xfs dm-1]
[ 1529.935762] ------------[ cut here ]------------
[ 1529.936331] kernel BUG at fs/dcache.c:1343!
[ 1529.936714] invalid opcode: 0000 [#1] SMP
[ 1529.936714] Modules linked in: ext4 mbcache jbd2 binfmt_misc brd 
ip6t_rpfilte
r cfg80211 ip6t_REJECT rfkill ipt_REJECT xt_conntrack ebtable_nat 
ebtable_broute
  bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 
nf_defrag
_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw 
ip6table_filter
  ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
nf_nat nf_c
onntrack iptable_mangle iptable_security iptable_raw iptable_filter 
ip_tables sg
  ppdev snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep 
snd_seq snd_s
eq_device snd_pcm snd_timer snd parport_pc parport soundcore serio_raw 
i2c_piix4
  virtio_console virtio_balloon microcode pcspkr nfsd auth_rpcgss 
nfs_acl lockd s
unrpc uinput xfs libcrc32c sr_mod cdrom ata_generic pata_acpi qxl 
virtio_blk vir
tio_net drm_kms_helper ttm drm ata_piix libata virtio_pci virtio_ring 
floppy i2c
_core virtio dm_mirror dm_region_hash dm_log dm_mod
[ 1529.936714] CPU: 12 PID: 11106 Comm: umount Not tainted 
3.14.0-rc6-qlock #1
[ 1529.936714] Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
[ 1529.936714] task: ffff881f9183b540 ti: ffff881f920fa000 task.ti: 
ffff881f920f
a000
[ 1529.936714] RIP: 0010:[<ffffffff811c185c>]  [<ffffffff811c185c>] 
umount_colle
ct+0xec/0x110
[ 1529.936714] RSP: 0018:ffff881f920fbdc8  EFLAGS: 00010282
[ 1529.936714] RAX: 0000000000000073 RBX: ffff883f4c048480 RCX: 
0000000000000000
[ 1529.936714] RDX: 0000000000000001 RSI: 0000000000000046 RDI: 
0000000000000246
[ 1529.936714] RBP: ffff881f920fbde0 R08: ffffffff819e42e0 R09: 
0000000000000396
[ 1529.936714] R10: 0000000000000000 R11: ffff881f920fbb06 R12: 
ffff881f920fbe60
[ 1529.936714] R13: ffff883f8d458460 R14: ffff883f4c048480 R15: 
ffff883f8d4583c0
[ 1529.936714] FS:  00007f6027b0c880(0000) GS:ffff88403fc40000(0000) 
knlGS:00000
00000000000
[ 1529.936714] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1529.936714] CR2: 00007f60276c4900 CR3: 0000003f421c0000 CR4: 
00000000000006e0
[ 1529.936714] Stack:
[ 1529.936714]  ffff883f8edf4ac8 ffff883f4c048510 ffff883f910a02d0 
ffff881f920fb
e50
[ 1529.936714]  ffffffff811c2d03 0000000000000000 00ff881f920fbe50 
0000896600000
000
[ 1529.936714]  ffff883f8d4587d8 ffff883f8d458780 ffffffff811c1770 
ffff881f920fb
e60
[ 1529.936714] Call Trace:
[ 1529.936714]  [<ffffffff811c2d03>] d_walk+0xc3/0x260
[ 1529.936714]  [<ffffffff811c1770>] ? check_and_collect+0x30/0x30
[ 1529.936714]  [<ffffffff811c3985>] shrink_dcache_for_umount+0x75/0x120
[ 1529.936714]  [<ffffffff811adf21>] generic_shutdown_super+0x21/0xf0
[ 1529.936714]  [<ffffffff811ae207>] kill_block_super+0x27/0x70
[ 1529.936714]  [<ffffffff811ae4ed>] deactivate_locked_super+0x3d/0x60
[ 1529.936714]  [<ffffffff811aea96>] deactivate_super+0x46/0x60
[ 1529.936714]  [<ffffffff811ca277>] mntput_no_expire+0xa7/0x140
[ 1529.936714]  [<ffffffff811cb6ce>] SyS_umount+0x8e/0x100
[ 1529.936714]  [<ffffffff815d2c29>] system_call_fastpath+0x16/0x1b
[ 1529.936714] Code: 00 00 48 8b 40 28 4c 8b 08 48 8b 43 30 48 85 c0 74 
2a 48 8b
  50 40 48 89 34 24 48 c7 c7 e0 4a 7f 81 48 89 de 31 c0 e8 03 cb 3f 00 
<0f> 0b 66
  90 48 89 f7 e8 c8 fc ff ff e9 66 ff ff ff 31 d2 90 eb
[ 1529.936714] RIP  [<ffffffff811c185c>] umount_collect+0xec/0x110
[ 1529.936714]  RSP <ffff881f920fbdc8>
[ 1529.976523] ---[ end trace 6c8ce7cee0969bbb ]---
[ 1529.977137] Kernel panic - not syncing: Fatal exception
[ 1529.978119] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation 
range: 0xf
fffffff80000000-0xffffffff9fffffff)
[ 1529.978119] drm_kms_helper: panic occurred, switching back to text 
console

It was more readily reproducible in a KVM guest. It was harder to
reproduce on a bare metal machine, but the kernel crash still happened
after several tries.

I am not sure what exactly causes this crash, but it likely has something
to do with the interaction between the lockref and the qspinlock code. I
would like more eyes on it to find the root cause.
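
For illustration only, one way to picture the suspected interaction (this
is an assumption about the mechanism, not a confirmed root cause, and the
structure below is a simplified, hypothetical sketch rather than the
actual lockref or qspinlock code):

	/*
	 * lockref packs the spinlock and a reference count into a single
	 * 8-byte word and updates both with one wide cmpxchg while the
	 * lock half looks free.
	 */
	struct lockref_sketch {
		union {
			u64 lock_count;		/* updated by 64-bit cmpxchg */
			struct {
				u32 lock;	/* qspinlock word   */
				u32 count;	/* reference count  */
			};
		};
	};

	/*
	 * CPU A - lockref fast path (simplified pseudo-steps):
	 *	old = ACCESS_ONCE(lr->lock_count);
	 *	if the lock half of 'old' looks unlocked:
	 *		cmpxchg64(&lr->lock_count, old, old + (1ULL << 32));
	 *
	 * CPU B - qspinlock quick path with a plain store:
	 *	ACCESS_ONCE(qlock->lock_wait) = _QSPINLOCK_LOCKED;
	 *
	 * The plain store does not carry the full serialization that the
	 * xchg() workaround above provides, so the two writers of
	 * lock_count may interleave in a way that lets CPU A's cmpxchg
	 * succeed while CPU B believes it owns the lock.
	 */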

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-12 18:54 ` [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest Waiman Long
@ 2014-03-13 10:54     ` David Vrabel
  2014-03-13 10:54     ` David Vrabel
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 135+ messages in thread
From: David Vrabel @ 2014-03-13 10:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod

On 12/03/14 18:54, Waiman Long wrote:
> Locking is always an issue in a virtualized environment as the virtual
> CPU that is waiting on a lock may get scheduled out and hence block
> any progress in lock acquisition even when the lock has been freed.
> 
> One solution to this problem is to allow unfair lock in a
> para-virtualized environment. In this case, a new lock acquirer can
> come and steal the lock if the next-in-line CPU to get the lock is
> scheduled out. Unfair lock in a native environment is generally not a
> good idea as there is a possibility of lock starvation for a heavily
> contended lock.

I do not think this is a good idea -- the problems with unfair locks are
worse in a virtualized guest.  If a waiting VCPU deschedules and has to
be kicked to grab a lock then it is very likely to lose a race with
another running VCPU trying to take a lock (since it takes time for the
VCPU to be rescheduled).
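
[For reference, the lock stealing being discussed boils down to a running
CPU doing a bare atomic compare-and-swap on the lock byte, roughly like the
hypothetical sketch below. This is an illustration of the concept only, not
the exact code in patch 5; unfair_trylock is a made-up name.]

	/*
	 * Any CPU that is currently running may grab a free lock byte
	 * directly, without joining the queue.
	 */
	static inline int unfair_trylock(u8 *lock_byte)
	{
		return cmpxchg(lock_byte, 0, _QSPINLOCK_LOCKED) == 0;
	}

	/*
	 * A VCPU that was halted and has just been kicked still has to be
	 * rescheduled by the hypervisor before it can retry, so it will
	 * usually lose this race to a VCPU that is already running.
	 */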

> With the unfair locking activated on bare metal 4-socket Westmere-EX
> box, the execution times (in ms) of a spinlock micro-benchmark were
> as follows:
> 
>   # of    Ticket       Fair        Unfair
>   tasks    lock      queue lock   queue lock
>   ------   -------   ----------   ----------
>     1         135         135          137
>     2        1045        1120          747
>     3        1827        2345         1084
>     4        2689        2934         1438
>     5        3736        3658         1722
>     6        4942        4434         2092
>     7        6304        5176         2245
>     8        7736        5955         2388

Are these figures with or without the later PV support patches?

David

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-03-12 18:54 ` Waiman Long
@ 2014-03-13 11:21     ` David Vrabel
  2014-03-13 11:21   ` David Vrabel
  1 sibling, 0 replies; 135+ messages in thread
From: David Vrabel @ 2014-03-13 11:21 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod

On 12/03/14 18:54, Waiman Long wrote:
> This patch adds para-virtualization support to the queue spinlock in
> the same way as was done in the PV ticket lock code. In essence, a
> lock waiter will spin for a specified number of iterations
> (QSPIN_THRESHOLD = 2^14) and then halt itself. The queue head waiter
> will spin 2*QSPIN_THRESHOLD times before halting itself. Once it has
> spun QSPIN_THRESHOLD times, the queue head will assume that the lock
> holder may have been scheduled out and attempt to kick the lock
> holder's CPU if it has that CPU number on hand.

I don't really understand the reasoning for kicking the lock holder.  It
will either be running, runnable, or halted because it's in a slow-path
wait for another lock.  In any of these states I do not see how a kick
is useful.
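
In outline, the spin-then-halt scheme being questioned looks something
like the sketch below (the helper names and the no-op halt/kick stubs are
stand-ins for illustration, not the patch's actual code):

	#include <stdatomic.h>

	#define QSPIN_THRESHOLD (1U << 14)

	/* Stand-ins for the PV hooks; in the real series these are pvops. */
	static void pv_halt_self(void)   { /* a real guest would halt here until kicked */ }
	static void pv_kick_cpu(int cpu) { (void)cpu; /* a real guest would hypercall */ }

	/*
	 * Queue-head wait loop as described in the quoted text: spin up to
	 * 2*QSPIN_THRESHOLD times; after the first QSPIN_THRESHOLD spins,
	 * try kicking the CPU believed to hold the lock, and finally halt.
	 */
	static void queue_head_wait(_Atomic int *locked, int holder_cpu)
	{
		for (unsigned int i = 0; i < 2 * QSPIN_THRESHOLD; i++) {
			if (!atomic_load(locked))
				return;		/* lock was released */
			if (i == QSPIN_THRESHOLD && holder_cpu >= 0)
				pv_kick_cpu(holder_cpu);
		}
		pv_halt_self();			/* wait to be kicked by the unlocker */
	}

Whether the pv_kick_cpu() at the halfway point buys anything is exactly
the question here.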

> Enabling the PV code does have a performance impact on spinlock
> acquisitions and releases. The following table shows the execution
> time (in ms) of a spinlock micro-benchmark that does lock/unlock
> operations 5M times for each task versus the number of contending
> tasks on a Westmere-EX system.
> 
>   # of       Ticket lock             Queue lock
>   tasks   PV off/PV on/%Change   PV off/PV on/%Change
>   ------  --------------------   ---------------------
>     1        135/  179/+33%        137/  169/+23%
>     2       1045/ 1103/ +6%       1120/ 1536/+37%
>     3       1827/ 2683/+47%       2313/ 2425/ +5%
>     4       2689/ 4191/+56%       2914/ 3128/ +7%
>     5       3736/ 5830/+56%       3715/ 3762/ +1%
>     6       4942/ 7609/+54%       4504/ 4558/ +2%
>     7       6304/ 9570/+52%       5292/ 5351/ +1%
>     8       7736/11323/+46%       6037/ 6097/ +1%

Do you have measurements from tests when VCPUs are overcommitted?

> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +/**
> + * queue_spin_unlock_slowpath - kick up the CPU of the queue head
> + * @lock : Pointer to queue spinlock structure
> + *
> + * The lock is released after finding the queue head to avoid racing
> + * condition between the queue head and the lock holder.
> + */
> +void queue_spin_unlock_slowpath(struct qspinlock *lock)
> +{
> +	struct qnode *node, *prev;
> +	u32 qcode = (u32)queue_get_qcode(lock);
> +
> +	/*
> +	 * Get the queue tail node
> +	 */
> +	node = xlate_qcode(qcode);
> +
> +	/*
> +	 * Locate the queue head node by following the prev pointer from
> +	 * tail to head.
> +	 * It is assumed that the PV guests won't have that many CPUs so
> +	 * that it won't take a long time to follow the pointers.

This isn't a valid assumption, but this isn't that different from the
search done in the ticket slow unlock path so I guess it's ok.

David

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
@ 2014-03-13 13:16       ` Paolo Bonzini
  0 siblings, 0 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-13 13:16 UTC (permalink / raw)
  To: David Vrabel, Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Gleb Natapov,
	Peter Zijlstra, virtualization, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Alok Kataria, linux-arch, kvm, x86,
	Ingo Molnar, xen-devel, Paul E. McKenney, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Scott J Norton,
	Steven Rostedt, Chris Wright, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu Vinod

On 13/03/2014 11:54, David Vrabel wrote:
> On 12/03/14 18:54, Waiman Long wrote:
>> Locking is always an issue in a virtualized environment as the virtual
>> CPU that is waiting on a lock may get scheduled out and hence block
>> any progress in lock acquisition even when the lock has been freed.
>>
>> One solution to this problem is to allow unfair locks in a
>> para-virtualized environment. In this case, a new lock acquirer can
>> come and steal the lock if the next-in-line CPU to get the lock is
>> scheduled out. An unfair lock in a native environment is generally not
>> a good idea, as there is a possibility of lock starvation for a heavily
>> contended lock.
>
> I do not think this is a good idea -- the problems with unfair locks are
> worse in a virtualized guest.  If a waiting VCPU deschedules and has to
> be kicked to grab a lock then it is very likely to lose a race with
> another running VCPU trying to take a lock (since it takes time for the
> VCPU to be rescheduled).

Actually, I think the unfair version should be automatically selected if 
running on a hypervisor.  Per-hypervisor pvops can choose to enable the 
fair one.

Lock unfairness may be particularly evident on a virtualized guest when 
the host is overcommitted, but problems with fair locks are even worse.

In fact, RHEL/CentOS 6 already uses unfair locks if 
X86_FEATURE_HYPERVISOR is set.  The patch was rejected upstream in favor 
of pv ticketlocks, but pv ticketlocks do not cover all hypervisors so 
perhaps we could revisit that choice.
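
The X86_FEATURE_HYPERVISOR test is simply CPUID leaf 1, ECX bit 31, which
hypervisors such as KVM and Xen HVM set for their guests.  A small
userspace probe of the same bit (the lock selection a kernel would do on
top of it is not shown and is hypothetical):

	#include <cpuid.h>
	#include <stdbool.h>
	#include <stdio.h>

	/* CPUID.1:ECX[31] is the "running under a hypervisor" bit. */
	static bool running_on_hypervisor(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
			return false;
		return ecx & (1U << 31);
	}

	int main(void)
	{
		/* A kernel would pick its spinlock implementation here instead. */
		printf("hypervisor bit: %s\n",
		       running_on_hypervisor() ? "set" : "clear");
		return 0;
	}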

Measurements were done by Gleb for two guests running 2.6.32 with 16 
vcpus each, on a 16-core system.  One guest ran with unfair locks, one 
guest ran with fair locks.  Two kernel compilations ("time make -j 16 
all") were started at the same time on both guests, and times were as 
follows:

     unfair:                         fair:
     real 13m34.674s                 real 19m35.827s
     user 96m2.638s                  user 102m38.665s
     sys 56m14.991s                  sys 158m22.470s

     real 13m3.768s                  real 19m4.375s
     user 95m34.509s                 user 111m9.903s
     sys 53m40.550s                  sys 141m59.370s

Actually, interpreting the numbers shows an even worse slowdown.

Compilation took ~6.5 minutes in a guest when the host was not 
overcommitted, and with unfair locks everything scaled just fine.

Ticketlocks fell completely apart; during the first 13 minutes they were 
allotted 16*6.5=104 minutes of CPU time, and they spent almost all of it 
spinning in the kernel (102 minutes in the first run).  They did perhaps 
30 seconds worth of work because, as soon as the unfair-lock guest 
finished and the host was no longer overcommitted, compilation finished 
in 6 minutes.

So that's approximately a 12x slowdown from using non-PV fair locks (vs.
unfair locks) on a 200%-overcommitted host.

Paolo

>> With unfair locking activated on a bare-metal 4-socket Westmere-EX
>> box, the execution times (in ms) of a spinlock micro-benchmark were
>> as follows:
>>
>>   # of    Ticket       Fair       Unfair
>>   tasks    lock     queue lock   queue lock
>>   ------  -------   ----------   ----------
>>     1        135         135          137
>>     2       1045        1120          747
>>     3       1827        2345         1084
>>     4       2689        2934         1438
>>     5       3736        3658         1722
>>     6       4942        4434         2092
>>     7       6304        5176         2245
>>     8       7736        5955         2388
>
> Are these figures with or without the later PV support patches?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks
  2014-03-12 19:08     ` Waiman Long
@ 2014-03-13 13:57       ` Peter Zijlstra
  -1 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2014-03-13 13:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On Wed, Mar 12, 2014 at 03:08:24PM -0400, Waiman Long wrote:
> On 03/12/2014 02:54 PM, Waiman Long wrote:
> >+		/*
> >+		 * Set the lock bit & clear the waiting bit simultaneously.
> >+		 * It is assumed that there is no lock stealing with this
> >+		 * quick path active.
> >+		 *
> >+		 * A direct memory store of _QSPINLOCK_LOCKED into the
> >+		 * lock_wait field causes problem with the lockref code, e.g.
> >+		 *   ACCESS_ONCE(qlock->lock_wait) = _QSPINLOCK_LOCKED;
> >+		 *
> >+		 * It is not currently clear why this happens. A workaround
> >+		 * is to use atomic instruction to store the new value.
> >+		 */
> >+		{
> >+			u16 lw = xchg(&qlock->lock_wait, _QSPINLOCK_LOCKED);
> >+			BUG_ON(lw != _QSPINLOCK_WAITING);
> >+		}

> It was found that when I used a direct memory store instead of an atomic op,
> the following kernel crash might happen at filesystem dismount time:
> 
> [ 1529.936714] Call Trace:
> [ 1529.936714]  [<ffffffff811c2d03>] d_walk+0xc3/0x260
> [ 1529.936714]  [<ffffffff811c1770>] ? check_and_collect+0x30/0x30
> [ 1529.936714]  [<ffffffff811c3985>] shrink_dcache_for_umount+0x75/0x120
> [ 1529.936714]  [<ffffffff811adf21>] generic_shutdown_super+0x21/0xf0
> [ 1529.936714]  [<ffffffff811ae207>] kill_block_super+0x27/0x70
> [ 1529.936714]  [<ffffffff811ae4ed>] deactivate_locked_super+0x3d/0x60
> [ 1529.936714]  [<ffffffff811aea96>] deactivate_super+0x46/0x60
> [ 1529.936714]  [<ffffffff811ca277>] mntput_no_expire+0xa7/0x140
> [ 1529.936714]  [<ffffffff811cb6ce>] SyS_umount+0x8e/0x100
> [ 1529.936714]  [<ffffffff815d2c29>] system_call_fastpath+0x16/0x1b

> It was more readily reproducible in a KVM guest. It was harder to reproduce
> in a bare metal machine, but kernel crash still happened after several
> tries.
> 
> I am not sure what exactly causes this crash, but it likely has something to
> do with the interaction between the lockref and the qspinlock code. I would
> like more eyes on that to find the root cause of it.

I cannot reproduce with my series that has the one word write.

What I did was I made my swap partition (who needs that anyway on a
machine with 16G of memory) into an XFS partition.

Then I copied my linux.git onto it and unmounted.

I'll try a few more times; the above trace seems to suggest it happens
during dcache cleanup, so I suppose I should read the filesystem some
and unmount again.

Is there anything specific you did to make it go bang?
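
Roughly, the two variants being compared are a plain store versus an
atomic exchange of the 16-bit lock_wait field.  In a userspace model with
C11 atomics (the layout follows the quoted snippet but is an assumption,
not the patch's code):

	#include <assert.h>
	#include <stdatomic.h>
	#include <stdint.h>

	#define _QSPINLOCK_LOCKED   1U
	#define _QSPINLOCK_WAITING  2U

	struct qlock_model {
		_Atomic uint16_t lock_wait;	/* locked byte + waiting byte */
	};

	/* Variant 1: plain store, the version that triggered the crash. */
	static void set_locked_store(struct qlock_model *q)
	{
		atomic_store_explicit(&q->lock_wait, _QSPINLOCK_LOCKED,
				      memory_order_relaxed);
	}

	/* Variant 2: the xchg workaround, a full read-modify-write. */
	static void set_locked_xchg(struct qlock_model *q)
	{
		uint16_t old = atomic_exchange(&q->lock_wait, _QSPINLOCK_LOCKED);

		assert(old == _QSPINLOCK_WAITING);
	}

The practical difference is that the exchange is a locked read-modify-write
with full ordering on x86 while the plain store is not; whether that is
what matters for the lockref interaction is exactly the open question in
this sub-thread.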

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
@ 2014-03-13 13:57       ` Paolo Bonzini
  0 siblings, 0 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-13 13:57 UTC (permalink / raw)
  To: David Vrabel, Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On 13/03/2014 12:21, David Vrabel wrote:
> On 12/03/14 18:54, Waiman Long wrote:
>> This patch adds para-virtualization support to the queue spinlock in
>> the same way as was done in the PV ticket lock code. In essence, a
>> lock waiter will spin for a specified number of iterations
>> (QSPIN_THRESHOLD = 2^14) and then halt itself. The queue head waiter
>> will spin 2*QSPIN_THRESHOLD times before halting itself. Once it has
>> spun QSPIN_THRESHOLD times, the queue head will assume that the lock
>> holder may have been scheduled out and attempt to kick the lock
>> holder's CPU if it has that CPU number on hand.
>
> I don't really understand the reasoning for kicking the lock holder.

I agree.  If the lock holder isn't running, there's probably a good 
reason for that and going to sleep will not necessarily convince the 
scheduler to give more CPU to the lock holder.  I think there are two 
choices:

1) use yield_to to donate part of the waiter's quantum to the lock 
holder?    For this we probably need a new, separate hypercall 
interface.  For KVM it would be the same as hlt in the guest but with an 
additional yield_to in the host.

2) do nothing, just go to sleep.

Could you get (or do you have) numbers for (2)?

More important, I think a barrier is missing:

	Lock holder ---------------------------------------

	// queue_spin_unlock
	barrier();
	ACCESS_ONCE(qlock->lock) = 0;
	barrier();

	// pv_kick_node:
	if (pv->cpustate != PV_CPU_HALTED)
		return;
	ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
	__queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);

		Waiter -------------------------------------------

		// pv_head_spin_check
		ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
		lockval = cmpxchg(&qlock->lock,
				  _QSPINLOCK_LOCKED,
				  _QSPINLOCK_LOCKED_SLOWPATH);
		if (lockval == 0) {
			/*
			 * Can exit now as the lock is free
			 */
			ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
			*count = 0;
			return;
		}
		__queue_hibernate();

Nothing orders the write of qlock->lock against the read of pv->cpustate,
so you can get this:

	Lock holder			Waiter
	---------------------------------------------------------------
	read pv->cpustate
		(it is PV_CPU_ACTIVE)
					pv->cpustate = PV_CPU_HALTED
					lockval = cmpxchg(...)
					hibernate()
	qlock->lock = 0
	if (pv->cpustate != PV_CPU_HALTED)
		return;

I think you need this:

-	if (pv->cpustate != PV_CPU_HALTED)
-		return;
-	ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
+	if (cmpxchg(pv->cpustate, PV_CPU_HALTED, PV_CPU_KICKED)
+			!= PV_CPU_HALTED)
+		return;
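
On x86 the cmpxchg is also a full barrier, so the unlocker's read of
pv->cpustate can no longer be satisfied before its earlier store to
qlock->lock is visible, and the halted/kicked hand-off becomes a single
atomic decision.  A tiny userspace model of just that hand-off (the state
names follow the pseudo-code above; the rest is only a sketch, not the
patch's code):

	#include <stdatomic.h>
	#include <stdbool.h>

	enum pv_cpustate { PV_CPU_ACTIVE, PV_CPU_HALTED, PV_CPU_KICKED };

	struct pv_node_model {
		_Atomic int cpustate;
	};

	/*
	 * Unlock side: kick only if we atomically observe HALTED and claim
	 * the transition to KICKED.  Returning true means the caller owns
	 * the wakeup; false means the waiter never went to sleep or was
	 * already kicked by someone else.
	 */
	static bool pv_kick_node_model(struct pv_node_model *pv)
	{
		int expected = PV_CPU_HALTED;

		return atomic_compare_exchange_strong(&pv->cpustate, &expected,
						      PV_CPU_KICKED);
	}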

Paolo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM
  2014-03-12 18:54 ` Waiman Long
@ 2014-03-13 13:59     ` Paolo Bonzini
  2014-03-13 13:59   ` Paolo Bonzini
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-13 13:59 UTC (permalink / raw)
  To: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Arnd Bergmann, Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	linux-kernel, David Vrabel, Andrew Morton

On 12/03/2014 19:54, Waiman Long wrote:
> @@ -807,8 +889,13 @@ void __init kvm_spinlock_init(void)
>  	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
>  		return;
>
> +#ifdef CONFIG_QUEUE_SPINLOCK
> +	pv_lock_ops.kick_cpu = kvm_kick_cpu_type;
> +	pv_lock_ops.hibernate = kvm_hibernate;
> +#else
>  	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
>  	pv_lock_ops.unlock_kick = kvm_unlock_kick;
> +#endif

This should also disable the unfair path.

Paolo


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-12 18:54 ` [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest Waiman Long
@ 2014-03-13 15:15     ` Peter Zijlstra
  2014-03-13 10:54     ` David Vrabel
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2014-03-13 15:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote:
> +static inline void arch_spin_lock(struct qspinlock *lock)
> +{
> +	if (static_key_false(&paravirt_unfairlocks_enabled))
> +		queue_spin_lock_unfair(lock);
> +	else
> +		queue_spin_lock(lock);
> +}

So I would have expected something like:

	if (static_key_false(&paravirt_spinlock)) {
		while (!queue_spin_trylock(lock))
			cpu_relax();
		return;
	}

At the top of queue_spin_lock_slowpath().
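
In other words, something along these lines (just a sketch of the idea;
the slowpath signature and the paravirt_spinlock key are assumptions,
not taken from the patch):

	void queue_spin_lock_slowpath(struct qspinlock *lock)
	{
		if (static_key_false(&paravirt_spinlock)) {
			/*
			 * In a PV guest, degrade to a simple
			 * test-and-set style lock: keep retrying the
			 * trylock instead of queueing, so a preempted
			 * queue waiter can never block later acquirers.
			 */
			while (!queue_spin_trylock(lock))
				cpu_relax();
			return;
		}

		/* ... the normal queueing slowpath follows here ... */
	}

That keeps the whole PV decision inside the slowpath instead of growing
arch_spin_lock() with more variants.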

> +static inline int arch_spin_trylock(struct qspinlock *lock)
> +{
> +	if (static_key_false(&paravirt_unfairlocks_enabled))
> +		return queue_spin_trylock_unfair(lock);
> +	else
> +		return queue_spin_trylock(lock);
> +}

That just doesn't make any kind of sense; a trylock cannot be fair or
unfair.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM
  2014-03-12 18:54 ` Waiman Long
@ 2014-03-13 15:25     ` Peter Zijlstra
  2014-03-13 13:59   ` Paolo Bonzini
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2014-03-13 15:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On Wed, Mar 12, 2014 at 02:54:57PM -0400, Waiman Long wrote:
> A KVM guest of 20 CPU cores was created to run the disk workload of
> the AIM7 benchmark on both ext4 and xfs RAM disks at 3000 users on a
> 3.14-rc6 based kernel. The JPM (jobs/minute) data of the test run were:

You really should just delete that aim7 crap. A benchmark that runs for
hours is _NOT_ usable.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-13 10:54     ` David Vrabel
@ 2014-03-13 19:03       ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-13 19:03 UTC (permalink / raw)
  To: David Vrabel
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod

On 03/13/2014 06:54 AM, David Vrabel wrote:
> On 12/03/14 18:54, Waiman Long wrote:
>> Locking is always an issue in a virtualized environment as the virtual
>> CPU that is waiting on a lock may get scheduled out and hence block
>> any progress in lock acquisition even when the lock has been freed.
>>
>> One solution to this problem is to allow unfair lock in a
>> para-virtualized environment. In this case, a new lock acquirer can
>> come and steal the lock if the next-in-line CPU to get the lock is
>> scheduled out. Unfair lock in a native environment is generally not a
>> good idea as there is a possibility of lock starvation for a heavily
>> contended lock.
> I do not think this is a good idea -- the problems with unfair locks are
> worse in a virtualized guest.  If a waiting VCPU deschedules and has to
> be kicked to grab a lock then it is very likely to lose a race with
> another running VCPU trying to take a lock (since it takes time for the
> VCPU to be rescheduled).

I have seen figures suggesting that it takes about 1000 cycles to kick 
a CPU (roughly 300-500 ns at 2-3 GHz). As long as the critical section 
isn't that long, there is enough time for a lock stealer to come in, 
grab the lock, do whatever it needs to do and leave without adding too 
much latency for the CPU being kicked.

Anyway, there are people who ask for unfair locks. In fact, RHEL6 ships 
its virtual guests with an unfair lock. So I provide an option for those 
who want an unfair lock to enable it in their virtual guests. For those 
who don't want it, it can always be turned off when building the kernel.

>> With the unfair locking activated on bare metal 4-socket Westmere-EX
>> box, the execution times (in ms) of a spinlock micro-benchmark were
>> as follows:
>>
>>    # of    Ticket       Fair         Unfair
>>    tasks    lock     queue lock    queue lock
>>    ------  -------   ----------    ----------
>>      1       135        135           137
>>      2      1045       1120           747
>>      3      1827       2345          1084
>>      4      2689       2934          1438
>>      5      3736       3658          1722
>>      6      4942       4434          2092
>>      7      6304       5176          2245
>>      8      7736       5955          2388
> Are these figures with or without the later PV support patches?

This is without the PV patch.

Regards,
Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-03-13 11:21     ` David Vrabel
@ 2014-03-13 19:05       ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-13 19:05 UTC (permalink / raw)
  To: David Vrabel
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod

On 03/13/2014 07:21 AM, David Vrabel wrote:
> On 12/03/14 18:54, Waiman Long wrote:
>> This patch adds para-virtualization support to the queue spinlock in
>> the same way as was done in the PV ticket lock code. In essence, the
>> lock waiters will spin for a specified number of times (QSPIN_THRESHOLD
>> = 2^14) and then halt themselves. The queue head waiter will spin
>> 2*QSPIN_THRESHOLD times before halting itself. When it has spun
>> QSPIN_THRESHOLD times, the queue head will assume that the lock
>> holder may be scheduled out and attempt to kick the lock holder CPU
>> if it has the CPU number on hand.
> I don't really understand the reasoning for kicking the lock holder.  It
> will either be: running, runnable, or halted because it's in a slow path
> wait for another lock.  In any of these states I do not see how a kick
> is useful.

You may be right. I can certainly take this part out of the patch if 
people don't think that is useful.

>> Enabling the PV code does have a performance impact on spinlock
>> acquisitions and releases. The following table shows the execution
>> time (in ms) of a spinlock micro-benchmark that does lock/unlock
>> operations 5M times for each task versus the number of contending
>> tasks on a Westmere-EX system.
>>
>>    # of        Ticket lock             Queue lock
>>    tasks   PV off/PV on/%Change   PV off/PV on/%Change
>>    ------  --------------------   ---------------------
>>      1       135/  179/+33%         137/  169/+23%
>>      2      1045/ 1103/ +6%        1120/ 1536/+37%
>>      3      1827/ 2683/+47%        2313/ 2425/ +5%
>>      4      2689/ 4191/+56%        2914/ 3128/ +7%
>>      5      3736/ 5830/+56%        3715/ 3762/ +1%
>>      6      4942/ 7609/+54%        4504/ 4558/ +2%
>>      7      6304/ 9570/+52%        5292/ 5351/ +1%
>>      8      7736/11323/+46%        6037/ 6097/ +1%
> Do you have measurements from tests when VCPUs are overcommitted?

I don't have a measurement with overcommitted guests yet. I will set up 
such an environment and do some tests on it.

>> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
>> +/**
>> + * queue_spin_unlock_slowpath - kick up the CPU of the queue head
>> + * @lock : Pointer to queue spinlock structure
>> + *
>> + * The lock is released after finding the queue head to avoid racing
>> + * condition between the queue head and the lock holder.
>> + */
>> +void queue_spin_unlock_slowpath(struct qspinlock *lock)
>> +{
>> +	struct qnode *node, *prev;
>> +	u32 qcode = (u32)queue_get_qcode(lock);
>> +
>> +	/*
>> +	 * Get the queue tail node
>> +	 */
>> +	node = xlate_qcode(qcode);
>> +
>> +	/*
>> +	 * Locate the queue head node by following the prev pointer from
>> +	 * tail to head.
>> +	 * It is assumed that the PV guests won't have that many CPUs so
>> +	 * that it won't take a long time to follow the pointers.
> This isn't a valid assumption, but this isn't that different from the
> search done in the ticket slow unlock path so I guess it's ok.
>
> David

I will change that to say that in most cases, the queue length will be 
short.

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM
  2014-03-13 13:59     ` Paolo Bonzini
  (?)
  (?)
@ 2014-03-13 19:13     ` Waiman Long
  2014-03-14  8:42       ` Paolo Bonzini
  2014-03-14  8:42         ` Paolo Bonzini
  -1 siblings, 2 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-13 19:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Arnd Bergmann, Scott J Norton,
	Steven Rostedt, Chris Wright, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu Vinod, Boris Ostrovsky,
	Oleg Nesterov, linux-kernel

On 03/13/2014 09:59 AM, Paolo Bonzini wrote:
> On 12/03/2014 19:54, Waiman Long wrote:
>> @@ -807,8 +889,13 @@ void __init kvm_spinlock_init(void)
>>      if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
>>          return;
>>
>> +#ifdef CONFIG_QUEUE_SPINLOCK
>> +    pv_lock_ops.kick_cpu = kvm_kick_cpu_type;
>> +    pv_lock_ops.hibernate = kvm_hibernate;
>> +#else
>>      pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
>>      pv_lock_ops.unlock_kick = kvm_unlock_kick;
>> +#endif
>
> This should also disable the unfair path.
>
> Paolo
>

The unfair lock uses a different jump label and does not require any 
special PV ops. There is a separate init function for that.

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
@ 2014-03-13 19:49         ` Waiman Long
  0 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-13 19:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Vrabel, Jeremy Fitzhardinge, Raghavendra K T, kvm,
	Peter Zijlstra, virtualization, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Thomas Gleixner, linux-arch, Gleb Natapov,
	x86, Ingo Molnar, xen-devel, Paul E. McKenney, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Scott J Norton,
	Steven Rostedt, Chris Wright, Oleg Nesterov, Alok Kataria, Aswin

On 03/13/2014 09:57 AM, Paolo Bonzini wrote:
> On 13/03/2014 12:21, David Vrabel wrote:
>> On 12/03/14 18:54, Waiman Long wrote:
>>> This patch adds para-virtualization support to the queue spinlock in
>>> the same way as was done in the PV ticket lock code. In essence, the
>>> lock waiters will spin for a specified number of times (QSPIN_THRESHOLD
>>> = 2^14) and then halt themselves. The queue head waiter will spin
>>> 2*QSPIN_THRESHOLD times before halting itself. When it has spun
>>> QSPIN_THRESHOLD times, the queue head will assume that the lock
>>> holder may be scheduled out and attempt to kick the lock holder CPU
>>> if it has the CPU number on hand.
>>
>> I don't really understand the reasoning for kicking the lock holder.
>
> I agree.  If the lock holder isn't running, there's probably a good 
> reason for that and going to sleep will not necessarily convince the 
> scheduler to give more CPU to the lock holder.  I think there are two 
> choices:
>
> 1) use yield_to to donate part of the waiter's quantum to the lock 
> holder?    For this we probably need a new, separate hypercall 
> interface.  For KVM it would be the same as hlt in the guest but with 
> an additional yield_to in the host.
>
> 2) do nothing, just go to sleep.
>
> Could you get (or do you have) numbers for (2)?

I will take out the lock holder kick portion from the patch. I will also 
try to collect more test data.

>
> More important, I think a barrier is missing:
>
>     Lock holder ---------------------------------------
>
>     // queue_spin_unlock
>     barrier();
>     ACCESS_ONCE(qlock->lock) = 0;
>     barrier();
>

This is not the unlock code that is used when PV spinlock is enabled. 
The right unlock code is

         if (static_key_false(&paravirt_spinlocks_enabled)) {
                 /*
                  * Need to atomically clear the lock byte to avoid racing
                  * with the queue head waiter trying to set
                  * _QSPINLOCK_LOCKED_SLOWPATH.
                  */
                 if (likely(cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED, 0)
                                 == _QSPINLOCK_LOCKED))
                         return;
                 else
                         queue_spin_unlock_slowpath(lock);

         } else {
                 __queue_spin_unlock(lock);
         }

>     // pv_kick_node:
>     if (pv->cpustate != PV_CPU_HALTED)
>         return;
>     ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
>     __queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);
>
>         Waiter -------------------------------------------
>
>         // pv_head_spin_check
>         ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
>         lockval = cmpxchg(&qlock->lock,
>                   _QSPINLOCK_LOCKED,
>                   _QSPINLOCK_LOCKED_SLOWPATH);
>         if (lockval == 0) {
>             /*
>              * Can exit now as the lock is free
>              */
>             ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
>             *count = 0;
>             return;
>         }
>         __queue_hibernate();
>
> Nothing protects from writing qlock->lock before pv->cpustate is read, 
> leading to this:
>
>     Lock holder            Waiter
>     ---------------------------------------------------------------
>     read pv->cpustate
>         (it is PV_CPU_ACTIVE)
>                     pv->cpustate = PV_CPU_HALTED
>                     lockval = cmpxchg(...)
>                     hibernate()
>     qlock->lock = 0
>     if (pv->cpustate != PV_CPU_HALTED)
>         return;
>

The lock holder will read cpustate only if the lock byte has been 
changed to _QSPINLOCK_LOCKED_SLOWPATH, so the setting of the lock byte 
synchronizes the two threads. The only thing I am not certain about is 
the case where the waiter is trying to go to sleep while, at the same 
time, the lock holder is trying to kick it. Will there be a missed 
wakeup because of this timing issue?

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
@ 2014-03-13 19:49         ` Waiman Long
  0 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-13 19:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Vrabel, Jeremy Fitzhardinge, Raghavendra K T, kvm,
	Peter Zijlstra, virtualization, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Thomas Gleixner, linux-arch, Gleb Natapov,
	x86, Ingo Molnar, xen-devel, Paul E. McKenney, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Scott J Norton,
	Steven Rostedt, Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod

On 03/13/2014 09:57 AM, Paolo Bonzini wrote:
> Il 13/03/2014 12:21, David Vrabel ha scritto:
>> On 12/03/14 18:54, Waiman Long wrote:
>>> This patch adds para-virtualization support to the queue spinlock in
>>> the same way as was done in the PV ticket lock code. In essence, the
>>> lock waiters will spin for a specified number of times (QSPIN_THRESHOLD
>>> = 2^14) and then halted itself. The queue head waiter will spins
>>> 2*QSPIN_THRESHOLD times before halting itself. When it has spinned
>>> QSPIN_THRESHOLD times, the queue head will assume that the lock
>>> holder may be scheduled out and attempt to kick the lock holder CPU
>>> if it has the CPU number on hand.
>>
>> I don't really understand the reasoning for kicking the lock holder.
>
> I agree.  If the lock holder isn't running, there's probably a good 
> reason for that and going to sleep will not necessarily convince the 
> scheduler to give more CPU to the lock holder.  I think there are two 
> choices:
>
> 1) use yield_to to donate part of the waiter's quantum to the lock 
> holder?    For this we probably need a new, separate hypercall 
> interface.  For KVM it would be the same as hlt in the guest but with 
> an additional yield_to in the host.
>
> 2) do nothing, just go to sleep.
>
> Could you get (or do you have) numbers for (2)?

I will take out the lock holder kick portion from the patch. I will also 
try to collect more test data.

>
> More important, I think a barrier is missing:
>
>     Lock holder ---------------------------------------
>
>     // queue_spin_unlock
>     barrier();
>     ACCESS_ONCE(qlock->lock) = 0;
>     barrier();
>

This is not the unlock code that is used when PV spinlock is enabled. 
The right unlock code is

         if (static_key_false(&paravirt_spinlocks_enabled)) {
                 /*
                  * Need to atomically clear the lock byte to avoid 
racing with
                  * queue head waiter trying to set 
_QSPINLOCK_LOCKED_SLOWPATH.
                  */
                 if (likely(cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED, 0)
                                 == _QSPINLOCK_LOCKED))
                         return;
                 else
                         queue_spin_unlock_slowpath(lock);

         } else {
                 __queue_spin_unlock(lock);
         }

>     // pv_kick_node:
>     if (pv->cpustate != PV_CPU_HALTED)
>         return;
>     ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
>     __queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);
>
>         Waiter -------------------------------------------
>
>         // pv_head_spin_check
>         ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
>         lockval = cmpxchg(&qlock->lock,
>                   _QSPINLOCK_LOCKED,
>                   _QSPINLOCK_LOCKED_SLOWPATH);
>         if (lockval == 0) {
>             /*
>              * Can exit now as the lock is free
>              */
>             ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
>             *count = 0;
>             return;
>         }
>         __queue_hibernate();
>
> Nothing protects from writing qlock->lock before pv->cpustate is read, 
> leading to this:
>
>     Lock holder            Waiter
>     ---------------------------------------------------------------
>     read pv->cpustate
>         (it is PV_CPU_ACTIVE)
>                     pv->cpustate = PV_CPU_HALTED
>                     lockval = cmpxchg(...)
>                     hibernate()
>     qlock->lock = 0
>     if (pv->cpustate != PV_CPU_HALTED)
>         return;
>

The lock holder will read cpustate only if the lock byte has been 
changed to _QSPINLOCK_LOCKED_SLOWPATH. So the setting of the lock byte 
synchronize the 2 threads. The only thing that I am not certain is when 
the waiter is trying to go to sleep while, at the same time, the lock 
holder is trying to kick it. Will there be a missed wakeup because of 
this timing issue?

-Longman


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
@ 2014-03-13 19:49         ` Waiman Long
  0 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-13 19:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Vrabel, Jeremy Fitzhardinge, Raghavendra K T, kvm,
	Peter Zijlstra, virtualization, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Thomas Gleixner, linux-arch, Gleb Natapov,
	x86, Ingo Molnar, xen-devel, Paul E. McKenney, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Scott J Norton,
	Steven Rostedt, Chris Wright, Oleg Nesterov, Alok Kataria, Aswin

On 03/13/2014 09:57 AM, Paolo Bonzini wrote:
> Il 13/03/2014 12:21, David Vrabel ha scritto:
>> On 12/03/14 18:54, Waiman Long wrote:
>>> This patch adds para-virtualization support to the queue spinlock in
>>> the same way as was done in the PV ticket lock code. In essence, the
>>> lock waiters will spin for a specified number of times (QSPIN_THRESHOLD
>>> = 2^14) and then halted itself. The queue head waiter will spins
>>> 2*QSPIN_THRESHOLD times before halting itself. When it has spinned
>>> QSPIN_THRESHOLD times, the queue head will assume that the lock
>>> holder may be scheduled out and attempt to kick the lock holder CPU
>>> if it has the CPU number on hand.
>>
>> I don't really understand the reasoning for kicking the lock holder.
>
> I agree.  If the lock holder isn't running, there's probably a good 
> reason for that and going to sleep will not necessarily convince the 
> scheduler to give more CPU to the lock holder.  I think there are two 
> choices:
>
> 1) use yield_to to donate part of the waiter's quantum to the lock 
> holder?    For this we probably need a new, separate hypercall 
> interface.  For KVM it would be the same as hlt in the guest but with 
> an additional yield_to in the host.
>
> 2) do nothing, just go to sleep.
>
> Could you get (or do you have) numbers for (2)?

I will take out the lock holder kick portion from the patch. I will also 
try to collect more test data.

>
> More important, I think a barrier is missing:
>
>     Lock holder ---------------------------------------
>
>     // queue_spin_unlock
>     barrier();
>     ACCESS_ONCE(qlock->lock) = 0;
>     barrier();
>

This is not the unlock code that is used when PV spinlock is enabled. 
The right unlock code is

         if (static_key_false(&paravirt_spinlocks_enabled)) {
                 /*
                  * Need to atomically clear the lock byte to avoid 
racing with
                  * queue head waiter trying to set 
_QSPINLOCK_LOCKED_SLOWPATH.
                  */
                 if (likely(cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED, 0)
                                 == _QSPINLOCK_LOCKED))
                         return;
                 else
                         queue_spin_unlock_slowpath(lock);

         } else {
                 __queue_spin_unlock(lock);
         }

>     // pv_kick_node:
>     if (pv->cpustate != PV_CPU_HALTED)
>         return;
>     ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
>     __queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);
>
>         Waiter -------------------------------------------
>
>         // pv_head_spin_check
>         ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
>         lockval = cmpxchg(&qlock->lock,
>                   _QSPINLOCK_LOCKED,
>                   _QSPINLOCK_LOCKED_SLOWPATH);
>         if (lockval == 0) {
>             /*
>              * Can exit now as the lock is free
>              */
>             ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
>             *count = 0;
>             return;
>         }
>         __queue_hibernate();
>
> Nothing protects from writing qlock->lock before pv->cpustate is read, 
> leading to this:
>
>     Lock holder            Waiter
>     ---------------------------------------------------------------
>     read pv->cpustate
>         (it is PV_CPU_ACTIVE)
>                     pv->cpustate = PV_CPU_HALTED
>                     lockval = cmpxchg(...)
>                     hibernate()
>     qlock->lock = 0
>     if (pv->cpustate != PV_CPU_HALTED)
>         return;
>

The lock holder will read cpustate only if the lock byte has been 
changed to _QSPINLOCK_LOCKED_SLOWPATH. So the setting of the lock byte 
synchronize the 2 threads. The only thing that I am not certain is when 
the waiter is trying to go to sleep while, at the same time, the lock 
holder is trying to kick it. Will there be a missed wakeup because of 
this timing issue?

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-03-13 13:57       ` Paolo Bonzini
                         ` (2 preceding siblings ...)
  (?)
@ 2014-03-13 19:49       ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-13 19:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Gleb Natapov,
	Peter Zijlstra, virtualization, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Alok Kataria, linux-arch, kvm, x86,
	Ingo Molnar, xen-devel, Paul E. McKenney, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Scott J Norton,
	Steven Rostedt, Chris Wright, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu Vinod

On 03/13/2014 09:57 AM, Paolo Bonzini wrote:
> Il 13/03/2014 12:21, David Vrabel ha scritto:
>> On 12/03/14 18:54, Waiman Long wrote:
>>> This patch adds para-virtualization support to the queue spinlock in
>>> the same way as was done in the PV ticket lock code. In essence, the
>>> lock waiters will spin for a specified number of times (QSPIN_THRESHOLD
>>> = 2^14) and then halted itself. The queue head waiter will spins
>>> 2*QSPIN_THRESHOLD times before halting itself. When it has spinned
>>> QSPIN_THRESHOLD times, the queue head will assume that the lock
>>> holder may be scheduled out and attempt to kick the lock holder CPU
>>> if it has the CPU number on hand.
>>
>> I don't really understand the reasoning for kicking the lock holder.
>
> I agree.  If the lock holder isn't running, there's probably a good 
> reason for that and going to sleep will not necessarily convince the 
> scheduler to give more CPU to the lock holder.  I think there are two 
> choices:
>
> 1) use yield_to to donate part of the waiter's quantum to the lock 
> holder?    For this we probably need a new, separate hypercall 
> interface.  For KVM it would be the same as hlt in the guest but with 
> an additional yield_to in the host.
>
> 2) do nothing, just go to sleep.
>
> Could you get (or do you have) numbers for (2)?

I will take out the lock holder kick portion from the patch. I will also 
try to collect more test data.

>
> More important, I think a barrier is missing:
>
>     Lock holder ---------------------------------------
>
>     // queue_spin_unlock
>     barrier();
>     ACCESS_ONCE(qlock->lock) = 0;
>     barrier();
>

This is not the unlock code that is used when PV spinlock is enabled. 
The right unlock code is

         if (static_key_false(&paravirt_spinlocks_enabled)) {
                 /*
                  * Need to atomically clear the lock byte to avoid 
racing with
                  * queue head waiter trying to set 
_QSPINLOCK_LOCKED_SLOWPATH.
                  */
                 if (likely(cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED, 0)
                                 == _QSPINLOCK_LOCKED))
                         return;
                 else
                         queue_spin_unlock_slowpath(lock);

         } else {
                 __queue_spin_unlock(lock);
         }

>     // pv_kick_node:
>     if (pv->cpustate != PV_CPU_HALTED)
>         return;
>     ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
>     __queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);
>
>         Waiter -------------------------------------------
>
>         // pv_head_spin_check
>         ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
>         lockval = cmpxchg(&qlock->lock,
>                   _QSPINLOCK_LOCKED,
>                   _QSPINLOCK_LOCKED_SLOWPATH);
>         if (lockval == 0) {
>             /*
>              * Can exit now as the lock is free
>              */
>             ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
>             *count = 0;
>             return;
>         }
>         __queue_hibernate();
>
> Nothing protects from writing qlock->lock before pv->cpustate is read, 
> leading to this:
>
>     Lock holder            Waiter
>     ---------------------------------------------------------------
>     read pv->cpustate
>         (it is PV_CPU_ACTIVE)
>                     pv->cpustate = PV_CPU_HALTED
>                     lockval = cmpxchg(...)
>                     hibernate()
>     qlock->lock = 0
>     if (pv->cpustate != PV_CPU_HALTED)
>         return;
>

The lock holder will read cpustate only if the lock byte has been 
changed to _QSPINLOCK_LOCKED_SLOWPATH. So the setting of the lock byte 
synchronize the 2 threads. The only thing that I am not certain is when 
the waiter is trying to go to sleep while, at the same time, the lock 
holder is trying to kick it. Will there be a missed wakeup because of 
this timing issue?

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-03-13 13:57       ` Paolo Bonzini
  (?)
  (?)
@ 2014-03-13 19:49       ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-13 19:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Gleb Natapov,
	Peter Zijlstra, virtualization, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Alok Kataria, linux-arch, kvm, x86,
	Ingo Molnar, xen-devel, Paul E. McKenney, Arnd Bergmann,
	Scott J Norton, Steven Rostedt, Chris Wright, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu Vinod, Oleg Nesterov,
	David Vrabel

On 03/13/2014 09:57 AM, Paolo Bonzini wrote:
> Il 13/03/2014 12:21, David Vrabel ha scritto:
>> On 12/03/14 18:54, Waiman Long wrote:
>>> This patch adds para-virtualization support to the queue spinlock in
>>> the same way as was done in the PV ticket lock code. In essence, the
>>> lock waiters will spin for a specified number of times (QSPIN_THRESHOLD
>>> = 2^14) and then halted itself. The queue head waiter will spins
>>> 2*QSPIN_THRESHOLD times before halting itself. When it has spinned
>>> QSPIN_THRESHOLD times, the queue head will assume that the lock
>>> holder may be scheduled out and attempt to kick the lock holder CPU
>>> if it has the CPU number on hand.
>>
>> I don't really understand the reasoning for kicking the lock holder.
>
> I agree.  If the lock holder isn't running, there's probably a good 
> reason for that and going to sleep will not necessarily convince the 
> scheduler to give more CPU to the lock holder.  I think there are two 
> choices:
>
> 1) use yield_to to donate part of the waiter's quantum to the lock 
> holder?    For this we probably need a new, separate hypercall 
> interface.  For KVM it would be the same as hlt in the guest but with 
> an additional yield_to in the host.
>
> 2) do nothing, just go to sleep.
>
> Could you get (or do you have) numbers for (2)?

I will take out the lock holder kick portion from the patch. I will also 
try to collect more test data.

>
> More important, I think a barrier is missing:
>
>     Lock holder ---------------------------------------
>
>     // queue_spin_unlock
>     barrier();
>     ACCESS_ONCE(qlock->lock) = 0;
>     barrier();
>

This is not the unlock code that is used when PV spinlock is enabled. 
The right unlock code is

         if (static_key_false(&paravirt_spinlocks_enabled)) {
                 /*
                  * Need to atomically clear the lock byte to avoid 
racing with
                  * queue head waiter trying to set 
_QSPINLOCK_LOCKED_SLOWPATH.
                  */
                 if (likely(cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED, 0)
                                 == _QSPINLOCK_LOCKED))
                         return;
                 else
                         queue_spin_unlock_slowpath(lock);

         } else {
                 __queue_spin_unlock(lock);
         }

>     // pv_kick_node:
>     if (pv->cpustate != PV_CPU_HALTED)
>         return;
>     ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED;
>     __queue_kick_cpu(pv->mycpu, PV_KICK_QUEUE_HEAD);
>
>         Waiter -------------------------------------------
>
>         // pv_head_spin_check
>         ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED;
>         lockval = cmpxchg(&qlock->lock,
>                   _QSPINLOCK_LOCKED,
>                   _QSPINLOCK_LOCKED_SLOWPATH);
>         if (lockval == 0) {
>             /*
>              * Can exit now as the lock is free
>              */
>             ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE;
>             *count = 0;
>             return;
>         }
>         __queue_hibernate();
>
> Nothing protects from writing qlock->lock before pv->cpustate is read, 
> leading to this:
>
>     Lock holder            Waiter
>     ---------------------------------------------------------------
>     read pv->cpustate
>         (it is PV_CPU_ACTIVE)
>                     pv->cpustate = PV_CPU_HALTED
>                     lockval = cmpxchg(...)
>                     hibernate()
>     qlock->lock = 0
>     if (pv->cpustate != PV_CPU_HALTED)
>         return;
>

The lock holder will read cpustate only if the lock byte has been 
changed to _QSPINLOCK_LOCKED_SLOWPATH. So the setting of the lock byte 
synchronizes the two threads. The only thing I am not certain about is 
the case where the waiter is trying to go to sleep while, at the same 
time, the lock holder is trying to kick it. Will there be a missed 
wakeup because of this timing issue?

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-13 15:15     ` Peter Zijlstra
@ 2014-03-13 20:05       ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-13 20:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On 03/13/2014 11:15 AM, Peter Zijlstra wrote:
> On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote:
>> +static inline void arch_spin_lock(struct qspinlock *lock)
>> +{
>> +	if (static_key_false(&paravirt_unfairlocks_enabled))
>> +		queue_spin_lock_unfair(lock);
>> +	else
>> +		queue_spin_lock(lock);
>> +}
> So I would have expected something like:
>
> 	if (static_key_false(&paravirt_spinlock)) {
> 		while (!queue_spin_trylock(lock))
> 			cpu_relax();
> 		return;
> 	}
>
> At the top of queue_spin_lock_slowpath().

I don't like the idea of constantly spinning on the lock. That can cause 
all sorts of performance issues. My version of the unfair lock tries to 
grab the lock regardless of whether there are others waiting in the queue 
or not. So instead of doing a cmpxchg of the whole 32-bit word, I just do 
a cmpxchg of the lock byte in the unfair version. A CPU has only one 
chance to steal the lock. If it can't, it will be lined up in the queue 
just like in the fair version. It is not as unfair as the other unfair 
locking schemes that spin on the lock repeatedly. So lock starvation 
should be less of a problem.

On the other hand, it may not perform as well as the other unfair 
locking schemes. It is a compromise to provide some lock unfairness 
without sacrificing the good cacheline behavior of the queue spinlock.

>> +static inline int arch_spin_trylock(struct qspinlock *lock)
>> +{
>> +	if (static_key_false(&paravirt_unfairlocks_enabled))
>> +		return queue_spin_trylock_unfair(lock);
>> +	else
>> +		return queue_spin_trylock(lock);
>> +}
> That just doesn't make any kind of sense; a trylock cannot be fair or
> unfair.

Because I use a different cmpxchg for the fair and unfair versions, I 
also need a different version for trylock.
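
As a rough illustration of that one-shot byte cmpxchg, here is a minimal 
sketch; the byte layout, the fallback call and the exact names are 
assumptions, not the code from the patch:

        static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
        {
                u8 *lbyte = (u8 *)lock;    /* assumes the low byte is the lock byte */

                /* one shot: steal only if the lock byte is currently clear */
                return cmpxchg(lbyte, (u8)0, (u8)_QSPINLOCK_LOCKED) == 0;
        }

        static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
        {
                if (likely(queue_spin_trylock_unfair(lock)))
                        return;             /* lock stolen, no queueing */
                queue_spin_lock(lock);      /* line up in the queue like the fair path */
        }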

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM
  2014-03-13 15:25     ` Peter Zijlstra
@ 2014-03-13 20:09       ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-13 20:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On 03/13/2014 11:25 AM, Peter Zijlstra wrote:
> On Wed, Mar 12, 2014 at 02:54:57PM -0400, Waiman Long wrote:
>> A KVM guest of 20 CPU cores was created to run the disk workload of
>> the AIM7 benchmark on both ext4 and xfs RAM disks at 3000 users on a
>> 3.14-rc6 based kernel. The JPM (jobs/minute) data of the test run were:
> You really should just delete that aim7 crap. A benchmark that runs for
> hours is _NOT_ usable.

The specific subtest that I used runs for only 10 seconds or so in my 
test box, as I used a ramdisk for all the filesystem accesses. With a 
physical disk, it will be much slower.

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-13 20:05       ` Waiman Long
@ 2014-03-14  8:30         ` Peter Zijlstra
  -1 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2014-03-14  8:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On Thu, Mar 13, 2014 at 04:05:19PM -0400, Waiman Long wrote:
> On 03/13/2014 11:15 AM, Peter Zijlstra wrote:
> >On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote:
> >>+static inline void arch_spin_lock(struct qspinlock *lock)
> >>+{
> >>+	if (static_key_false(&paravirt_unfairlocks_enabled))
> >>+		queue_spin_lock_unfair(lock);
> >>+	else
> >>+		queue_spin_lock(lock);
> >>+}
> >So I would have expected something like:
> >
> >	if (static_key_false(&paravirt_spinlock)) {
> >		while (!queue_spin_trylock(lock))
> >			cpu_relax();
> >		return;
> >	}
> >
> >At the top of queue_spin_lock_slowpath().
> 
> I don't like the idea of constantly spinning on the lock. That can cause all
> sorts of performance issues.

It's bloody virt; _that_ is a performance issue to begin with.

Anybody half sane stops using virt (esp. if they care about
performance).

> My version of the unfair lock tries to grab the
> lock regardless of whether there are others waiting in the queue or not. So
> instead of doing a cmpxchg of the whole 32-bit word, I just do a cmpxchg of
> the lock byte in the unfair version. A CPU has only one chance to steal the
> lock. If it can't, it will be lined up in the queue just like in the fair
> version. It is not as unfair as the other unfair locking schemes that spin
> on the lock repeatedly. So lock starvation should be less of a problem.
> 
> On the other hand, it may not perform as well as the other unfair locking
> schemes. It is a compromise to provide some lock unfairness without
> sacrificing the good cacheline behavior of the queue spinlock.

But but but,.. any kind of queueing gets you into a world of hurt with
virt.

The simple test-and-set lock (as per the above) still sucks due to lock
holder preemption, but at least the suckage doesn't queue. Because with
queueing you not only have to worry about the lock holder getting
preempted, but also the waiter(s).

Take the situation of 3 (v)CPUs where cpu0 holds the lock but is
preempted. cpu1 queues, cpu2 queues. Then cpu1 gets preempted, after
which cpu0 gets back online.

The simple test-and-set lock will now let cpu2 acquire. Your queue
however will just sit there spinning, waiting for cpu1 to come back from
holiday.

I think you're way over engineering this. Just do the simple
test-and-set lock for virt && !paravirt (as I think Paolo Bonzini
suggested RHEL6 already does).
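
Concretely, the suggestion amounts to something like the following at the 
top of the slowpath (a sketch only; the jump-label name is assumed and the 
signature is simplified):

        void queue_spin_lock_slowpath(struct qspinlock *lock)
        {
                /*
                 * Sketch of the suggested bypass: in a guest, skip the
                 * queue entirely and degrade to a plain test-and-set lock
                 * so that any running vCPU can take the lock even when the
                 * holder or another waiter is preempted.
                 */
                if (static_key_false(&virt_unfairlocks_enabled)) {  /* name assumed */
                        while (!queue_spin_trylock(lock))
                                cpu_relax();
                        return;
                }

                /* ... normal queueing code below ... */
        }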

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM
  2014-03-13 19:13     ` Waiman Long
@ 2014-03-14  8:42         ` Paolo Bonzini
  2014-03-14  8:42         ` Paolo Bonzini
  1 sibling, 0 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-14  8:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, linux-kernel, Gleb Natapov,
	Peter Zijlstra, virtualization, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Alok Kataria, linux-arch, kvm, x86,
	Ingo Molnar, xen-devel, Paul E. McKenney, Arnd Bergmann,
	Scott J Norton, Steven Rostedt, Chris Wright, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu Vinod, Boris Ostrovsky,
	Oleg Nesterov

Il 13/03/2014 20:13, Waiman Long ha scritto:
>>>
>>
>> This should also disable the unfair path.
>>
>> Paolo
>>
>
> The unfair lock uses a different jump label and does not require any
> special PV ops. There is a separate init function for that.

Yeah, what I mean is that the patches that enable paravirtualization 
should also take care of decreasing the unfair-lock jump label when 
paravirtualization is enabled.
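
In other words, whatever init code flips paravirt_spinlocks_enabled on would 
also drop the unfair-lock key. A minimal sketch, borrowing the init-function 
name and feature check from the PV ticketlock code (both are assumptions 
here):

        void __init kvm_spinlock_init(void)
        {
                if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
                        return;

                /* enable the PV (halt/kick) path ... */
                static_key_slow_inc(&paravirt_spinlocks_enabled);

                /* ... and stop using the unfair byte-steal path, if it was on */
                if (static_key_enabled(&paravirt_unfairlocks_enabled))
                        static_key_slow_dec(&paravirt_unfairlocks_enabled);
        }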

Paolo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
@ 2014-03-14  8:48           ` Paolo Bonzini
  0 siblings, 0 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-14  8:48 UTC (permalink / raw)
  To: Peter Zijlstra, Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Gleb Natapov,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, kvm, x86, Alok Kataria, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Boris Ostrovsky, Aswin Chandramouleeswaran,
	Oleg Nesterov

Il 14/03/2014 09:30, Peter Zijlstra ha scritto:
> Take the situation of 3 (v)CPUs where cpu0 holds the lock but is
> preempted. cpu1 queues, cpu2 queues. Then cpu1 gets preempted, after
> which cpu0 gets back online.
>
> The simple test-and-set lock will now let cpu2 acquire. Your queue
> however will just sit there spinning, waiting for cpu1 to come back from
> holiday.
>
> I think you're way over engineering this. Just do the simple
> test-and-set lock for virt && !paravirt (as I think Paolo Bonzini
> suggested RHEL6 already does).

Exactly.

Paolo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-03-13 19:49         ` Waiman Long
  (?)
  (?)
@ 2014-03-14  9:44         ` Paolo Bonzini
  -1 siblings, 0 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-14  9:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: David Vrabel, Jeremy Fitzhardinge, Raghavendra K T, kvm,
	Peter Zijlstra, virtualization, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Thomas Gleixner, linux-arch, Gleb Natapov,
	x86, Ingo Molnar, xen-devel, Paul E. McKenney, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Scott J Norton,
	Steven Rostedt, Chris Wright, Oleg Nesterov, Alok Kataria

Il 13/03/2014 20:49, Waiman Long ha scritto:
> On 03/13/2014 09:57 AM, Paolo Bonzini wrote:
>> Il 13/03/2014 12:21, David Vrabel ha scritto:
>>> On 12/03/14 18:54, Waiman Long wrote:
>>>> This patch adds para-virtualization support to the queue spinlock in
>>>> the same way as was done in the PV ticket lock code. In essence, the
>>>> lock waiters will spin for a specified number of times (QSPIN_THRESHOLD
>>>> = 2^14) and then halt themselves. The queue head waiter will spin
>>>> 2*QSPIN_THRESHOLD times before halting itself. When it has spun
>>>> QSPIN_THRESHOLD times, the queue head will assume that the lock
>>>> holder may have been scheduled out and attempt to kick the lock
>>>> holder CPU if it has the CPU number on hand.
>>>
>>> I don't really understand the reasoning for kicking the lock holder.
>>
>> I agree.  If the lock holder isn't running, there's probably a good
>> reason for that and going to sleep will not necessarily convince the
>> scheduler to give more CPU to the lock holder.  I think there are two
>> choices:
>>
>> 1) use yield_to to donate part of the waiter's quantum to the lock
>> holder?    For this we probably need a new, separate hypercall
>> interface.  For KVM it would be the same as hlt in the guest but with
>> an additional yield_to in the host.
>>
>> 2) do nothing, just go to sleep.
>>
>> Could you get (or do you have) numbers for (2)?
> 
> I will take out the lock holder kick portion from the patch. I will also
> try to collect more test data.
> 
>>
>> More important, I think a barrier is missing:
>>
>>     Lock holder ---------------------------------------
>>
>>     // queue_spin_unlock
>>     barrier();
>>     ACCESS_ONCE(qlock->lock) = 0;
>>     barrier();
>>
> 
> This is not the unlock code that is used when PV spinlock is enabled.

It is __queue_spin_unlock.  But you're right:

>         if (static_key_false(&paravirt_spinlocks_enabled)) {
>                 /*
>                  * Need to atomically clear the lock byte to avoid racing with
>                  * queue head waiter trying to set _QSPINLOCK_LOCKED_SLOWPATH.
>                  */
>                 if (likely(cmpxchg(&qlock->lock, _QSPINLOCK_LOCKED, 0)
>                                 == _QSPINLOCK_LOCKED))
>                         return;
>                 else
>                         queue_spin_unlock_slowpath(lock);
> 
>         } else {
>                 __queue_spin_unlock(lock);
>         }

... indeed the __queue_spin_unlock/pv_kick_node pair is only done if the
waiter has already written _QSPINLOCK_LOCKED_SLOWPATH, and this means
that the lock holder must also observe PV_CPU_HALTED.

So this is correct:

>> Nothing protects from writing qlock->lock before pv->cpustate is read,

but this cannot happen:

>> leading to this:
>>
>>     Lock holder            Waiter
>>     ---------------------------------------------------------------
>>     read pv->cpustate
>>         (it is PV_CPU_ACTIVE)
>>                     pv->cpustate = PV_CPU_HALTED
>>                     lockval = cmpxchg(...)
>>                     hibernate()
>>     qlock->lock = 0
>>     if (pv->cpustate != PV_CPU_HALTED)
>>         return;
>>
> 
> The lock holder will read cpustate only if the lock byte has been
> changed to _QSPINLOCK_LOCKED_SLOWPATH. So the setting of the lock byte
> synchronizes the two threads.

Yes.

> The only thing I am not certain about is the case where
> the waiter is trying to go to sleep while, at the same time, the lock
> holder is trying to kick it. Will there be a missed wakeup because of
> this timing issue?

This is okay.  The kick_cpu hypercall is sticky until the next halt, if
no halt is pending.  Otherwise, pv ticketlocks would have the same issue.

Paolo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks
  2014-03-13 13:57       ` Peter Zijlstra
@ 2014-03-17 17:23         ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-17 17:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On 03/13/2014 09:57 AM, Peter Zijlstra wrote:
> On Wed, Mar 12, 2014 at 03:08:24PM -0400, Waiman Long wrote:
>> On 03/12/2014 02:54 PM, Waiman Long wrote:
>>> +		/*
>>> +		 * Set the lock bit&   clear the waiting bit simultaneously
>>> +		 * It is assumed that there is no lock stealing with this
>>> +		 * quick path active.
>>> +		 *
>>> +		 * A direct memory store of _QSPINLOCK_LOCKED into the
>>> +		 * lock_wait field causes problem with the lockref code, e.g.
>>> +		 *   ACCESS_ONCE(qlock->lock_wait) = _QSPINLOCK_LOCKED;
>>> +		 *
>>> +		 * It is not currently clear why this happens. A workaround
>>> +		 * is to use atomic instruction to store the new value.
>>> +		 */
>>> +		{
>>> +			u16 lw = xchg(&qlock->lock_wait, _QSPINLOCK_LOCKED);
>>> +			BUG_ON(lw != _QSPINLOCK_WAITING);
>>> +		}
>> It was found that when I used a direct memory store instead of an atomic op,
>> the following kernel crash might happen at filesystem dismount time:
>>
>> [ 1529.936714] Call Trace:
>> [ 1529.936714]  [<ffffffff811c2d03>] d_walk+0xc3/0x260
>> [ 1529.936714]  [<ffffffff811c1770>] ? check_and_collect+0x30/0x30
>> [ 1529.936714]  [<ffffffff811c3985>] shrink_dcache_for_umount+0x75/0x120
>> [ 1529.936714]  [<ffffffff811adf21>] generic_shutdown_super+0x21/0xf0
>> [ 1529.936714]  [<ffffffff811ae207>] kill_block_super+0x27/0x70
>> [ 1529.936714]  [<ffffffff811ae4ed>] deactivate_locked_super+0x3d/0x60
>> [ 1529.936714]  [<ffffffff811aea96>] deactivate_super+0x46/0x60
>> [ 1529.936714]  [<ffffffff811ca277>] mntput_no_expire+0xa7/0x140
>> [ 1529.936714]  [<ffffffff811cb6ce>] SyS_umount+0x8e/0x100
>> [ 1529.936714]  [<ffffffff815d2c29>] system_call_fastpath+0x16/0x1b
>> It was more readily reproducible in a KVM guest. It was harder to reproduce
>> in a bare metal machine, but the kernel crash still happened after several
>> tries.
>>
>> I am not sure what exactly caused this crash, but it has something to
>> do with the interaction between the lockref and the qspinlock code. I would
>> like more eyes on that to find the root cause of it.
> I cannot reproduce with my series that has the one word write.
>
> What I did was I made my swap partition (who needs that anyway on a
> machine with 16G of memory) into an XFS partition.
>
> Then I copied my linux.git onto it and unmounted.
>
> I'll try a few more times; the above trace seems to suggest it happens
> during dcache cleanup, so I suppose I should read the filesystem some
> and unmount again.
>
> Is there anything specific you did to make it go bang?

I have found the reason for the crash: it has to do with my original 
definition of the queue_spin_value_unlocked() function. When I extended 
it to cover the first 2 bytes (lock byte + wait bit), the problem went 
away.
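
For reference, the shape of that fix is roughly the following sketch, 
assuming the lock byte and the wait bit occupy the low 16 bits of the word 
(the field and mask names are illustrative, not the actual patch):

        #define _QSPINLOCK_LW_MASK      0xffff  /* lock byte + wait bit (assumed layout) */

        static inline bool queue_spin_value_unlocked(struct qspinlock lock)
        {
                /*
                 * lockref spins with a cmpxchg on { lock, count } and must
                 * only see the spinlock as free when both the lock byte and
                 * the wait bit are clear; checking the lock byte alone can
                 * race with the 2-task quick path parked on the wait bit.
                 */
                return !(atomic_read(&lock.qlcode) & _QSPINLOCK_LW_MASK);
        }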

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-14  8:30         ` Peter Zijlstra
@ 2014-03-17 17:44           ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-17 17:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On 03/14/2014 04:30 AM, Peter Zijlstra wrote:
> On Thu, Mar 13, 2014 at 04:05:19PM -0400, Waiman Long wrote:
>> On 03/13/2014 11:15 AM, Peter Zijlstra wrote:
>>> On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote:
>>>> +static inline void arch_spin_lock(struct qspinlock *lock)
>>>> +{
>>>> +	if (static_key_false(&paravirt_unfairlocks_enabled))
>>>> +		queue_spin_lock_unfair(lock);
>>>> +	else
>>>> +		queue_spin_lock(lock);
>>>> +}
>>> So I would have expected something like:
>>>
>>> 	if (static_key_false(&paravirt_spinlock)) {
>>> 		while (!queue_spin_trylock(lock))
>>> 			cpu_relax();
>>> 		return;
>>> 	}
>>>
>>> At the top of queue_spin_lock_slowpath().
>> I don't like the idea of constantly spinning on the lock. That can cause all
>> sorts of performance issues.
> It's bloody virt; _that_ is a performance issue to begin with.
>
> Anybody half sane stops using virt (esp. if they care about
> performance).
>
>> My version of the unfair lock tries to grab the
>> lock regardless of whether there are others waiting in the queue or not. So
>> instead of doing a cmpxchg of the whole 32-bit word, I just do a cmpxchg of
>> the lock byte in the unfair version. A CPU has only one chance to steal the
>> lock. If it can't, it will be lined up in the queue just like in the fair
>> version. It is not as unfair as the other unfair locking schemes that spin
>> on the lock repeatedly. So lock starvation should be less of a problem.
>>
>> On the other hand, it may not perform as well as the other unfair locking
>> schemes. It is a compromise to provide some lock unfairness without
>> sacrificing the good cacheline behavior of the queue spinlock.
> But but but,.. any kind of queueing gets you into a world of hurt with
> virt.
>
> The simple test-and-set lock (as per the above) still sucks due to lock
> holder preemption, but at least the suckage doesn't queue. Because with
> queueing you not only have to worry about the lock holder getting
> preempted, but also the waiter(s).
>
> Take the situation of 3 (v)CPUs where cpu0 holds the lock but is
> preempted. cpu1 queues, cpu2 queues. Then cpu1 gets preempted, after
> which cpu0 gets back online.
>
> The simple test-and-set lock will now let cpu2 acquire. Your queue
> however will just sit there spinning, waiting for cpu1 to come back from
> holiday.
>
> I think you're way over engineering this. Just do the simple
> test-and-set lock for virt && !paravirt (as I think Paolo Bonzini
> suggested RHEL6 already does).

The PV ticketlock code was designed to handle lock holder preemption by 
redirecting the CPU resources of a preempted guest to another guest that 
can make better use of them, and by returning the preempted CPU sooner.

Using a simple test-and-set lock will not allow us to enable this PV 
spinlock functionality, as there is no structure to decide who does what. 
I can extend the current unfair lock code to allow those waiting in the 
queue to also attempt to steal the lock, though at a lesser frequency, so 
that the queue head has a higher chance of getting the lock. This will 
solve the lock waiter preemption problem that you are worried about. It 
does make the code a bit more complex, but it allows us to enable both 
the unfair lock and the PV spinlock code together to solve the lock 
waiter and lock holder preemption problems.

-Longman
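
For illustration, a minimal sketch of the scheme described above: a single
byte-wide cmpxchg attempt to steal the lock, falling back to the normal queue
on failure. The lock-byte layout (low byte of the word) and the exact slowpath
signature are assumptions here, not the series' actual definitions.

/* Illustrative sketch only; layout and helper signature are assumptions. */
static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
{
	u8 *lbyte = (u8 *)lock;		/* assumed: lock byte sits at offset 0 */

	/*
	 * One-shot steal: cmpxchg just the lock byte instead of the whole
	 * 32-bit word, ignoring any waiters already queued.
	 */
	if (likely(cmpxchg(lbyte, 0, 1) == 0))
		return;

	/* Steal failed: line up in the queue exactly like the fair version. */
	queue_spin_lock_slowpath(lock);
}

The extension proposed above would additionally let a queued waiter retry the
same byte cmpxchg every so often, so the queue head keeps the best chance of
winning; a sketch of that variant appears further down the thread.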

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM
  2014-03-14  8:42         ` Paolo Bonzini
@ 2014-03-17 17:47           ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-17 17:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Arnd Bergmann, Scott J Norton,
	Steven Rostedt, Chris Wright, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu Vinod, Boris Ostrovsky,
	Oleg Nesterov, linux-kernel

On 03/14/2014 04:42 AM, Paolo Bonzini wrote:
> On 13/03/2014 20:13, Waiman Long wrote:
>>>>
>>>
>>> This should also disable the unfair path.
>>>
>>> Paolo
>>>
>>
>> The unfair lock uses a different jump label and does not require any
>> special PV ops. There is a separate init function for that.
>
> Yeah, what I mean is that the patches that enable paravirtualization 
> should also take care of decreasing the unfair-lock jump label when 
> paravirtualization is enabled.
>
> Paolo

As there are people who don't like unfair locks at all, I prefer to give 
them the option to turn this on or off instead of forcing them to always 
use an unfair lock in a PV guest.

-Longman
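
For reference, Paolo's suggestion amounts to something like the following in a
hypervisor's PV-spinlock init path. The init-hook name and the KVM feature
check are assumptions used only for illustration; paravirt_unfairlocks_enabled
is the static key already shown earlier in the thread.

/* Sketch only: assumed hook and feature names, not the actual KVM patch. */
extern struct static_key paravirt_unfairlocks_enabled;	/* from the series */

static void __init pv_qspinlock_init(void)
{
	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
		return;			/* no PV halt/kick: leave the unfair key alone */

	/*
	 * Paolo's suggestion: once PV waiting is available, drop the
	 * unfair-lock key (incremented earlier at boot) so arch_spin_lock()
	 * goes back to the fair, PV-aware queue spinlock.
	 */
	if (static_key_enabled(&paravirt_unfairlocks_enabled))
		static_key_slow_dec(&paravirt_unfairlocks_enabled);

	/* ...then register the PV halt/kick hooks here. */
}

Waiman's preference above would instead keep this behind a separate config or
boot-time option rather than tying it unconditionally to PV availability.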

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-17 17:44           ` Waiman Long
@ 2014-03-17 18:54             ` Peter Zijlstra
  -1 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2014-03-17 18:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On Mon, Mar 17, 2014 at 01:44:34PM -0400, Waiman Long wrote:
> The PV ticketlock code was designed to handle lock holder preemption by
> redirecting CPU resources in a preempted guest to another guest that can
> better use it and then return the preempted CPU back sooner.

But that's the PV code, not the unfair bit. And your fuller PV thing
doesn't need the unfair option.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
@ 2014-03-17 19:05         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 135+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-03-17 19:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Gleb Natapov,
	Peter Zijlstra, virtualization, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Alok Kataria, linux-arch, kvm, x86,
	Ingo Molnar, xen-devel, Paul E. McKenney, Rik van Riel,
	Arnd Bergmann, Scott J Norton, Steven Rostedt, Chris Wright,
	Thomas Gleixner, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, Oleg Nesterov

On Thu, Mar 13, 2014 at 02:16:06PM +0100, Paolo Bonzini wrote:
> On 13/03/2014 11:54, David Vrabel wrote:
> >On 12/03/14 18:54, Waiman Long wrote:
> >>Locking is always an issue in a virtualized environment as the virtual
> >>CPU that is waiting on a lock may get scheduled out and hence block
> >>any progress in lock acquisition even when the lock has been freed.
> >>
> >>One solution to this problem is to allow unfair lock in a
> >>para-virtualized environment. In this case, a new lock acquirer can
> >>come and steal the lock if the next-in-line CPU to get the lock is
> >>scheduled out. Unfair lock in a native environment is generally not a
> >>good idea as there is a possibility of lock starvation for a heavily
> >>contended lock.
> >
> >I do not think this is a good idea -- the problems with unfair locks are
> >worse in a virtualized guest.  If a waiting VCPU deschedules and has to
> >be kicked to grab a lock then it is very likely to lose a race with
> >another running VCPU trying to take a lock (since it takes time for the
> >VCPU to be rescheduled).
> 
> Actually, I think the unfair version should be automatically
> selected if running on a hypervisor.  Per-hypervisor pvops can
> choose to enable the fair one.
> 
> Lock unfairness may be particularly evident on a virtualized guest
> when the host is overcommitted, but problems with fair locks are
> even worse.
> 
> In fact, RHEL/CentOS 6 already uses unfair locks if
> X86_FEATURE_HYPERVISOR is set.  The patch was rejected upstream in
> favor of pv ticketlocks, but pv ticketlocks do not cover all
> hypervisors so perhaps we could revisit that choice.
> 
> Measurements were done by Gleb for two guests running 2.6.32 with 16
> vcpus each, on a 16-core system.  One guest ran with unfair locks,
> one guest ran with fair locks.  Two kernel compilations ("time make

And when you say fair locks, are you saying PV ticketlocks or generic
ticketlocks?
> -j 16 all") were started at the same time on both guests, and times
> were as follows:
> 
>     unfair:                         fair:
>     real 13m34.674s                 real 19m35.827s
>     user 96m2.638s                  user 102m38.665s
>     sys 56m14.991s                  sys 158m22.470s
> 
>     real 13m3.768s                  real 19m4.375s
>     user 95m34.509s                 user 111m9.903s
>     sys 53m40.550s                  sys 141m59.370s
> 
> Actually, interpreting the numbers shows an even worse slowdown.
> 
> Compilation took ~6.5 minutes in a guest when the host was not
> overcommitted, and with unfair locks everything scaled just fine.

You should see the same values with the PV ticketlock. It is not clear
to me whether this testing included that variant of locks.

> 
> Ticketlocks fell completely apart; during the first 13 minutes they
> were allotted 16*6.5=104 minutes of CPU time, and they spent almost
> all of it spinning in the kernel (102 minutes in the first run).

Right, the non-PV variants of them do fall apart. That is why
PV ticketlocks are so nice.

> They did perhaps 30 seconds worth of work because, as soon as the
> unfair-lock guest finished and the host was no longer overcommitted,
> compilation finished in 6 minutes.
> 
> So that's approximately 12x slowdown from using non-pv fair locks
> (vs. unfair locks) on a 200%-overcommitted host.

Ah, so it was non-PV.

I am curious whether the results would be any different if you tested PV
ticketlocks vs. the Red Hat variant of unfair locks.

> 
> Paolo
> 
> >>With the unfair locking activated on bare metal 4-socket Westmere-EX
> >>box, the execution times (in ms) of a spinlock micro-benchmark were
> >>as follows:
> >>
> >>  # of    Ticket      Fair       Unfair
> >>  tasks    lock    queue lock  queue lock
> >>  ------  -------  ----------  ----------
> >>    1       135        135         137
> >>    2      1045       1120         747
> >>    3      1827       2345        1084
> >>    4      2689       2934        1438
> >>    5      3736       3658        1722
> >>    6      4942       4434        2092
> >>    7      6304       5176        2245
> >>    8      7736       5955        2388
> >
> >Are these figures with or without the later PV support patches?
> 
> 

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-17 17:44           ` Waiman Long
@ 2014-03-17 19:10             ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 135+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-03-17 19:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Scott J Norton, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Alok Kataria, Aswin Chandramouleeswaran, Chegu Vinod,
	Boris Ostrovsky

On Mon, Mar 17, 2014 at 01:44:34PM -0400, Waiman Long wrote:
> On 03/14/2014 04:30 AM, Peter Zijlstra wrote:
> >On Thu, Mar 13, 2014 at 04:05:19PM -0400, Waiman Long wrote:
> >>On 03/13/2014 11:15 AM, Peter Zijlstra wrote:
> >>>On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote:
> >>>>+static inline void arch_spin_lock(struct qspinlock *lock)
> >>>>+{
> >>>>+	if (static_key_false(&paravirt_unfairlocks_enabled))
> >>>>+		queue_spin_lock_unfair(lock);
> >>>>+	else
> >>>>+		queue_spin_lock(lock);
> >>>>+}
> >>>So I would have expected something like:
> >>>
> >>>	if (static_key_false(&paravirt_spinlock)) {
> >>>		while (!queue_spin_trylock(lock))
> >>>			cpu_relax();
> >>>		return;
> >>>	}
> >>>
> >>>At the top of queue_spin_lock_slowpath().
> >>I don't like the idea of constantly spinning on the lock. That can cause all
> >>sorts of performance issues.
> >It's bloody virt; _that_ is a performance issue to begin with.
> >
> >Anybody half sane stops using virt (esp. if they care about
> >performance).
> >
> >>My version of the unfair lock tries to grab the
> >>lock regardless of whether there are others waiting in the queue. So instead of
> >>doing a cmpxchg of the whole 32-bit word, I just do a cmpxchg of the
> >>lock byte in the unfair version. A CPU has only one chance to steal the
> >>lock. If it can't, it will be lined up in the queue just like the fair
> >>version. It is not as unfair as the other unfair locking schemes that spin
> >>on the lock repetitively, so lock starvation should be less of a problem.
> >>
> >>On the other hand, it may not perform as well as the other unfair locking
> >>schemes. It is a compromise to provide some lock unfairness without
> >>sacrificing the good cacheline behavior of the queue spinlock.
> >But but but,.. any kind of queueing gets you into a world of hurt with
> >virt.
> >
> >The simple test-and-set lock (as per the above) still sucks due to lock
> >holder preemption, but at least the suckage doesn't queue. Because with
> >queueing you not only have to worry about the lock holder getting
> >preempted, but also the waiter(s).
> >
> >Take the situation of 3 (v)CPUs where cpu0 holds the lock but is
> >preempted. cpu1 queues, cpu2 queues. Then cpu1 gets preempted, after
> >which cpu0 gets back online.
> >
> >The simple test-and-set lock will now let cpu2 acquire. Your queue
> >however will just sit there spinning, waiting for cpu1 to come back from
> >holiday.
> >
> >I think you're way over engineering this. Just do the simple
> >test-and-set lock for virt && !paravirt (as I think Paolo Bonzini
> >suggested RHEL6 already does).
> 
> The PV ticketlock code was designed to handle lock holder preemption
> by redirecting CPU resources in a preempted guest to another guest
> that can better use it and then return the preempted CPU back
> sooner.
> 
> Using a simple test-and-set lock will not allow us to enable this PV
> spinlock functionality as there is no structure to decide who does
> what. I can extend the current unfair lock code to allow those

And what would be needed to do 'decide who does what'?

> waiting in the queue to also attempt to steal the lock, though at a
> lesser frequency so that the queue head has a higher chance of
> getting the lock. This will solve the lock waiter preemption problem
> that you worry about. This does make the code a bit more complex,
> but it allows us to enable both the unfair lock and the PV spinlock
> code together to solve the lock waiter and lock holder preemption
> problems.
> 
> -Longman
> 
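
To make the "lesser frequency" idea above concrete, here is one way a queued
waiter could occasionally retry the steal. The node type, field names and the
retry period are assumptions, not the series' code; the part the real
implementation would still have to handle is unlinking a node that stole the
lock from the middle of the queue, which is where the extra complexity lives.

/* Sketch only: node type, field names and retry period are assumptions. */
#define UNFAIR_STEAL_PERIOD	128	/* assumed: how rarely a queued waiter retries */

struct qnode {
	int locked;		/* set by the previous node when we become queue head */
};

static bool queue_wait_or_steal(struct qspinlock *lock, struct qnode *node)
{
	unsigned int cnt = 0;

	while (!smp_load_acquire(&node->locked)) {
		if ((++cnt % UNFAIR_STEAL_PERIOD) == 0 &&
		    cmpxchg((u8 *)lock, 0, 1) == 0)
			return true;	/* stole the lock; node still has to be unqueued */
		cpu_relax();
	}
	return false;			/* became queue head the normal way */
}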

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
@ 2014-03-18  8:14           ` Paolo Bonzini
  0 siblings, 0 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-18  8:14 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Scott J Norton, Steven Rostedt, Chris Wright, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	Oleg Nesterov

On 17/03/2014 20:05, Konrad Rzeszutek Wilk wrote:
>> > Measurements were done by Gleb for two guests running 2.6.32 with 16
>> > vcpus each, on a 16-core system.  One guest ran with unfair locks,
>> > one guest ran with fair locks.  Two kernel compilations ("time make
> And when you say fair locks, are you saying PV ticketlocks or generic
> ticketlocks?

Generic, of course.

> You should see the same values with the PV ticketlock. It is not clear
> to me whether this testing included that variant of locks.

Yes, PV is fine.  But up to this point of the series, we are concerned 
about spinlock performance when running on an overcommitted hypervisor 
that doesn't support PV spinlocks.

Paolo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
@ 2014-03-18  8:16               ` Paolo Bonzini
  0 siblings, 0 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-18  8:16 UTC (permalink / raw)
  To: Peter Zijlstra, Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Gleb Natapov,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, kvm, x86, Alok Kataria, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Boris Ostrovsky, Aswin Chandramouleeswaran,
	Oleg Nesterov

On 17/03/2014 19:54, Peter Zijlstra wrote:
> On Mon, Mar 17, 2014 at 01:44:34PM -0400, Waiman Long wrote:
>> The PV ticketlock code was designed to handle lock holder preemption by
>> redirecting CPU resources in a preempted guest to another guest that can
>> better use it and then return the preempted CPU back sooner.
>
> But that's the PV code, not the unfair bit. And your fuller PV thing
> doesn't need the unfair option.
>

I agree.  You need three cases:

* non-virt, non-PV: regular qspinlock

* virt (X86_FEATURE_HYPERVISOR), non-PV: test-and-set

* virt, PV: disables test-and-set, adds PV waiting to qspinlock

Paolo
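
A sketch of how these three cases could be selected at lock time; the static
key name here is an assumption, while queue_spin_trylock()/queue_spin_lock()
are the helpers already quoted earlier in the thread.

/* Sketch of the three cases; virt_unfair_key is an assumed name. */
extern struct static_key virt_unfair_key;

static __always_inline void arch_spin_lock(struct qspinlock *lock)
{
	/*
	 * Case 2 -- virt (X86_FEATURE_HYPERVISOR) without PV support:
	 * plain test-and-set, no queueing at all.
	 */
	if (static_key_false(&virt_unfair_key)) {
		while (!queue_spin_trylock(lock))
			cpu_relax();
		return;
	}

	/*
	 * Cases 1 and 3 -- bare metal, or virt with PV support:
	 * the regular qspinlock, with any PV waiting inside the slowpath.
	 */
	queue_spin_lock(lock);
}

The key would be set at boot when X86_FEATURE_HYPERVISOR is detected and
cleared again by the KVM or Xen PV init code, which is what "disables
test-and-set" amounts to.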

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM
  2014-03-17 17:47           ` Waiman Long
@ 2014-03-18  8:18             ` Paolo Bonzini
  -1 siblings, 0 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-18  8:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, linux-kernel, kvm,
	Peter Zijlstra, virtualization, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Alok Kataria, linux-arch, Gleb Natapov, x86,
	Ingo Molnar, xen-devel, Paul E. McKenney, Arnd Bergmann,
	Scott J Norton, Steven Rostedt, Chris Wright, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu Vinod, Boris Ostrovsky,
	Oleg Nesterov

Il 17/03/2014 18:47, Waiman Long ha scritto:
>> Yeah, what I mean is that the patches that enable paravirtualization
>> should also take care of decreasing the unfair-lock jump label when
>> paravirtualization is enabled.
>
> As there are people who don't like unfair lock at all, I prefer to give
> them the option to turn this on or off instead of forcing them to always
> use unfair lock in a PV guest.

I understand, but this is virtualization after all.  A bad scheduler 
decision means no progress at all.  That's much worse than unfairness.

And also, KVM and Xen will stay fair, so if you want fairness just pick 
a hypervisor that supports it.

Paolo
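
A rough sketch of the jump-label handling suggested at the top of this
message, with assumed names: once the PV spinlock hooks have been
installed, the unfair-lock key is dropped again so a PV-aware guest never
takes the unfair path.

static __init void pv_qspinlock_installed(void)
{
	/* ... PV wait/kick hooks already registered with the hypervisor ... */

	/* drop the key only if the unfair path was enabled earlier at boot */
	if (static_key_enabled(&paravirt_unfairlocks_enabled))
		static_key_slow_dec(&paravirt_unfairlocks_enabled);
}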

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-17 18:54             ` Peter Zijlstra
@ 2014-03-19  3:08               ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-19  3:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu

On 03/17/2014 02:54 PM, Peter Zijlstra wrote:
> On Mon, Mar 17, 2014 at 01:44:34PM -0400, Waiman Long wrote:
>> The PV ticketlock code was designed to handle lock holder preemption by
>> redirecting CPU resources in a preempted guest to another guest that can
>> better use it and then return the preempted CPU back sooner.
> But that's the PV code, not the unfair bit. And your fuller PV thing
> doesn't need the unfair option.

What I want to try out is combining the PV thing with the unfair
lock and seeing how they perform together. I have set up 2 virtual guests
sharing 20 vCPUs (200% overcommit). Preliminary testing showed that the
unfair lock was a bit faster than PV, but PV seems to be a bit more
energy efficient (less total sys+user time).  I will have more data to
share tomorrow.

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-17 19:10             ` Konrad Rzeszutek Wilk
@ 2014-03-19  3:11               ` Waiman Long
  -1 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-19  3:11 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Scott J Norton, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Alok Kataria, Aswin Chandramouleeswaran, Chegu Vinod,
	Boris Ostrovsky

On 03/17/2014 03:10 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 17, 2014 at 01:44:34PM -0400, Waiman Long wrote:
>> On 03/14/2014 04:30 AM, Peter Zijlstra wrote:
>>> On Thu, Mar 13, 2014 at 04:05:19PM -0400, Waiman Long wrote:
>>>> On 03/13/2014 11:15 AM, Peter Zijlstra wrote:
>>>>> On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote:
>>>>>> +static inline void arch_spin_lock(struct qspinlock *lock)
>>>>>> +{
>>>>>> +	if (static_key_false(&paravirt_unfairlocks_enabled))
>>>>>> +		queue_spin_lock_unfair(lock);
>>>>>> +	else
>>>>>> +		queue_spin_lock(lock);
>>>>>> +}
>>>>> So I would have expected something like:
>>>>>
>>>>> 	if (static_key_false(&paravirt_spinlock)) {
>>>>> 		while (!queue_spin_trylock(lock))
>>>>> 			cpu_relax();
>>>>> 		return;
>>>>> 	}
>>>>>
>>>>> At the top of queue_spin_lock_slowpath().
>>>> I don't like the idea of constantly spinning on the lock. That can cause all
>>>> sort of performance issues.
>>> Its bloody virt; _that_ is a performance issue to begin with.
>>>
>>> Anybody half sane stops using virt (esp. if they care about
>>> performance).
>>>
>>>> My version of the unfair lock tries to grab the
>>>> lock ignoring if there are others waiting in the queue or not. So instead of
>>>> the doing a cmpxchg of the whole 32-bit word, I just do a cmpxchg of the
>>>> lock byte in the unfair version. A CPU has only one chance to steal the
>>>> lock. If it can't, it will be lined up in the queue just like the fair
>>>> version. It is not as unfair as the other unfair locking schemes that spins
>>>> on the lock repetitively. So lock starvation should be less a problem.
>>>>
>>>> On the other hand, it may not perform as well as the other unfair locking
>>>> schemes. It is a compromise to provide some lock unfairness without
>>>> sacrificing the good cacheline behavior of the queue spinlock.
>>> But but but,.. any kind of queueing gets you into a world of hurt with
>>> virt.
>>>
>>> The simple test-and-set lock (as per the above) still sucks due to lock
>>> holder preemption, but at least the suckage doesn't queue. Because with
>>> queueing you not only have to worry about the lock holder getting
>>> preemption, but also the waiter(s).
>>>
>>> Take the situation of 3 (v)CPUs where cpu0 holds the lock but is
>>> preempted. cpu1 queues, cpu2 queues. Then cpu1 gets preempted, after
>>> which cpu0 gets back online.
>>>
>>> The simple test-and-set lock will now let cpu2 acquire. Your queue
>>> however will just sit there spinning, waiting for cpu1 to come back from
>>> holiday.
>>>
>>> I think you're way over engineering this. Just do the simple
>>> test-and-set lock for virt&&   !paravirt (as I think Paolo Bonzini
>>> suggested RHEL6 already does).
>> The PV ticketlock code was designed to handle lock holder preemption
>> by redirecting CPU resources in a preempted guest to another guest
>> that can better use it and then return the preempted CPU back
>> sooner.
>>
>> Using a simple test-and-set lock will not allow us to enable this PV
>> spinlock functionality as there is no structure to decide who does
>> what. I can extend the current unfair lock code to allow those
> And what would be needed to do 'decide who does what'?
>
>

Sorry for not being very clear in my previous mail. The current PV code will
halt the CPUs if the lock isn't acquired within a certain time. When the
lock holder comes back, it will wake up the next CPU in the ticket lock
queue. With a simple test-and-set lock, there is no structure to decide
which one to wake up. So you still need some sort of queue to do that;
you can't just wake up all of them and let them fight for the lock.

-Longman
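
Very roughly, the halt/wakeup pattern described above looks like the
sketch below; pv_wait()/pv_kick() stand in for the hypervisor halt and
kick hypercalls, struct qnode for the waiter's queue entry, and
SPIN_THRESHOLD for the spin budget, all assumed rather than taken from
the series.

#define SPIN_THRESHOLD	(1 << 15)		/* assumed spin budget */

static void pv_wait_node(struct qnode *node)
{
	int loop;

	for (loop = SPIN_THRESHOLD; loop; loop--) {
		if (smp_load_acquire(&node->locked))
			return;			/* got the lock while spinning */
		cpu_relax();
	}
	/* halt this vCPU until the unlocker kicks it (recheck loop omitted) */
	pv_wait(&node->locked);
}

static void pv_kick_node(struct qnode *node)
{
	if (node)
		pv_kick(node->cpu);		/* wake exactly the next queued waiter */
}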

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
@ 2014-03-19  3:15             ` Waiman Long
  0 siblings, 0 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-19  3:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, Raghavendra K T,
	Gleb Natapov, Peter Zijlstra, virtualization, Andi Kleen,
	H. Peter Anvin, Michel Lespinasse, Alok Kataria, linux-arch, kvm,
	x86, Ingo Molnar, xen-devel, Paul E. McKenney, Rik van Riel,
	Arnd Bergmann, Scott J Norton, Steven Rostedt, Chris Wright,
	Thomas Gleixner, Aswin Chandramouleeswaran, Chegu Vinod, Oleg

On 03/18/2014 04:14 AM, Paolo Bonzini wrote:
> Il 17/03/2014 20:05, Konrad Rzeszutek Wilk ha scritto:
>>> > Measurements were done by Gleb for two guests running 2.6.32 with 16
>>> > vcpus each, on a 16-core system.  One guest ran with unfair locks,
>>> > one guest ran with fair locks.  Two kernel compilations ("time make
>> And when you say fair locks are you saying PV ticketlocks or generic
>> ticketlocks?
>
> Generic, of course.
>
>> You should see the same values with the PV ticketlock. It is not clear
>> to me if this testing did include that variant of locks?
>
> Yes, PV is fine.  But up to this point of the series, we are concerned 
> about spinlock performance when running on an overcommitted hypervisor 
> that doesn't support PV spinlocks.

The unfair queue lock is designed in such a way that it will only be
activated when running in a PV guest; otherwise it won't be mergeable
upstream. So there must be some way to determine whether it is running
under a hypervisor.

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-19  3:15             ` Waiman Long
@ 2014-03-19 10:07             ` Paolo Bonzini
  2014-03-19 16:58               ` Waiman Long
                                 ` (2 more replies)
  -1 siblings, 3 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-19 10:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, Raghavendra K T,
	Gleb Natapov, Peter Zijlstra, virtualization, Andi Kleen,
	H. Peter Anvin, Michel Lespinasse, Alok Kataria, linux-arch, kvm,
	x86, Ingo Molnar, xen-devel, Paul E. McKenney, Rik van Riel,
	Arnd Bergmann, Scott J Norton, Steven Rostedt, Chris Wright,
	Thomas Gleixner, Aswin Chandramouleeswaran, Chegu Vinod

Il 19/03/2014 04:15, Waiman Long ha scritto:
>>> You should see the same values with the PV ticketlock. It is not clear
>>> to me if this testing did include that variant of locks?
>>
>> Yes, PV is fine.  But up to this point of the series, we are concerned
>> about spinlock performance when running on an overcommitted hypervisor
>> that doesn't support PV spinlocks.
>
> The unfair queue lock is designed in such a way that it will only be
> activated when running in a PV guest or it won't be mergeable upstream.
> So there must be some way to determine if it is running under a hypervisor.

Exactly.  What you want is boot_cpu_has(X86_FEATURE_HYPERVISOR).

Paolo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-19  3:11               ` Waiman Long
@ 2014-03-19 15:25                 ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 135+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-03-19 15:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Scott J Norton, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Alok Kataria, Aswin Chandramouleeswaran, Chegu Vinod,
	Boris Ostrovsky

On Tue, Mar 18, 2014 at 11:11:43PM -0400, Waiman Long wrote:
> On 03/17/2014 03:10 PM, Konrad Rzeszutek Wilk wrote:
> >On Mon, Mar 17, 2014 at 01:44:34PM -0400, Waiman Long wrote:
> >>On 03/14/2014 04:30 AM, Peter Zijlstra wrote:
> >>>On Thu, Mar 13, 2014 at 04:05:19PM -0400, Waiman Long wrote:
> >>>>On 03/13/2014 11:15 AM, Peter Zijlstra wrote:
> >>>>>On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote:
> >>>>>>+static inline void arch_spin_lock(struct qspinlock *lock)
> >>>>>>+{
> >>>>>>+	if (static_key_false(&paravirt_unfairlocks_enabled))
> >>>>>>+		queue_spin_lock_unfair(lock);
> >>>>>>+	else
> >>>>>>+		queue_spin_lock(lock);
> >>>>>>+}
> >>>>>So I would have expected something like:
> >>>>>
> >>>>>	if (static_key_false(&paravirt_spinlock)) {
> >>>>>		while (!queue_spin_trylock(lock))
> >>>>>			cpu_relax();
> >>>>>		return;
> >>>>>	}
> >>>>>
> >>>>>At the top of queue_spin_lock_slowpath().
> >>>>I don't like the idea of constantly spinning on the lock. That can cause all
> >>>>sort of performance issues.
> >>>Its bloody virt; _that_ is a performance issue to begin with.
> >>>
> >>>Anybody half sane stops using virt (esp. if they care about
> >>>performance).
> >>>
> >>>>My version of the unfair lock tries to grab the
> >>>>lock ignoring if there are others waiting in the queue or not. So instead of
> >>>>the doing a cmpxchg of the whole 32-bit word, I just do a cmpxchg of the
> >>>>lock byte in the unfair version. A CPU has only one chance to steal the
> >>>>lock. If it can't, it will be lined up in the queue just like the fair
> >>>>version. It is not as unfair as the other unfair locking schemes that spins
> >>>>on the lock repetitively. So lock starvation should be less a problem.
> >>>>
> >>>>On the other hand, it may not perform as well as the other unfair locking
> >>>>schemes. It is a compromise to provide some lock unfairness without
> >>>>sacrificing the good cacheline behavior of the queue spinlock.
> >>>But but but,.. any kind of queueing gets you into a world of hurt with
> >>>virt.
> >>>
> >>>The simple test-and-set lock (as per the above) still sucks due to lock
> >>>holder preemption, but at least the suckage doesn't queue. Because with
> >>>queueing you not only have to worry about the lock holder getting
> >>>preemption, but also the waiter(s).
> >>>
> >>>Take the situation of 3 (v)CPUs where cpu0 holds the lock but is
> >>>preempted. cpu1 queues, cpu2 queues. Then cpu1 gets preempted, after
> >>>which cpu0 gets back online.
> >>>
> >>>The simple test-and-set lock will now let cpu2 acquire. Your queue
> >>>however will just sit there spinning, waiting for cpu1 to come back from
> >>>holiday.
> >>>
> >>>I think you're way over engineering this. Just do the simple
> >>>test-and-set lock for virt&&   !paravirt (as I think Paolo Bonzini
> >>>suggested RHEL6 already does).
> >>The PV ticketlock code was designed to handle lock holder preemption
> >>by redirecting CPU resources in a preempted guest to another guest
> >>that can better use it and then return the preempted CPU back
> >>sooner.
> >>
> >>Using a simple test-and-set lock will not allow us to enable this PV
> >>spinlock functionality as there is no structure to decide who does
> >>what. I can extend the current unfair lock code to allow those
> >And what would be needed to do 'decide who does what'?
> >
> >
> 
> Sorry for not very clear in my previous mail. The current PV code
> will halt the CPUs if the lock isn't acquired in a certain time.
> When the locker holder come back, it will wake up the next CPU in
> the ticket lock queue. With a simple set and test lock, there is no
> structure to decide which one to wake up. So you still need to have
> some sort of queue to do that, you just can't wake up all of them
> and let them fight for the lock.

Why not? That is what bytelocks did, and it worked OK in the
virtualization environment. The reason it was OK is that the
hypervisor decided which VCPU to schedule, and that one would get the
lock. Granted, it did have a per-cpu what-lock-am-I-waiting-for
structure to aid in this.

In some form, the hypervisor serializes who is going to get
the lock, as it ultimately decides which of the kicked VCPUs
wakes up - and if the lock is not FIFO - it doesn't
matter which one gets it.

With the 'per-cpu-what-lock-I-am-waiting-for' structure you can also
round-robin which vCPU you kick to keep
it fair.

You might want to take a look at the PV bytelock that existed
in the Xen code prior to PV ticketlock (so 3.10) to see how
it did that.
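
For reference, that per-CPU bookkeeping looks roughly like the sketch
below (names assumed, loosely modelled on the old Xen PV bytelock;
kick_vcpu() stands in for the hypervisor wakeup call):

struct spinning {
	struct arch_spinlock *lock;	/* lock this vCPU is blocked on, or NULL */
};
static DEFINE_PER_CPU(struct spinning, spinning);

static void unlock_kick(struct arch_spinlock *lock)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (per_cpu(spinning, cpu).lock == lock) {
			kick_vcpu(cpu);	/* any waiter will do: the lock is not FIFO */
			break;
		}
	}
}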

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-19 10:07             ` Paolo Bonzini
  2014-03-19 16:58               ` Waiman Long
  2014-03-19 16:58               ` Waiman Long
@ 2014-03-19 16:58               ` Waiman Long
  2014-03-19 17:08                 ` Paolo Bonzini
  2014-03-19 17:08                   ` Paolo Bonzini
  2 siblings, 2 replies; 135+ messages in thread
From: Waiman Long @ 2014-03-19 16:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, Raghavendra K T,
	Gleb Natapov, Peter Zijlstra, virtualization, Andi Kleen,
	H. Peter Anvin, Michel Lespinasse, Alok Kataria, linux-arch, kvm,
	x86, Ingo Molnar, xen-devel, Paul E. McKenney, Rik van Riel,
	Arnd Bergmann, Scott J Norton, Steven Rostedt, Chris Wright,
	Thomas Gleixner, Aswin Chandramouleeswaran, Chegu Vinod

On 03/19/2014 06:07 AM, Paolo Bonzini wrote:
> Il 19/03/2014 04:15, Waiman Long ha scritto:
>>>> You should see the same values with the PV ticketlock. It is not clear
>>>> to me if this testing did include that variant of locks?
>>>
>>> Yes, PV is fine.  But up to this point of the series, we are concerned
>>> about spinlock performance when running on an overcommitted hypervisor
>>> that doesn't support PV spinlocks.
>>
>> The unfair queue lock is designed in such a way that it will only be
>> activated when running in a PV guest or it won't be mergeable upstream.
>> So there must be some way to determine if it is running under a 
>> hypervisor.
>
> Exactly.  What you want is boot_cpu_has(X86_FEATURE_HYPERVISOR).
>
> Paolo

The unfair lock is to be enabled by a boot-time check, not just by the
presence of a configuration macro during the build process, in order to
avoid using the unfair lock on bare metal. Of course, Linux distros can
modify this if that suits their needs.

-Longman

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
  2014-03-19 16:58               ` Waiman Long
@ 2014-03-19 17:08                 ` Paolo Bonzini
  2014-03-19 17:08                   ` Paolo Bonzini
  1 sibling, 0 replies; 135+ messages in thread
From: Paolo Bonzini @ 2014-03-19 17:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Arnd Bergmann, Scott J Norton,
	Steven Rostedt, Chris Wright, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu Vinod

On 19/03/2014 17:58, Waiman Long wrote:
>> Exactly.  What you want is boot_cpu_has(X86_FEATURE_HYPERVISOR).
>
> The unfair lock is to be enabled by a boot-time check, not just by the
> presence of a configuration macro during the build process, in order to
> avoid using the unfair lock on bare metal. Of course, Linux distros can
> modify this if that suits their needs.

"boot_cpu_has" is a run-time check.  You can use it after setup_arch has 
called init_hypervisor_platform and kvm_guest_init.  Can you just 
check if the PV path has been enabled and, if not, do a static_key_inc 
to enable the unfair path?
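
A rough sketch of that refinement (again with assumed names:
pv_spinlock_enabled() stands in for however the PV path reports itself,
and unfair_locks_enabled is the hypothetical key from the sketch above;
neither symbol is taken from the series):

#include <linux/init.h>
#include <linux/jump_label.h>
#include <asm/cpufeature.h>

extern struct static_key unfair_locks_enabled;

/* Assumed helper: true once the PV halt/kick spinlock path is in use. */
extern bool pv_spinlock_enabled(void);

/*
 * Run after setup_arch() has called init_hypervisor_platform() and
 * kvm_guest_init(), so both checks below see their final state.
 */
void __init unfair_lock_late_init(void)
{
	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return;		/* bare metal: keep the fair queue spinlock */

	if (pv_spinlock_enabled())
		return;		/* PV halt/kick path already copes with overcommit */

	/* Guest without PV spinlock support: allow the unfair path. */
	static_key_slow_inc(&unfair_locks_enabled);
}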

Paolo

^ permalink raw reply	[flat|nested] 135+ messages in thread

end of thread, other threads:[~2014-03-19 17:09 UTC | newest]

Thread overview: 135+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-12 18:54 [PATCH v6 00/11] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
2014-03-12 18:54 ` [PATCH v6 01/11] qspinlock: A generic 4-byte queue spinlock implementation Waiman Long
2014-03-12 18:54 ` Waiman Long
2014-03-12 18:54 ` [PATCH v6 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock Waiman Long
2014-03-12 18:54 ` Waiman Long
2014-03-12 18:54 ` [PATCH v6 03/11] qspinlock: More optimized code for smaller NR_CPUS Waiman Long
2014-03-12 18:54 ` Waiman Long
2014-03-12 18:54 ` [PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks Waiman Long
2014-03-12 19:08   ` Waiman Long
2014-03-12 19:08     ` Waiman Long
2014-03-13 13:57     ` Peter Zijlstra
2014-03-13 13:57     ` Peter Zijlstra
2014-03-13 13:57       ` Peter Zijlstra
2014-03-17 17:23       ` Waiman Long
2014-03-17 17:23       ` Waiman Long
2014-03-17 17:23         ` Waiman Long
2014-03-12 19:08   ` Waiman Long
2014-03-12 18:54 ` Waiman Long
2014-03-12 18:54 ` [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest Waiman Long
2014-03-13 10:54   ` David Vrabel
2014-03-13 10:54   ` David Vrabel
2014-03-13 10:54     ` David Vrabel
2014-03-13 13:16     ` Paolo Bonzini
2014-03-13 13:16     ` Paolo Bonzini
2014-03-13 13:16       ` Paolo Bonzini
2014-03-13 13:16       ` Paolo Bonzini
2014-03-17 19:05       ` Konrad Rzeszutek Wilk
2014-03-17 19:05       ` Konrad Rzeszutek Wilk
2014-03-17 19:05         ` Konrad Rzeszutek Wilk
2014-03-17 19:05         ` Konrad Rzeszutek Wilk
2014-03-18  8:14         ` Paolo Bonzini
2014-03-18  8:14         ` Paolo Bonzini
2014-03-18  8:14           ` Paolo Bonzini
2014-03-19  3:15           ` Waiman Long
2014-03-19  3:15           ` Waiman Long
2014-03-19  3:15           ` Waiman Long
2014-03-19  3:15             ` Waiman Long
2014-03-19  3:15             ` Waiman Long
2014-03-19 10:07             ` Paolo Bonzini
2014-03-19 16:58               ` Waiman Long
2014-03-19 16:58               ` Waiman Long
2014-03-19 16:58               ` Waiman Long
2014-03-19 17:08                 ` Paolo Bonzini
2014-03-19 17:08                 ` Paolo Bonzini
2014-03-19 17:08                   ` Paolo Bonzini
2014-03-19 10:07             ` Paolo Bonzini
2014-03-19 10:07             ` Paolo Bonzini
2014-03-13 19:03     ` Waiman Long
2014-03-13 19:03     ` Waiman Long
2014-03-13 19:03       ` Waiman Long
2014-03-13 15:15   ` Peter Zijlstra
2014-03-13 15:15     ` Peter Zijlstra
2014-03-13 20:05     ` Waiman Long
2014-03-13 20:05     ` Waiman Long
2014-03-13 20:05       ` Waiman Long
2014-03-14  8:30       ` Peter Zijlstra
2014-03-14  8:30         ` Peter Zijlstra
2014-03-14  8:48         ` Paolo Bonzini
2014-03-14  8:48         ` Paolo Bonzini
2014-03-14  8:48           ` Paolo Bonzini
2014-03-17 17:44         ` Waiman Long
2014-03-17 17:44         ` Waiman Long
2014-03-17 17:44           ` Waiman Long
2014-03-17 18:54           ` Peter Zijlstra
2014-03-17 18:54           ` Peter Zijlstra
2014-03-17 18:54             ` Peter Zijlstra
2014-03-18  8:16             ` Paolo Bonzini
2014-03-18  8:16               ` Paolo Bonzini
2014-03-18  8:16             ` Paolo Bonzini
2014-03-19  3:08             ` Waiman Long
2014-03-19  3:08             ` Waiman Long
2014-03-19  3:08               ` Waiman Long
2014-03-17 19:10           ` Konrad Rzeszutek Wilk
2014-03-17 19:10             ` Konrad Rzeszutek Wilk
2014-03-19  3:11             ` Waiman Long
2014-03-19  3:11             ` Waiman Long
2014-03-19  3:11               ` Waiman Long
2014-03-19 15:25               ` Konrad Rzeszutek Wilk
2014-03-19 15:25               ` Konrad Rzeszutek Wilk
2014-03-19 15:25                 ` Konrad Rzeszutek Wilk
2014-03-17 19:10           ` Konrad Rzeszutek Wilk
2014-03-14  8:30       ` Peter Zijlstra
2014-03-13 15:15   ` Peter Zijlstra
2014-03-12 18:54 ` Waiman Long
2014-03-12 18:54 ` [PATCH v6 06/11] pvqspinlock, x86: Allow unfair queue spinlock in a KVM guest Waiman Long
2014-03-12 18:54 ` Waiman Long
2014-03-12 18:54 ` [PATCH v6 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest Waiman Long
2014-03-12 18:54 ` Waiman Long
2014-03-12 18:54 ` [PATCH v6 08/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled Waiman Long
2014-03-12 18:54 ` Waiman Long
2014-03-12 18:54 ` [PATCH RFC v6 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support Waiman Long
2014-03-12 18:54 ` Waiman Long
2014-03-13 11:21   ` David Vrabel
2014-03-13 11:21     ` David Vrabel
2014-03-13 13:57     ` Paolo Bonzini
2014-03-13 13:57     ` Paolo Bonzini
2014-03-13 13:57       ` Paolo Bonzini
2014-03-13 13:57       ` Paolo Bonzini
2014-03-13 19:49       ` Waiman Long
2014-03-13 19:49       ` Waiman Long
2014-03-13 19:49       ` Waiman Long
2014-03-13 19:49         ` Waiman Long
2014-03-13 19:49         ` Waiman Long
2014-03-14  9:44         ` Paolo Bonzini
2014-03-14  9:44         ` Paolo Bonzini
2014-03-14  9:44         ` Paolo Bonzini
2014-03-13 13:57     ` Paolo Bonzini
2014-03-13 19:05     ` Waiman Long
2014-03-13 19:05       ` Waiman Long
2014-03-13 19:05     ` Waiman Long
2014-03-13 11:21   ` David Vrabel
2014-03-12 18:54 ` [PATCH RFC v6 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM Waiman Long
2014-03-12 18:54 ` Waiman Long
2014-03-13 13:59   ` Paolo Bonzini
2014-03-13 13:59     ` Paolo Bonzini
2014-03-13 19:13     ` Waiman Long
2014-03-13 19:13     ` Waiman Long
2014-03-14  8:42       ` Paolo Bonzini
2014-03-14  8:42       ` Paolo Bonzini
2014-03-14  8:42         ` Paolo Bonzini
2014-03-17 17:47         ` Waiman Long
2014-03-17 17:47         ` Waiman Long
2014-03-17 17:47           ` Waiman Long
2014-03-18  8:18           ` Paolo Bonzini
2014-03-18  8:18             ` Paolo Bonzini
2014-03-18  8:18           ` Paolo Bonzini
2014-03-13 13:59   ` Paolo Bonzini
2014-03-13 15:25   ` Peter Zijlstra
2014-03-13 15:25   ` Peter Zijlstra
2014-03-13 15:25     ` Peter Zijlstra
2014-03-13 20:09     ` Waiman Long
2014-03-13 20:09     ` Waiman Long
2014-03-13 20:09       ` Waiman Long
2014-03-12 18:54 ` [PATCH RFC v6 11/11] pvqspinlock, x86: Enable qspinlock PV support for XEN Waiman Long
2014-03-12 18:54 ` Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.