* [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support
@ 2014-02-26 15:14 Waiman Long
  2014-02-26 15:14 ` [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
                   ` (19 more replies)
  0 siblings, 20 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-26 15:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Daniel J Blueman, Rusty Russell, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	Linus Torvalds, linux-ke

v4->v5:
 - Move the optimized 2-task contending code to the generic file to
   enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized
   execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring
   that the lock holder and queue head stay alive as much as possible.

v3->v4:
 - Remove debugging code and fix a configuration error
 - Simplify the qspinlock structure and streamline the code to make it
   perform a bit better
 - Add an x86 version of asm/qspinlock.h for holding x86 specific
   optimization.
 - Add an optimized x86 code path for 2 contending tasks to improve
   low contention performance.

v2->v3:
 - Simplify the code by using the numerous CPU mode only, without an
   unfair option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user
   configuration.
 - Additional performance tuning.

v1->v2:
 - Add some more comments to document what the code does.
 - Add a numerous CPU mode to support >= 16K CPUs
 - Add a configuration option to allow lock stealing which can further
   improve performance in many cases.
 - Enable wakeup of queue head CPU at unlock time for non-numerous
   CPU mode.

This patch set has 3 different sections:
 1) Patches 1-3: Introduce a queue-based spinlock implementation that
    can replace the default ticket spinlock without increasing the
    size of the spinlock data structure. As a result, critical kernel
    data structures that embed a spinlock won't grow in size or break
    their data alignment.
 2) Patches 4 and 5: Enable the use of unfair queue spinlocks in a
    real para-virtualized execution environment. This can resolve
    some of the locking-related performance issues caused by the
    next CPU in line for the lock having been scheduled out for a
    period of time.
 3) Patches 6-8: Enable qspinlock para-virtualization support by making
    sure that the lock holder and the queue head stay alive as long as
    possible.

Patches 1-3 are fully tested and ready for production. Patches 4-8,
on the other hand, are not fully tested. They have undergone
compilation tests with various combinations of kernel config settings
and boot-up tests in a non-virtualized setting. Further testing and
performance characterization still need to be done in a KVM guest, so
comments on them are welcome. Suggestions or recommendations on how
to add PV support in the Xen environment are also needed.

The queue spinlock has slightly better performance than the ticket
spinlock in the uncontended case, and it can perform much better under
moderate to heavy contention. This patch set therefore has the
potential to improve the performance of any workload with moderate to
heavy spinlock contention.

The queue spinlock is especially suitable for NUMA machines with at
least 2 sockets, though a noticeable performance benefit probably
won't show up on machines with fewer than 4 sockets.

The purpose of this patch set is not to solve any particular spinlock
contention problem. Those need to be solved by refactoring the code to
make more efficient use of the lock or to switch to finer-grained
locks. The main purpose is to make lock contention problems more
tolerable until someone can spend the time and effort to fix them.

Waiman Long (8):
  qspinlock: Introducing a 4-byte queue spinlock implementation
  qspinlock, x86: Enable x86-64 to use queue spinlock
  qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest
  pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
  pvqspinlock, x86: Add qspinlock para-virtualization support
  pvqspinlock, x86: Enable KVM to use qspinlock's PV support

 arch/x86/Kconfig                      |   12 +
 arch/x86/include/asm/paravirt.h       |    9 +-
 arch/x86/include/asm/paravirt_types.h |   12 +
 arch/x86/include/asm/pvqspinlock.h    |  176 ++++++++++
 arch/x86/include/asm/qspinlock.h      |  133 +++++++
 arch/x86/include/asm/spinlock.h       |    9 +-
 arch/x86/include/asm/spinlock_types.h |    4 +
 arch/x86/kernel/Makefile              |    1 +
 arch/x86/kernel/kvm.c                 |   73 ++++-
 arch/x86/kernel/paravirt-spinlocks.c  |   15 +-
 arch/x86/xen/spinlock.c               |    2 +-
 include/asm-generic/qspinlock.h       |  122 +++++++
 include/asm-generic/qspinlock_types.h |   61 ++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  610 +++++++++++++++++++++++++++++++++
 16 files changed, 1239 insertions(+), 8 deletions(-)
 create mode 100644 arch/x86/include/asm/pvqspinlock.h
 create mode 100644 arch/x86/include/asm/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c


* [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
@ 2014-02-26 15:14 ` Waiman Long
  2014-02-26 16:22   ` Peter Zijlstra
                     ` (3 more replies)
  2014-02-26 15:14 ` Waiman Long
                   ` (18 subsequent siblings)
  19 siblings, 4 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-26 15:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Daniel J Blueman, Rusty Russell, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	Linus Torvalds, linux-ke

This patch introduces a new queue spinlock implementation that can
serve as an alternative to the default ticket spinlock. The queue
spinlock should be almost as fair as the ticket spinlock, has about
the same speed in the single-threaded case, and can be much faster in
high-contention situations. Only under light to moderate contention,
where the average queue depth is around 1-3, may this queue spinlock
be a bit slower due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a
large number of cores, as the chance of spinlock contention is much
higher on those machines. The cost of contention is also higher
because of slower inter-node memory traffic.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, a process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
on one spinlock at any one time. Of course, an interrupt handler can
try to acquire one spinlock while the interrupted user process is in
the process of getting another spinlock. By allocating a set of per-cpu
queue nodes and using them to form a waiting queue, we can encode each
queue node address into a much smaller 16- or 24-bit queue code.
Together with the 1-byte lock byte, this queue spinlock implementation
only needs 4 bytes to hold all the information that it needs.

The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7  : the locked byte
Bits 8-9  : queue node index in the per-cpu array (4 entries)
Bits 10-31: CPU number + 1 (max CPUs = 4M - 1)
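
As a rough, illustrative sketch of this layout, the following shows how
such a queue code could be packed and unpacked (the macro and helper
names here are made up for the example; the patch's own helpers are
queue_encode_qcode() and xlate_qcode() below):

  /* Illustrative only: pack/unpack a queue code per the layout above. */
  #define EX_IDX_SHIFT    8     /* bits 8-9  : per-cpu queue node index */
  #define EX_CPU_SHIFT    10    /* bits 10-31: CPU number + 1 */

  static inline u32 ex_encode_qcode(u32 cpu_nr, u32 qn_idx)
  {
          /* bits 0-7 (the locked byte) are owned by the lock word itself */
          return ((cpu_nr + 1) << EX_CPU_SHIFT) | (qn_idx << EX_IDX_SHIFT);
  }

  static inline void ex_decode_qcode(u32 qcode, u32 *cpu_nr, u32 *qn_idx)
  {
          *cpu_nr = (qcode >> EX_CPU_SHIFT) - 1;  /* undo the +1 encoding */
          *qn_idx = (qcode >> EX_IDX_SHIFT) & 3;  /* 2-bit node index */
  }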

In the extremely unlikely case that all the queue node entries are
used up, the current code will fall back to busy spinning without
waiting in a queue, and a warning message will be printed.

For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4GHz Westmere x86-64 CPU. The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

  Lock Type			Time (ns)
  ---------			---------
  Ticket spinlock		  14.1
  Queue spinlock		   8.8

So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly due
to the fact that the ticket spinlock does a read-modify-write (add)
instruction in its unlock path whereas the queue spinlock only does a
simple write, which can be much faster on a pipelined CPU.
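
As a rough sketch of the difference between the two unlock paths on
x86-64 (illustrative only; the byte-store unlock shown here corresponds
to the x86-specific optimization added later in this series, and the
exact kernel code may differ):

  /* Illustrative sketch, not the exact kernel source. */

  /* Ticket unlock (unpatched x86-64 kernel): a read-modify-write add */
  static inline void ticket_unlock_sketch(arch_spinlock_t *lock)
  {
          __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
  }

  /* Queue spinlock unlock (x86-optimized variant): a plain byte store */
  static inline void queue_unlock_sketch(struct qspinlock *lock)
  {
          /*
           * x86 stores already have release semantics, so a compiler
           * barrier before clearing the lock byte is sufficient.
           */
          barrier();
          ACCESS_ONCE(*(u8 *)lock) = 0;
  }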

The AIM7 benchmark was run on an 8-socket, 80-core DL980 with Westmere
x86-64 CPUs, using an XFS filesystem on a ramdisk and with HT off, to
evaluate the performance impact of this patch on a 3.13 kernel.

  +------------+----------+-----------------+---------+
  | Kernel     | 3.13 JPM |    3.13 with    | %Change |
  |            |          | qspinlock patch |	      |
  +------------+----------+-----------------+---------+
  |		      10-100 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   357459 |      363109     |  +1.58% |
  |dbase       |   496847 |      498801	    |  +0.39% |
  |disk        |  2925312 |     2771387     |  -5.26% |
  |five_sec    |   166612 |      169215     |  +1.56% |
  |fserver     |   382129 |      383279     |  +0.30% |
  |high_systime|    16356 |       16380     |  +0.15% |
  |short       |  4521978 |     4257363     |  -5.85% |
  +------------+----------+-----------------+---------+
  |		     200-1000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   449070 |      447711     |  -0.30% |
  |dbase       |   845029 |      853362	    |  +0.99% |
  |disk        |  2725249 |     4892907     | +79.54% |
  |five_sec    |   169410 |      170638     |  +0.72% |
  |fserver     |   489662 |      491828     |  +0.44% |
  |high_systime|   142823 |      143790     |  +0.68% |
  |short       |  7435288 |     9016171     | +21.26% |
  +------------+----------+-----------------+---------+
  |		     1100-2000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   432470 |      432570     |  +0.02% |
  |dbase       |   889289 |      890026	    |  +0.08% |
  |disk        |  2565138 |     5008732     | +95.26% |
  |five_sec    |   169141 |      170034     |  +0.53% |
  |fserver     |   498569 |      500701     |  +0.43% |
  |high_systime|   229913 |      245866     |  +6.94% |
  |short       |  8496794 |     8281918     |  -2.53% |
  +------------+----------+-----------------+---------+

The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:

 26.19%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--47.28%-- evict
              |--46.87%-- inode_sb_list_add
              |--1.24%-- xlog_cil_insert_items
              |--0.68%-- __remove_inode_hash
              |--0.67%-- inode_wait_for_writeback
               --3.26%-- [...]
 22.96%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  5.56%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  4.87%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.04%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.30%    reaim  [kernel.kallsyms]  [k] memcpy
  1.08%    reaim  [unknown]          [.] 0x0000003c52009447

There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.

With the patch, the perf profile at 1500 users became:

 26.82%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  4.66%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  3.97%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.40%    reaim  [kernel.kallsyms]  [k] queue_spin_lock_slowpath
              |--88.31%-- _raw_spin_lock
              |          |--36.02%-- inode_sb_list_add
              |          |--35.09%-- evict
              |          |--16.89%-- xlog_cil_insert_items
              |          |--6.30%-- try_to_wake_up
              |          |--2.20%-- _xfs_buf_find
              |          |--0.75%-- __remove_inode_hash
              |          |--0.72%-- __mutex_lock_slowpath
              |          |--0.53%-- load_balance
              |--6.02%-- _raw_spin_lock_irqsave
              |          |--74.75%-- down_trylock
              |          |--9.69%-- rcu_check_quiescent_state
              |          |--7.47%-- down
              |          |--3.57%-- up
              |          |--1.67%-- rwsem_wake
              |          |--1.00%-- remove_wait_queue
              |          |--0.56%-- pagevec_lru_move_fn
              |--5.39%-- _raw_spin_lock_irq
              |          |--82.05%-- rwsem_down_read_failed
              |          |--10.48%-- rwsem_down_write_failed
              |          |--4.24%-- __down
              |          |--2.74%-- __schedule
               --0.28%-- [...]
  2.20%    reaim  [kernel.kallsyms]  [k] memcpy
  1.84%    reaim  [unknown]          [.] 0x000000000041517b
  1.77%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--21.08%-- xlog_cil_insert_items
              |--10.14%-- xfs_icsb_modify_counters
              |--7.20%-- xfs_iget_cache_hit
              |--6.56%-- inode_sb_list_add
              |--5.49%-- _xfs_buf_find
              |--5.25%-- evict
              |--5.03%-- __remove_inode_hash
              |--4.64%-- __mutex_lock_slowpath
              |--3.78%-- selinux_inode_free_security
              |--2.95%-- xfs_inode_is_filestream
              |--2.35%-- try_to_wake_up
              |--2.07%-- xfs_inode_set_reclaim_tag
              |--1.52%-- list_lru_add
              |--1.16%-- xfs_inode_clear_eofblocks_tag
		  :
  1.30%    reaim  [kernel.kallsyms]  [k] effective_load
  1.27%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.10%    reaim  [kernel.kallsyms]  [k] security_compute_sid

On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/asm-generic/qspinlock.h       |  122 ++++++++++
 include/asm-generic/qspinlock_types.h |   55 +++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  393 +++++++++++++++++++++++++++++++++
 5 files changed, 578 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..08da60f
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & _QSPINLOCK_LOCKED;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !(atomic_read(&lock.qlcode) & _QSPINLOCK_LOCKED);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & ~_QSPINLOCK_LOCK_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->qlcode) &&
+	   (atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	int qsval;
+
+	/*
+	 * To reduce memory access to only once for the cold cache case,
+	 * a direct cmpxchg() is performed in the fastpath to optimize the
+	 * uncontended case. The contended performance, however, may suffer
+	 * a bit because of that.
+	 */
+	qsval = atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED);
+	if (likely(qsval == 0))
+		return;
+	queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	/*
+	 * Use an atomic subtraction to clear the lock bit.
+	 */
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QSPINLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializer
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queue_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queue_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queue_spin_lock(l)
+#define arch_spin_trylock(l)		queue_spin_trylock(l)
+#define arch_spin_unlock(l)		queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)	queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..df981d0
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,55 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file inclusion via paravirt_types.h. A workaround is
+ * to include paravirt_types.h here in this case.
+ */
+#ifdef CONFIG_PARAVIRT
+# include <asm/paravirt_types.h>
+#else
+# include <linux/types.h>
+# include <linux/atomic.h>
+#endif
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * For NR_CPUS >= 16K, the bit assignment is:
+ *   Bit  0   : Set if locked
+ *   Bits 1-7 : Not used
+ *   Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bit assignment is:
+ *   Bit   0   : Set if locked
+ *   Bits  1-7 : Not used
+ *   Bits  8-15: Reserved for architecture specific optimization
+ *   Bits 16-31: Queue code
+ */
+typedef struct qspinlock {
+	atomic_t	qlcode;	/* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET		8
+#define _QSPINLOCK_LOCKED	1U
+#define	_QSPINLOCK_LOCK_MASK	0xff
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..f185584 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+	bool
+
+config QUEUE_SPINLOCK
+	def_bool y if ARCH_USE_QUEUE_SPINLOCK
+	depends on SMP && !PARAVIRT_SPINLOCKS
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..e3b3293 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
 obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
 obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..ed5efa7
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,393 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
+ *
+ * The 16-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-15: CPU number + 1   (16K - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 therefore cannot be used as a valid CPU number, we need
+ * to add 1 to the CPU number before putting it into the queue code.
+ */
+#define MAX_QNODES		4
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET	_QCODE_OFFSET
+#endif
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. This structure is retained for future extension
+ * where new fields may be added.
+ */
+struct qnode {
+	u32		 wait;		/* Waiting flag		*/
+	struct qnode	*next;		/* Next queue node addr */
+};
+
+struct qnode_set {
+	struct qnode	nodes[MAX_QNODES];
+	int		node_idx;	/* Current node to use */
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { {{0}}, 0 };
+
+/*
+ ************************************************************************
+ * The following optimized codes are for architectures that support:	*
+ *  1) Atomic byte and short data write					*
+ *  2) Byte and short data exchange and compare-exchange instructions	*
+ *									*
+ * For those architectures, their asm/qspinlock.h header file should	*
+ * define the followings in order to use the optimized codes.		*
+ *  1) The _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS macro			*
+ *  2) A smp_u8_store_release() macro for byte size store operation	*
+ *  3) A "union arch_qspinlock" structure that include the individual	*
+ *     fields of the qspinlock structure, including:			*
+ *      o slock - the qspinlock structure				*
+ *      o lock  - the lock byte						*
+ *									*
+ ************************************************************************
+ */
+#ifdef _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
+/**
+ * queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	if (!ACCESS_ONCE(qlock->lock) &&
+	   (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+#else /*  _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS  */
+/*
+ * Generic functions for architectures that do not support atomic
+ * byte or short data types.
+ */
+/**
+ * queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+			return 1;
+	return 0;
+}
+#endif /* _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS */
+
+/*
+ ************************************************************************
+ * Inline functions used by the queue_spin_lock_slowpath() function	*
+ * that may get superseded by a more optimized version.			*
+ ************************************************************************
+ */
+
+#ifndef queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value (not used)
+ * Return : > 0 if lock is not available, = 0 if lock is free
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	*qcode = qlcode;
+	return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* queue_get_lock_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+#ifndef queue_encode_qcode
+/**
+ * queue_encode_qcode - Encode the CPU number & node index into a qnode code
+ * @cpu_nr: CPU number
+ * @qn_idx: Queue node index
+ * Return : A qnode code that can be saved into the qspinlock structure
+ *
+ * The lock bit is set in the encoded 32-bit value as the need to encode
+ * a qnode means that the lock should have been taken.
+ */
+static u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx)
+{
+	return ((cpu_nr + 1) << (_QCODE_VAL_OFFSET + 2)) |
+		(qn_idx << _QCODE_VAL_OFFSET) | _QSPINLOCK_LOCKED;
+}
+#endif /* queue_encode_qcode */
+
+/*
+ ************************************************************************
+ * Other inline functions needed by the queue_spin_lock_slowpath()	*
+ * function.								*
+ ************************************************************************
+ */
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+	u32 cpu_nr = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+	u8  qn_idx = (qcode >> _QCODE_VAL_OFFSET) & 3;
+
+	return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+/**
+ * get_qnode - Get a queue node address
+ * @qn_idx: Pointer to queue node index [out]
+ * Return : queue node address & queue node index in qn_idx, or NULL if
+ *	    no free queue node available.
+ */
+static struct qnode *get_qnode(unsigned int *qn_idx)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+	int i;
+
+	if (unlikely(qset->node_idx >= MAX_QNODES))
+		return NULL;
+	i = qset->node_idx++;
+	*qn_idx = i;
+	return &qset->nodes[i];
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static void put_qnode(void)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+
+	qset->node_idx--;
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+	unsigned int cpu_nr, qn_idx;
+	struct qnode *node, *next;
+	u32 prev_qcode, my_qcode;
+
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = get_qnode(&qn_idx);
+
+	/*
+	 * It should never happen that all the queue nodes are being used.
+	 */
+	BUG_ON(!node);
+
+	/*
+	 * Set up the new cpu code to be exchanged
+	 */
+	my_qcode = queue_encode_qcode(cpu_nr, qn_idx);
+
+	/*
+	 * Initialize the queue node
+	 */
+	node->wait = true;
+	node->next = NULL;
+
+	/*
+	 * The lock may be available at this point, try again if no task was
+	 * waiting in the queue.
+	 */
+	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
+		put_qnode();
+		return;
+	}
+
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
+	/*
+	 * It is possible that we may accidentally steal the lock. If this is
+	 * the case, we need to either release it if not the head of the queue
+	 * or get the lock and be done with it.
+	 */
+	if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
+		if (prev_qcode == 0) {
+			/*
+			 * Got the lock since it is at the head of the queue
+			 * Now try to atomically clear the queue code.
+			 */
+			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
+					  _QSPINLOCK_LOCKED) == my_qcode)
+				goto release_node;
+			/*
+			 * The cmpxchg fails only if one or more tasks
+			 * are added to the queue. In this case, we need to
+			 * notify the next one to be the head of the queue.
+			 */
+			goto notify_next;
+		}
+		/*
+		 * We accidentally stole the lock. Release it and let
+		 * the queue head take it.
+		 */
+		queue_spin_unlock(lock);
+	} else
+		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
+	my_qcode &= ~_QSPINLOCK_LOCKED;
+
+	if (prev_qcode) {
+		/*
+		 * Not at the queue head, get the address of the previous node
+		 * and set up the "next" field of that node.
+		 */
+		struct qnode *prev = xlate_qcode(prev_qcode);
+
+		ACCESS_ONCE(prev->next) = node;
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (smp_load_acquire(&node->wait))
+			arch_mutex_cpu_relax();
+	}
+
+	/*
+	 * At the head of the wait queue now
+	 */
+	while (true) {
+		u32 qcode;
+		int retval;
+
+		retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
+		if (retval > 0)
+			;	/* Lock not available yet */
+		else if (retval < 0)
+			/* Lock taken, can release the node & return */
+			goto release_node;
+		else if (qcode != my_qcode) {
+			/*
+			 * Just get the lock with other spinners waiting
+			 * in the queue.
+			 */
+			if (queue_spin_setlock(lock))
+				goto notify_next;
+		} else {
+			/*
+			 * Get the lock & clear the queue code simultaneously
+			 */
+			if (queue_spin_trylock_and_clr_qcode(lock, qcode))
+				/* No need to notify the next one */
+				goto release_node;
+		}
+		arch_mutex_cpu_relax();
+	}
+
+notify_next:
+	/*
+	 * Wait, if needed, until the next one in the queue sets up the next field
+	 */
+	while (!(next = ACCESS_ONCE(node->next)))
+		arch_mutex_cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+	smp_store_release(&next->wait, false);
+
+release_node:
+	put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
-- 
1.7.1


* [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
  2014-02-26 15:14 ` [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
@ 2014-02-26 15:14 ` Waiman Long
  2014-02-26 15:14 ` [PATCH v5 2/8] qspinlock, x86: Enable x86-64 to use queue spinlock Waiman Long
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-26 15:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Rik van Riel, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Alok Kataria, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long

This patch introduces a new queue spinlock implementation that can
serve as an alternative to the default ticket spinlock. The queue
spinlock should be almost as fair as the ticket spinlock, has about
the same speed in the single-threaded case, and can be much faster in
high-contention situations. Only under light to moderate contention,
where the average queue depth is around 1-3, may this queue spinlock
be a bit slower due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a
large number of cores, as the chance of spinlock contention is much
higher on those machines. The cost of contention is also higher
because of slower inter-node memory traffic.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, a process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
on one spinlock at any one time. Of course, an interrupt handler can
try to acquire one spinlock while the interrupted user process is in
the process of getting another spinlock. By allocating a set of per-cpu
queue nodes and using them to form a waiting queue, we can encode each
queue node address into a much smaller 16- or 24-bit queue code.
Together with the 1-byte lock byte, this queue spinlock implementation
only needs 4 bytes to hold all the information that it needs.

The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7  : the locked byte
Bits 8-9  : queue node index in the per-cpu array (4 entries)
Bits 10-31: CPU number + 1 (max CPUs = 4M - 1)

In the extremely unlikely case that all the queue node entries are
used up, the current code will fall back to busy spinning without
waiting in a queue, and a warning message will be printed.

For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4GHz Westmere x86-64 CPU. The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

  Lock Type			Time (ns)
  ---------			---------
  Ticket spinlock		  14.1
  Queue spinlock		   8.8

So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly due
to the fact that the ticket spinlock does a read-modify-write (add)
instruction in its unlock path whereas the queue spinlock only does a
simple write, which can be much faster on a pipelined CPU.

The AIM7 benchmark was run on an 8-socket, 80-core DL980 with Westmere
x86-64 CPUs, using an XFS filesystem on a ramdisk and with HT off, to
evaluate the performance impact of this patch on a 3.13 kernel.

  +------------+----------+-----------------+---------+
  | Kernel     | 3.13 JPM |    3.13 with    | %Change |
  |            |          | qspinlock patch |	      |
  +------------+----------+-----------------+---------+
  |		      10-100 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   357459 |      363109     |  +1.58% |
  |dbase       |   496847 |      498801	    |  +0.39% |
  |disk        |  2925312 |     2771387     |  -5.26% |
  |five_sec    |   166612 |      169215     |  +1.56% |
  |fserver     |   382129 |      383279     |  +0.30% |
  |high_systime|    16356 |       16380     |  +0.15% |
  |short       |  4521978 |     4257363     |  -5.85% |
  +------------+----------+-----------------+---------+
  |		     200-1000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   449070 |      447711     |  -0.30% |
  |dbase       |   845029 |      853362	    |  +0.99% |
  |disk        |  2725249 |     4892907     | +79.54% |
  |five_sec    |   169410 |      170638     |  +0.72% |
  |fserver     |   489662 |      491828     |  +0.44% |
  |high_systime|   142823 |      143790     |  +0.68% |
  |short       |  7435288 |     9016171     | +21.26% |
  +------------+----------+-----------------+---------+
  |		     1100-2000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   432470 |      432570     |  +0.02% |
  |dbase       |   889289 |      890026	    |  +0.08% |
  |disk        |  2565138 |     5008732     | +95.26% |
  |five_sec    |   169141 |      170034     |  +0.53% |
  |fserver     |   498569 |      500701     |  +0.43% |
  |high_systime|   229913 |      245866     |  +6.94% |
  |short       |  8496794 |     8281918     |  -2.53% |
  +------------+----------+-----------------+---------+

The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:

 26.19%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--47.28%-- evict
              |--46.87%-- inode_sb_list_add
              |--1.24%-- xlog_cil_insert_items
              |--0.68%-- __remove_inode_hash
              |--0.67%-- inode_wait_for_writeback
               --3.26%-- [...]
 22.96%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  5.56%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  4.87%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.04%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.30%    reaim  [kernel.kallsyms]  [k] memcpy
  1.08%    reaim  [unknown]          [.] 0x0000003c52009447

There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.

With the patch, the perf profile at 1500 users became:

 26.82%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  4.66%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  3.97%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.40%    reaim  [kernel.kallsyms]  [k] queue_spin_lock_slowpath
              |--88.31%-- _raw_spin_lock
              |          |--36.02%-- inode_sb_list_add
              |          |--35.09%-- evict
              |          |--16.89%-- xlog_cil_insert_items
              |          |--6.30%-- try_to_wake_up
              |          |--2.20%-- _xfs_buf_find
              |          |--0.75%-- __remove_inode_hash
              |          |--0.72%-- __mutex_lock_slowpath
              |          |--0.53%-- load_balance
              |--6.02%-- _raw_spin_lock_irqsave
              |          |--74.75%-- down_trylock
              |          |--9.69%-- rcu_check_quiescent_state
              |          |--7.47%-- down
              |          |--3.57%-- up
              |          |--1.67%-- rwsem_wake
              |          |--1.00%-- remove_wait_queue
              |          |--0.56%-- pagevec_lru_move_fn
              |--5.39%-- _raw_spin_lock_irq
              |          |--82.05%-- rwsem_down_read_failed
              |          |--10.48%-- rwsem_down_write_failed
              |          |--4.24%-- __down
              |          |--2.74%-- __schedule
               --0.28%-- [...]
  2.20%    reaim  [kernel.kallsyms]  [k] memcpy
  1.84%    reaim  [unknown]          [.] 0x000000000041517b
  1.77%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--21.08%-- xlog_cil_insert_items
              |--10.14%-- xfs_icsb_modify_counters
              |--7.20%-- xfs_iget_cache_hit
              |--6.56%-- inode_sb_list_add
              |--5.49%-- _xfs_buf_find
              |--5.25%-- evict
              |--5.03%-- __remove_inode_hash
              |--4.64%-- __mutex_lock_slowpath
              |--3.78%-- selinux_inode_free_security
              |--2.95%-- xfs_inode_is_filestream
              |--2.35%-- try_to_wake_up
              |--2.07%-- xfs_inode_set_reclaim_tag
              |--1.52%-- list_lru_add
              |--1.16%-- xfs_inode_clear_eofblocks_tag
		  :
  1.30%    reaim  [kernel.kallsyms]  [k] effective_load
  1.27%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.10%    reaim  [kernel.kallsyms]  [k] security_compute_sid

On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/asm-generic/qspinlock.h       |  122 ++++++++++
 include/asm-generic/qspinlock_types.h |   55 +++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  393 +++++++++++++++++++++++++++++++++
 5 files changed, 578 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..08da60f
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & _QSPINLOCK_LOCKED;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !(atomic_read(&lock.qlcode) & _QSPINLOCK_LOCKED);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & ~_QSPINLOCK_LOCK_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->qlcode) &&
+	   (atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	int qsval;
+
+	/*
+	 * To reduce memory access to only once for the cold cache case,
+	 * a direct cmpxchg() is performed in the fastpath to optimize the
+	 * uncontended case. The contended performance, however, may suffer
+	 * a bit because of that.
+	 */
+	qsval = atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED);
+	if (likely(qsval == 0))
+		return;
+	queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	/*
+	 * Use an atomic subtraction to clear the lock bit.
+	 */
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QSPINLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializer
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queue_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queue_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queue_spin_lock(l)
+#define arch_spin_trylock(l)		queue_spin_trylock(l)
+#define arch_spin_unlock(l)		queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)	queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..df981d0
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,55 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file inclusion via paravirt_types.h. A workaround is
+ * to include paravirt_types.h here in this case.
+ */
+#ifdef CONFIG_PARAVIRT
+# include <asm/paravirt_types.h>
+#else
+# include <linux/types.h>
+# include <linux/atomic.h>
+#endif
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * For NR_CPUS >= 16K, the bit assignment is:
+ *   Bit  0   : Set if locked
+ *   Bits 1-7 : Not used
+ *   Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bit assignment is:
+ *   Bit   0   : Set if locked
+ *   Bits  1-7 : Not used
+ *   Bits  8-15: Reserved for architecture specific optimization
+ *   Bits 16-31: Queue code
+ */
+typedef struct qspinlock {
+	atomic_t	qlcode;	/* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET		8
+#define _QSPINLOCK_LOCKED	1U
+#define	_QSPINLOCK_LOCK_MASK	0xff
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..f185584 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+	bool
+
+config QUEUE_SPINLOCK
+	def_bool y if ARCH_USE_QUEUE_SPINLOCK
+	depends on SMP && !PARAVIRT_SPINLOCKS
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..e3b3293 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
 obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
 obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..ed5efa7
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,393 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
+ *
+ * The 16-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-15: CPU number + 1   (16K - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 therefore cannot be used as a valid CPU number, we need
+ * to add 1 to the CPU number before putting it into the queue code.
+ */
+#define MAX_QNODES		4
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET	_QCODE_OFFSET
+#endif
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. This structure is retained for future extension
+ * where new fields may be added.
+ */
+struct qnode {
+	u32		 wait;		/* Waiting flag		*/
+	struct qnode	*next;		/* Next queue node addr */
+};
+
+struct qnode_set {
+	struct qnode	nodes[MAX_QNODES];
+	int		node_idx;	/* Current node to use */
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { {{0}}, 0 };
+
+/*
+ ************************************************************************
+ * The following optimized codes are for architectures that support:	*
+ *  1) Atomic byte and short data write					*
+ *  2) Byte and short data exchange and compare-exchange instructions	*
+ *									*
+ * For those architectures, their asm/qspinlock.h header file should	*
+ * define the followings in order to use the optimized codes.		*
+ *  1) The _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS macro			*
+ *  2) A smp_u8_store_release() macro for byte size store operation	*
+ *  3) A "union arch_qspinlock" structure that include the individual	*
+ *     fields of the qspinlock structure, including:			*
+ *      o slock - the qspinlock structure				*
+ *      o lock  - the lock byte						*
+ *									*
+ ************************************************************************
+ */
+#ifdef _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
+/**
+ * queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	if (!ACCESS_ONCE(qlock->lock) &&
+	   (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+#else /*  _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS  */
+/*
+ * Generic functions for architectures that do not support atomic
+ * byte or short data types.
+ */
+/**
+ * queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+			return 1;
+	return 0;
+}
+#endif /* _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS */
+
+/*
+ ************************************************************************
+ * Inline functions used by the queue_spin_lock_slowpath() function	*
+ * that may get superseded by a more optimized version.			*
+ ************************************************************************
+ */
+
+#ifndef queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value (not used)
+ * Return : > 0 if lock is not available, = 0 if lock is free
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	*qcode = qlcode;
+	return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* queue_get_lock_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+#ifndef queue_encode_qcode
+/**
+ * queue_encode_qcode - Encode the CPU number & node index into a qnode code
+ * @cpu_nr: CPU number
+ * @qn_idx: Queue node index
+ * Return : A qnode code that can be saved into the qspinlock structure
+ *
+ * The lock bit is set in the encoded 32-bit value because the need to
+ * encode a qnode implies that the lock has already been taken.
+ */
+static u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx)
+{
+	return ((cpu_nr + 1) << (_QCODE_VAL_OFFSET + 2)) |
+		(qn_idx << _QCODE_VAL_OFFSET) | _QSPINLOCK_LOCKED;
+}
+#endif /* queue_encode_qcode */
+
+/*
+ ************************************************************************
+ * Other inline functions needed by the queue_spin_lock_slowpath()	*
+ * function.								*
+ ************************************************************************
+ */
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+	u32 cpu_nr = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+	u8  qn_idx = (qcode >> _QCODE_VAL_OFFSET) & 3;
+
+	return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+/**
+ * get_qnode - Get a queue node address
+ * @qn_idx: Pointer to queue node index [out]
+ * Return : queue node address & queue node index in qn_idx, or NULL if
+ *	    no free queue node available.
+ */
+static struct qnode *get_qnode(unsigned int *qn_idx)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+	int i;
+
+	if (unlikely(qset->node_idx >= MAX_QNODES))
+		return NULL;
+	i = qset->node_idx++;
+	*qn_idx = i;
+	return &qset->nodes[i];
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static void put_qnode(void)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+
+	qset->node_idx--;
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+	unsigned int cpu_nr, qn_idx;
+	struct qnode *node, *next;
+	u32 prev_qcode, my_qcode;
+
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = get_qnode(&qn_idx);
+
+	/*
+	 * It should never happen that all the queue nodes are being used.
+	 */
+	BUG_ON(!node);
+
+	/*
+	 * Set up the new cpu code to be exchanged
+	 */
+	my_qcode = queue_encode_qcode(cpu_nr, qn_idx);
+
+	/*
+	 * Initialize the queue node
+	 */
+	node->wait = true;
+	node->next = NULL;
+
+	/*
+	 * The lock may be available at this point, try again if no task was
+	 * waiting in the queue.
+	 */
+	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
+		put_qnode();
+		return;
+	}
+
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
+	/*
+	 * It is possible that we may accidentally steal the lock. If this is
+	 * the case, we need to either release it if not the head of the queue
+	 * or get the lock and be done with it.
+	 */
+	if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
+		if (prev_qcode == 0) {
+			/*
+			 * Got the lock since it is at the head of the queue
+			 * Now try to atomically clear the queue code.
+			 */
+			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
+					  _QSPINLOCK_LOCKED) == my_qcode)
+				goto release_node;
+			/*
+			 * The cmpxchg fails only if one or more tasks
+			 * are added to the queue. In this case, we need to
+			 * notify the next one to be the head of the queue.
+			 */
+			goto notify_next;
+		}
+		/*
+		 * We accidentally stole the lock; release it and
+		 * let the queue head get it.
+		 */
+		queue_spin_unlock(lock);
+	} else
+		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
+	my_qcode &= ~_QSPINLOCK_LOCKED;
+
+	if (prev_qcode) {
+		/*
+		 * Not at the queue head, get the address of the previous node
+		 * and set up the "next" field of that node.
+		 */
+		struct qnode *prev = xlate_qcode(prev_qcode);
+
+		ACCESS_ONCE(prev->next) = node;
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (smp_load_acquire(&node->wait))
+			arch_mutex_cpu_relax();
+	}
+
+	/*
+	 * At the head of the wait queue now
+	 */
+	while (true) {
+		u32 qcode;
+		int retval;
+
+		retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
+		if (retval > 0)
+			;	/* Lock not available yet */
+		else if (retval < 0)
+			/* Lock taken, can release the node & return */
+			goto release_node;
+		else if (qcode != my_qcode) {
+			/*
+			 * Just get the lock with other spinners waiting
+			 * in the queue.
+			 */
+			if (queue_spin_setlock(lock))
+				goto notify_next;
+		} else {
+			/*
+			 * Get the lock & clear the queue code simultaneously
+			 */
+			if (queue_spin_trylock_and_clr_qcode(lock, qcode))
+				/* No need to notify the next one */
+				goto release_node;
+		}
+		arch_mutex_cpu_relax();
+	}
+
+notify_next:
+	/*
+	 * Wait, if needed, until the next one in the queue sets up the next field
+	 */
+	while (!(next = ACCESS_ONCE(node->next)))
+		arch_mutex_cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+	smp_store_release(&next->wait, false);
+
+release_node:
+	put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread
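
As an aside for readers tracing the qcode arithmetic in the patch above, the
stand-alone C sketch below mirrors the generic encode/decode helpers
(queue_encode_qcode() and xlate_qcode()) using the >= 16K-CPU layout, with the
CPU number biased by 1 so that a stored qcode of 0 can mean "no waiter". The
encode()/decode() names and the user-space setting are illustrative only.

#include <stdio.h>
#include <stdint.h>

#define _QCODE_VAL_OFFSET	8	/* generic (>= 16K CPUs) layout */
#define _QSPINLOCK_LOCKED	1U

/* Mirrors queue_encode_qcode(): (CPU number + 1) and the 2-bit node
 * index live above the lock byte; the lock bit stays set. */
static uint32_t encode(uint32_t cpu_nr, uint8_t qn_idx)
{
	return ((cpu_nr + 1) << (_QCODE_VAL_OFFSET + 2)) |
	       ((uint32_t)qn_idx << _QCODE_VAL_OFFSET) | _QSPINLOCK_LOCKED;
}

/* Mirrors the decode done in xlate_qcode(). */
static void decode(uint32_t qcode, uint32_t *cpu_nr, uint8_t *qn_idx)
{
	*cpu_nr = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
	*qn_idx = (qcode >> _QCODE_VAL_OFFSET) & 3;
}

int main(void)
{
	uint32_t qcode = encode(5, 2);	/* CPU 5, nesting level 2 */
	uint32_t cpu;
	uint8_t idx;

	decode(qcode, &cpu, &idx);
	printf("qcode=0x%x -> cpu=%u idx=%u\n", qcode, cpu, idx);
	return 0;
}

For CPU 5 at nesting level 2 this prints qcode=0x1a01 -> cpu=5 idx=2, and a
value of 0 is never produced because of the +1 bias on the CPU number.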

* [PATCH v5 2/8] qspinlock, x86: Enable x86-64 to use queue spinlock
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (2 preceding siblings ...)
  2014-02-26 15:14 ` [PATCH v5 2/8] qspinlock, x86: Enable x86-64 to use queue spinlock Waiman Long
@ 2014-02-26 15:14 ` Waiman Long
  2014-02-26 15:14 ` [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks Waiman Long
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-26 15:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Daniel J Blueman, Rusty Russell, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	Linus Torvalds, linux-ke

This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket, the benefit of queue
spinlock may not be apparent, so queue spinlock is not enabled there.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of the ticket spinlock) and the
queue spinlock. Therefore, the use of the queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimizations which make the queue spinlock code perform
better than the generic implementation.
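
The core of that optimization is that the lock byte sits in the low byte of
the 32-bit lock word, so the unlock path only needs an ordered byte store.
The fragment below is a user-space approximation of the queue_spin_unlock()
added in this patch; the union and function names are made up for the sketch,
and a compiler builtin stands in for the barrier()/ACCESS_ONCE() pair.

#include <stdint.h>

/* Rough stand-in for the union in asm/qspinlock.h below: on
 * little-endian x86 the low byte of the 32-bit lock word is the
 * lock byte, so it can be cleared on its own. */
union arch_qspinlock_sketch {
	uint32_t slock;		/* whole 32-bit lock word            */
	uint8_t	 lock;		/* low byte: the lock bit lives here */
};

static void unlock_sketch(union arch_qspinlock_sketch *l)
{
	/* A release byte store; on x86's TSO memory model this
	 * compiles to a plain mov, which is what the patch achieves
	 * with barrier()/ACCESS_ONCE(). */
	__atomic_store_n(&l->lock, 0, __ATOMIC_RELEASE);
}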

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/qspinlock.h      |   41 +++++++++++++++++++++++++++++++++
 arch/x86/include/asm/spinlock.h       |    5 ++++
 arch/x86/include/asm/spinlock_types.h |    4 +++
 4 files changed, 51 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1b4ff87..5bf70ab 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -17,6 +17,7 @@ config X86_64
 	depends on 64BIT
 	select X86_DEV_DMA_OPS
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_USE_QUEUE_SPINLOCK
 
 ### Arch settings
 config X86
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 0000000..44cefee
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,41 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+
+#define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
+
+/*
+ * x86-64 specific queue spinlock union structure
+ */
+union arch_qspinlock {
+	struct qspinlock slock;
+	u8		 lock;	/* Lock bit	*/
+};
+
+#define	queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * No special memory barrier other than a compiler one is needed for the
+ * x86 architecture. A compiler barrier is added at the end to make sure
+ * that clearing the lock bit is done ASAP without artificial delay
+ * due to compiler optimization.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	barrier();
+	ACCESS_ONCE(qlock->lock) = 0;
+	barrier();
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..6e6de1f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -43,6 +43,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -181,6 +185,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 {
 	arch_spin_lock(lock);
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..7960268 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT	(sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 typedef struct arch_spinlock {
 	union {
 		__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include <asm/rwlock.h>
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (3 preceding siblings ...)
  2014-02-26 15:14 ` Waiman Long
@ 2014-02-26 15:14 ` Waiman Long
  2014-02-26 16:20   ` Peter Zijlstra
  2014-02-26 16:20   ` Peter Zijlstra
  2014-02-26 15:14 ` Waiman Long
                   ` (14 subsequent siblings)
  19 siblings, 2 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-26 15:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Daniel J Blueman, Rusty Russell, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	Linus Torvalds, linux-ke

A major problem with the queue spinlock patch is its performance at
low contention level (2-4 contending tasks) where it is slower than
the corresponding ticket spinlock code path. The following table shows
the execution time (in ms) of a micro-benchmark where 5M iterations
of the lock/unlock cycles were run on a 10-core Westmere-EX CPU with
2 different types of loads - standalone (lock and protected data in
different cachelines) and embedded (lock and protected data in the
same cacheline).

		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	  135/111	 135/102	  0%/-8%
       2	  732/950	1315/1573	+80%/+66%
       3	 1827/1783	2372/2428	+30%/+36%
       4	 2689/2725	2934/2934	 +9%/+8%
       5	 3736/3748	3658/3652	 -2%/-3%
       6	 4942/4984	4434/4428	-10%/-11%
       7	 6304/6319	5176/5163	-18%/-18%
       8	 7736/7629	5955/5944	-23%/-22%

It can be seen that the performance degradation is particularly bad
with 2 and 3 contending tasks. To reduce that performance deficit
at low contention levels, a special x86 specific optimized code path
for 2 contending tasks was added. This special code path will only
be activated with fewer than 16K configured CPUs because it uses
a byte in the 32-bit lock word to hold a waiting bit for the 2nd
contending task instead of queuing the waiting task in the queue.
A simplified sketch of this quick path is shown below.
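
The sketch is a user-space rendering of the queue_spin_trylock_quick() logic
added by this patch; the quick_path_sketch() name and the __atomic builtins
are illustrative stand-ins for the kernel's xchg()/cmpxchg(), and the entry
check on the old lock value as well as the interaction with the queue code
are omitted here.

#include <stdint.h>

#define LOCKED	0x001U		/* lock byte (bit 0)           */
#define WAITING	0x100U		/* waiting bit in the 2nd byte */

static int quick_path_sketch(uint16_t *lock_wait)
{
	uint16_t old, expected;

	/* Grab both the lock and waiting bytes in one exchange. */
	old = __atomic_exchange_n(lock_wait, WAITING | LOCKED,
				  __ATOMIC_ACQUIRE);
	if (old == 0) {
		/* Lock was free: we own it, clear the waiting bit. */
		__atomic_fetch_and(lock_wait, (uint16_t)~WAITING,
				   __ATOMIC_RELEASE);
		return 1;
	}
	if (old == WAITING) {
		/* Lock was free but another task had marked itself
		 * waiting; our exchange set the lock bit, so take the
		 * lock (slightly unfair). */
		return 1;
	}
	if (old == (WAITING | LOCKED)) {
		/* Lock held and a waiter already present: the caller
		 * falls back to the MCS-style queue. */
		return 0;
	}

	/* old == LOCKED: we are the single designated waiter. */
	for (;;) {
		/* Spin until the holder clears the lock byte. */
		while (__atomic_load_n(lock_wait, __ATOMIC_RELAXED) & LOCKED)
			;	/* cpu_relax() in the real code */
		/* Take the lock and clear the waiting bit in one step;
		 * retry if the lock got stolen in between. */
		expected = WAITING;
		if (__atomic_compare_exchange_n(lock_wait, &expected, LOCKED,
						0, __ATOMIC_ACQUIRE,
						__ATOMIC_RELAXED))
			return 1;
	}
}

In the real patch this function is only entered when no task is queued, and
the waiting-bit transitions interact with the 16-bit queue code as shown in
the diff below.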

With the change, the performance data became:

		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       2	  732/950	 523/528	-29%/-44%
       3	 1827/1783	2366/2384	+30%/+34%

The queue spinlock code path is now a bit faster with 2 contending
tasks.  There is also a very slight improvement with 3 contending
tasks.

The performance of the optimized code path can vary depending on which
of the several different code paths is taken. It is also not as fair as
the ticket spinlock, and there can be some variation in the execution
times of the 2 contending tasks. Testing with different pairs of cores
within the same CPU shows an execution time that varies from 400ms to
1194ms. The ticket spinlock code also shows a variation of 718-1146ms,
which is probably due to the CPU topology within a socket.

In a multi-socket server, the optimized code path also seems to
produce a big performance improvement in cross-node contention traffic
at low contention levels. The table below shows the performance with
1 contending task per node:

		[Standalone]
  # of nodes	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	   135		 135		  0%
       2	  4452		 528		-88%
       3	 10767		2369		-78%
       4	 20835		2921		-86%

The micro-benchmark was also run on a 4-core Ivy-Bridge PC. The table
below shows the collected performance data:

		  [Standalone/Embedded]
  # of tasks	Ticket lock	Queue lock	%Change
  ----------	-----------	----------	-------
       1	  197/178	  181/150	 -8%/-16%
       2	 1109/928    435-1417/697-2125
       3	 1836/1702  1372-3112/1379-3138
       4	 2717/2429  1842-4158/1846-4170

The performance of the queue lock patch varied from run to run whereas
the performance of the ticket lock was more consistent. The queue
lock figures above were the range of values that were reported.

This optimization can also be easily used by other architectures as
long as they support 8-bit and 16-bit atomic operations.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/qspinlock.h      |   20 ++++-
 include/asm-generic/qspinlock_types.h |    8 ++-
 kernel/locking/qspinlock.c            |  192 ++++++++++++++++++++++++++++++++-
 3 files changed, 215 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 44cefee..98db42e 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -7,12 +7,30 @@
 
 #define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
 
+#define smp_u8_store_release(p, v)	\
+do {					\
+	barrier();			\
+	ACCESS_ONCE(*p) = (v);		\
+} while (0)
+
+/*
+ * As the qcode will be accessed as a 16-bit word, no offset is needed
+ */
+#define _QCODE_VAL_OFFSET	0
+
 /*
  * x86-64 specific queue spinlock union structure
+ * Besides the slock and lock fields, the other fields are only
+ * valid with less than 16K CPUs.
  */
 union arch_qspinlock {
 	struct qspinlock slock;
-	u8		 lock;	/* Lock bit	*/
+	struct {
+		u8  lock;	/* Lock bit	*/
+		u8  wait;	/* Waiting bit	*/
+		u16 qcode;	/* Queue code	*/
+	};
+	u16 lock_wait;		/* Lock and wait bits */
 };
 
 #define	queue_spin_unlock queue_spin_unlock
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index df981d0..3a02a9e 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -48,7 +48,13 @@ typedef struct qspinlock {
 	atomic_t	qlcode;	/* Lock + queue code */
 } arch_spinlock_t;
 
-#define _QCODE_OFFSET		8
+#if CONFIG_NR_CPUS >= (1 << 14)
+# define _Q_MANY_CPUS
+# define _QCODE_OFFSET	8
+#else
+# define _QCODE_OFFSET	16
+#endif
+
 #define _QSPINLOCK_LOCKED	1U
 #define	_QSPINLOCK_LOCK_MASK	0xff
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ed5efa7..22a63fa 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -109,8 +109,11 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { {{0}}, 0 };
  *  2) A smp_u8_store_release() macro for byte size store operation	*
  *  3) A "union arch_qspinlock" structure that include the individual	*
  *     fields of the qspinlock structure, including:			*
- *      o slock - the qspinlock structure				*
- *      o lock  - the lock byte						*
+ *      o slock     - the qspinlock structure				*
+ *      o lock      - the lock byte					*
+ *      o wait      - the waiting byte					*
+ *      o qcode     - the queue node code				*
+ *      o lock_wait - the combined lock and waiting bytes		*
  *									*
  ************************************************************************
  */
@@ -129,6 +132,176 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
 		return 1;
 	return 0;
 }
+
+#ifndef _Q_MANY_CPUS
+/*
+ * With less than 16K CPUs, the following optimizations are possible with
+ * the x86 architecture:
+ *  1) The 2nd byte of the 32-bit lock word can be used as a pending bit
+ *     for a waiting lock acquirer so that it won't need to go through the
+ *     MCS style locking queuing which has a higher overhead.
+ *  2) The 16-bit queue code can be accessed or modified directly as a
+ *     16-bit short value without disturbing the first 2 bytes.
+ */
+#define	_QSPINLOCK_WAITING	0x100U	/* Waiting bit in 2nd byte   */
+#define	_QSPINLOCK_LWMASK	0xffff	/* Mask for lock & wait bits */
+
+#define queue_encode_qcode(cpu, idx)	(((cpu) + 1) << 2 | (idx))
+
+#define queue_spin_trylock_quick queue_spin_trylock_quick
+/**
+ * queue_spin_trylock_quick - fast spinning on the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Old queue spinlock value
+ * Return: 1 if lock acquired, 0 if failed
+ *
+ * This is an optimized contention path for 2 contending tasks. It
+ * should only be entered if no task is waiting in the queue. This
+ * optimized path is not as fair as the ticket spinlock, but it offers
+ * slightly better performance. The regular MCS locking path for 3 or
+ * more contending tasks, however, is fair.
+ *
+ * Depending on the exact timing, there are several different paths that
+ * a contending task can take. The actual contention performance depends
+ * on which path is taken. So it can be faster or slower than the
+ * corresponding ticket spinlock path. On average, it is probably on par
+ * with the ticket spinlock.
+ */
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+	u16		     old;
+
+	/*
+	 * Fall into the quick spinning code path only if no one is waiting
+	 * or the lock is available.
+	 */
+	if (unlikely((qsval != _QSPINLOCK_LOCKED) &&
+		     (qsval != _QSPINLOCK_WAITING)))
+		return 0;
+
+	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
+
+	if (old == 0) {
+		/*
+		 * Got the lock, can clear the waiting bit now
+		 */
+		smp_u8_store_release(&qlock->wait, 0);
+		return 1;
+	} else if (old == _QSPINLOCK_LOCKED) {
+try_again:
+		/*
+		 * Wait until the lock byte is cleared to get the lock
+		 */
+		do {
+			cpu_relax();
+		} while (ACCESS_ONCE(qlock->lock));
+		/*
+		 * Set the lock bit & clear the waiting bit
+		 */
+		if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
+			   _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+			return 1;
+		/*
+		 * Someone has stolen the lock, so wait again
+		 */
+		goto try_again;
+	} else if (old == _QSPINLOCK_WAITING) {
+		/*
+		 * Another task is already waiting while it steals the lock.
+		 * A bit of unfairness here won't change the big picture.
+		 * So just take the lock and return.
+		 */
+		return 1;
+	}
+	/*
+	 * Nothing needs to be done if the old value is
+	 * (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
+	 */
+	return 0;
+}
+
+#define queue_code_xchg queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: New queue code to be exchanged
+ * Return: The original qcode value in the queue spinlock
+ */
+static inline u32 queue_code_xchg(struct qspinlock *lock, u32 qcode)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	return (u32)xchg(&qlock->qcode, (u16)qcode);
+}
+
+#define queue_spin_trylock_and_clr_qcode queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	qcode <<= _QCODE_OFFSET;
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+
+#define queue_get_lock_qcode queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value
+ * Return : > 0 if lock is not available
+ *	   = 0 if lock is free
+ *	   < 0 if lock is taken & can return after cleanup
+ *
+ * It is considered locked when either the lock bit or the wait bit is set.
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	u32 qlcode;
+
+	qlcode = (u32)atomic_read(&lock->qlcode);
+	/*
+	 * In the special case where qlcode contains only _QSPINLOCK_LOCKED
+	 * and mycode, it will try to transition back to the quick spinning
+	 * code by clearing the qcode and setting the _QSPINLOCK_WAITING
+	 * bit.
+	 */
+	if (qlcode == (_QSPINLOCK_LOCKED | (mycode << _QCODE_OFFSET))) {
+		u32 old = qlcode;
+
+		qlcode = atomic_cmpxchg(&lock->qlcode, old,
+				_QSPINLOCK_LOCKED|_QSPINLOCK_WAITING);
+		if (qlcode == old) {
+			union arch_qspinlock *slock =
+				(union arch_qspinlock *)lock;
+try_again:
+			/*
+			 * Wait until the lock byte is cleared
+			 */
+			do {
+				cpu_relax();
+			} while (ACCESS_ONCE(slock->lock));
+			/*
+			 * Set the lock bit & clear the waiting bit
+			 */
+			if (cmpxchg(&slock->lock_wait, _QSPINLOCK_WAITING,
+				    _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+				return -1;	/* Got the lock */
+			goto try_again;
+		}
+	}
+	*qcode = qlcode >> _QCODE_OFFSET;
+	return qlcode & _QSPINLOCK_LWMASK;
+}
+#endif /* _Q_MANY_CPUS */
+
 #else /*  _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS  */
 /*
  * Generic functions for architectures that do not support atomic
@@ -144,7 +317,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
 	int qlcode = atomic_read(lock->qlcode);
 
 	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
-		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+		qlcode, code|_QSPINLOCK_LOCKED) == qlcode))
 			return 1;
 	return 0;
 }
@@ -156,6 +329,10 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
  * that may get superseded by a more optimized version.			*
  ************************************************************************
  */
+#ifndef queue_spin_trylock_quick
+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{ return 0; }
+#endif
 
 #ifndef queue_get_lock_qcode
 /**
@@ -266,6 +443,11 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 	u32 prev_qcode, my_qcode;
 
 	/*
+	 * Try the quick spinning code path
+	 */
+	if (queue_spin_trylock_quick(lock, qsval))
+		return;
+	/*
 	 * Get the queue node
 	 */
 	cpu_nr = smp_processor_id();
@@ -296,6 +478,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 		return;
 	}
 
+#ifdef queue_code_xchg
+	prev_qcode = queue_code_xchg(lock, my_qcode);
+#else
 	/*
 	 * Exchange current copy of the queue node code
 	 */
@@ -329,6 +514,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 	} else
 		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
 	my_qcode &= ~_QSPINLOCK_LOCKED;
+#endif /* queue_code_xchg */
 
 	if (prev_qcode) {
 		/*
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (6 preceding siblings ...)
  2014-02-26 15:14 ` [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment Waiman Long
@ 2014-02-26 15:14 ` Waiman Long
  2014-02-26 17:07   ` Konrad Rzeszutek Wilk
                     ` (3 more replies)
  2014-02-26 15:14 ` [PATCH RFC v5 5/8] pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest Waiman Long
                   ` (11 subsequent siblings)
  19 siblings, 4 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-26 15:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Daniel J Blueman, Rusty Russell, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	Linus Torvalds, linux-ke

Locking is always an issue in a virtualized environment as the virtual
CPU that is waiting on a lock may get scheduled out and hence block
any progress in lock acquisition even when the lock has been freed.

One solution to this problem is to allow unfair locks in a
para-virtualized environment. In this case, a new lock acquirer can
come and steal the lock if the next-in-line CPU to get the lock is
scheduled out. An unfair lock in a native environment is generally not
a good idea, as there is a possibility of lock starvation for a heavily
contended lock.

This patch adds a new configuration option for the x86
architecture to enable the use of unfair queue spinlock
(PARAVIRT_UNFAIR_LOCKS) in a real para-virtualized guest. A jump label
(paravirt_unfairlocks_enabled) is used to switch between a fair and
an unfair version of the spinlock code. This jump label will only be
enabled in a real PV guest.
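
Reduced to plain C, the switch that the jump label implements has the shape
shown below; the function and variable names here are illustrative, and in
the actual patch the check is a static key (static_key_false()), so the fair
path costs only a patched no-op when unfair locks are disabled.

#include <stdbool.h>

struct qspinlock;				/* opaque for this sketch */
void queue_spin_lock(struct qspinlock *lock);	/* fair, queued path      */
void queue_spin_lock_unfair(struct qspinlock *lock); /* lock-stealing path */

bool unfairlocks_enabled;	/* stands in for paravirt_unfairlocks_enabled */

/* Shape of the arch_spin_lock() wrapper added by this patch. */
static inline void arch_spin_lock_sketch(struct qspinlock *lock)
{
	if (unfairlocks_enabled)
		queue_spin_lock_unfair(lock);
	else
		queue_spin_lock(lock);
}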

Enabling this configuration feature decreases the performance of an
uncontended lock-unlock operation by about 1-2%.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/Kconfig                     |   11 +++++
 arch/x86/include/asm/qspinlock.h     |   74 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/Makefile             |    1 +
 arch/x86/kernel/paravirt-spinlocks.c |    7 +++
 4 files changed, 93 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bf70ab..8d7c941 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -645,6 +645,17 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer Y.
 
+config PARAVIRT_UNFAIR_LOCKS
+	bool "Enable unfair locks in a para-virtualized guest"
+	depends on PARAVIRT && SMP && QUEUE_SPINLOCK
+	depends on !X86_OOSTORE && !X86_PPRO_FENCE
+	---help---
+	  This changes the kernel to use unfair locks in a real
+	  para-virtualized guest system. This will help performance
+	  in most cases. However, there is a possibility of lock
+	  starvation on a heavily contended lock, especially in a
+	  large guest with many virtual CPUs.
+
 source "arch/x86/xen/Kconfig"
 
 config KVM_GUEST
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 98db42e..c278aed 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -56,4 +56,78 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 
 #include <asm-generic/qspinlock.h>
 
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+/**
+ * queue_spin_lock_unfair - acquire a queue spinlock unfairly
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	if (likely(cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return;
+	/*
+	 * Since the lock is now unfair, there is no need to activate
+	 * the 2-task quick spinning code path.
+	 */
+	queue_spin_lock_slowpath(lock, -1);
+}
+
+/**
+ * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	if (!qlock->lock &&
+	   (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/*
+ * Redefine arch_spin_lock and arch_spin_trylock as inline functions that will
+ * jump to the unfair versions if the static key paravirt_unfairlocks_enabled
+ * is true.
+ */
+#undef arch_spin_lock
+#undef arch_spin_trylock
+#undef arch_spin_lock_flags
+
+extern struct static_key paravirt_unfairlocks_enabled;
+
+/**
+ * arch_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static inline void arch_spin_lock(struct qspinlock *lock)
+{
+	if (static_key_false(&paravirt_unfairlocks_enabled)) {
+		queue_spin_lock_unfair(lock);
+		return;
+	}
+	queue_spin_lock(lock);
+}
+
+/**
+ * arch_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int arch_spin_trylock(struct qspinlock *lock)
+{
+	if (static_key_false(&paravirt_unfairlocks_enabled)) {
+		return queue_spin_trylock_unfair(lock);
+	}
+	return queue_spin_trylock(lock);
+}
+
+#define arch_spin_lock_flags(l, f)	arch_spin_lock(l)
+
+#endif /* CONFIG_PARAVIRT_UNFAIR_LOCKS */
+
 #endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cb648c8..1107a20 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -88,6 +88,7 @@ obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
 obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
+obj-$(CONFIG_PARAVIRT_UNFAIR_LOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index bbb6c73..a50032a 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -8,6 +8,7 @@
 
 #include <asm/paravirt.h>
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
 	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
@@ -18,3 +19,9 @@ EXPORT_SYMBOL(pv_lock_ops);
 
 struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
 EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+#endif
+
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+struct static_key paravirt_unfairlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_unfairlocks_enabled);
+#endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH RFC v5 5/8] pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (8 preceding siblings ...)
  2014-02-26 15:14 ` [PATCH RFC v5 5/8] pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest Waiman Long
@ 2014-02-26 15:14 ` Waiman Long
  2014-02-26 17:08   ` Konrad Rzeszutek Wilk
                     ` (5 more replies)
  2014-02-26 15:14 ` [PATCH RFC v5 6/8] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled Waiman Long
                   ` (9 subsequent siblings)
  19 siblings, 6 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-26 15:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Daniel J Blueman, Rusty Russell, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	Linus Torvalds, linux-ke

This patch adds a KVM init function to activate the unfair queue
spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config
option is selected.
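
Taken together with the previous patch, the boot-time effect in a KVM
guest is roughly the following (an editorial sketch using names from
this series; unfair_locks_boot_sketch is a hypothetical name for the
initcall added in the diff below):

	/*
	 * Editorial sketch (not the patch itself): the early initcall flips
	 * the jump label, after which the arch_spin_lock() wrapper from the
	 * previous patch starts taking the unfair path.
	 */
	static __init int unfair_locks_boot_sketch(void)
	{
		if (!kvm_para_available())	/* bare metal: leave the key off */
			return 0;

		static_key_slow_inc(&paravirt_unfairlocks_enabled);
		return 0;
	}
	early_initcall(unfair_locks_boot_sketch);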

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/kernel/kvm.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 713f1b3..a489140 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void)
 early_initcall(kvm_spinlock_init_jump);
 
 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
+
+#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
+/*
+ * Enable unfair lock if running in a real para-virtualized environment
+ */
+static __init int kvm_unfair_locks_init_jump(void)
+{
+	if (!kvm_para_available())
+		return 0;
+
+	static_key_slow_inc(&paravirt_unfairlocks_enabled);
+	printk(KERN_INFO "KVM setup unfair spinlock\n");
+
+	return 0;
+}
+early_initcall(kvm_unfair_locks_init_jump);
+#endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH RFC v5 6/8] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (9 preceding siblings ...)
  2014-02-26 15:14 ` Waiman Long
@ 2014-02-26 15:14 ` Waiman Long
  2014-02-26 15:14 ` Waiman Long
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-26 15:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Daniel J Blueman, Rusty Russell, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	Linus Torvalds, linux-ke

This patch renames the paravirt_ticketlocks_enabled static key to a
more generic paravirt_spinlocks_enabled name.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/spinlock.h      |    4 ++--
 arch/x86/kernel/kvm.c                |    2 +-
 arch/x86/kernel/paravirt-spinlocks.c |    4 ++--
 arch/x86/xen/spinlock.c              |    2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 6e6de1f..283f2cf 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -40,7 +40,7 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD	(1 << 15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
+extern struct static_key paravirt_spinlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
 #ifdef CONFIG_QUEUE_SPINLOCK
@@ -151,7 +151,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
 	if (TICKET_SLOWPATH_FLAG &&
-	    static_key_false(&paravirt_ticketlocks_enabled)) {
+	    static_key_false(&paravirt_spinlocks_enabled)) {
 		arch_spinlock_t prev;
 
 		prev = *lock;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a489140..f318e78 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -818,7 +818,7 @@ static __init int kvm_spinlock_init_jump(void)
 	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
 		return 0;
 
-	static_key_slow_inc(&paravirt_ticketlocks_enabled);
+	static_key_slow_inc(&paravirt_spinlocks_enabled);
 	printk(KERN_INFO "KVM setup paravirtual spinlock\n");
 
 	return 0;
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index a50032a..8c67cbe 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -17,8 +17,8 @@ struct pv_lock_ops pv_lock_ops = {
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_spinlocks_enabled);
 #endif
 
 #ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 581521c..06f4a64 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -290,7 +290,7 @@ static __init int xen_init_spinlocks_jump(void)
 	if (!xen_pvspin)
 		return 0;
 
-	static_key_slow_inc(&paravirt_ticketlocks_enabled);
+	static_key_slow_inc(&paravirt_spinlocks_enabled);
 	return 0;
 }
 early_initcall(xen_init_spinlocks_jump);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (11 preceding siblings ...)
  2014-02-26 15:14 ` Waiman Long
@ 2014-02-26 15:14 ` Waiman Long
  2014-02-26 17:54   ` Konrad Rzeszutek Wilk
                     ` (3 more replies)
  2014-02-26 15:14 ` Waiman Long
                   ` (6 subsequent siblings)
  19 siblings, 4 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-26 15:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Daniel J Blueman, Rusty Russell, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	Linus Torvalds, linux-ke

This patch adds para-virtualization support to the queue spinlock code
by enabling the queue head to kick the lock holder CPU, if known,
when the lock isn't released for a certain amount of time. It
also enables mutual monitoring between the queue head CPU and the
following node CPU in the queue to make sure that both of them
stay scheduled in.
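
To make the control flow easier to follow, the queue-head side of the
scheme reduces to roughly the loop below (an editorial sketch of the
kernel/locking/qspinlock.c hunk in this patch; lock_is_free() is a
hypothetical stand-in for the real release test):

	/*
	 * Editorial sketch of the queue-head wait loop with the PV hook
	 * wired in; see the qspinlock.c hunk below for the real code.
	 */
	static void head_wait_sketch(struct qspinlock *lock, struct qnode *node,
				     u32 prev_qcode)
	{
		int hcnt = 0;

		while (!lock_is_free(lock)) {
			arch_mutex_cpu_relax();
			/*
			 * Once per threshold: kick the presumed lock holder.
			 * Periodically: toggle and re-check the next node's
			 * active flag, kicking the next node if it appears
			 * to have been scheduled out.
			 */
			pv_head_spin_check(&hcnt, prev_qcode,
					   PV_GET_NXTCPU(node), node->next,
					   PV_OFFSET);
		}
	}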

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/include/asm/paravirt.h       |    9 ++-
 arch/x86/include/asm/paravirt_types.h |   12 +++
 arch/x86/include/asm/pvqspinlock.h    |  176 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/paravirt-spinlocks.c  |    4 +
 kernel/locking/qspinlock.c            |   41 +++++++-
 5 files changed, 235 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/include/asm/pvqspinlock.h

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cd6e161..06d3279 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -711,7 +711,12 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 }
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
-
+#ifdef CONFIG_QUEUE_SPINLOCK
+static __always_inline void __queue_kick_cpu(int cpu, enum pv_kick_type type)
+{
+	PVOP_VCALL2(pv_lock_ops.kick_cpu, cpu, type);
+}
+#else
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
 							__ticket_t ticket)
 {
@@ -723,7 +728,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
 {
 	PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
-
+#endif
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..87f8836 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,21 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+enum pv_kick_type {
+	PV_KICK_LOCK_HOLDER,
+	PV_KICK_QUEUE_HEAD,
+	PV_KICK_NEXT_NODE
+};
+#endif
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+	void (*kick_cpu)(int cpu, enum pv_kick_type);
+#else
 	struct paravirt_callee_save lock_spinning;
 	void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h
new file mode 100644
index 0000000..45aae39
--- /dev/null
+++ b/arch/x86/include/asm/pvqspinlock.h
@@ -0,0 +1,176 @@
+#ifndef _ASM_X86_PVQSPINLOCK_H
+#define _ASM_X86_PVQSPINLOCK_H
+
+/*
+ *	Queue Spinlock Para-Virtualization Support
+ *
+ *	+------+	    +-----+ nxtcpu_p1  +----+
+ *	| Lock |	    |Queue|----------->|Next|
+ *	|Holder|<-----------|Head |<-----------|Node|
+ *	+------+ prev_qcode +-----+ prev_qcode +----+
+ *
+ * As long as the current lock holder has passed through the slowpath, the
+ * queue head CPU will have the lock holder's CPU number stored in its
+ * prev_qcode. The situation is the same for the node next to the queue head.
+ *
+ * The next node, while setting up the next pointer in the queue head, can
+ * also store its own CPU number in the queue head's node. With that change,
+ * the queue head will have the CPU numbers of both its upstream and
+ * downstream neighbors.
+ *
+ * To make forward progress in lock acquisition and release, it is necessary
+ * that both the lock holder and the queue head virtual CPUs are present.
+ * The queue head can monitor the lock holder, but the lock holder can't
+ * monitor the queue head back. As a result, the next node is also brought
+ * into the picture to monitor the queue head. In the above diagram, all the
+ * 3 virtual CPUs should be present with the queue head and next node
+ * monitoring each other to make sure they are both present.
+ *
+ * Heartbeat counters are used to track if a neighbor is active. There are
+ * 3 different sets of heartbeat counter monitoring going on:
+ * 1) The queue head will wait until the number of loop iterations exceeds a
+ *    certain threshold (HEAD_SPIN_THRESHOLD). When that happens, it will
+ *    send a kick-cpu signal to the lock holder if it has the CPU number
+ *    available. The kick-cpu signal will be sent only once, as the real
+ *    lock holder may not be the same as what the queue head thinks it is.
+ * 2) The queue head will periodically clear the active flag of the next node.
+ *    It will then check to see if the active flag remains cleared at the end
+ *    of the cycle. If it does, the next node CPU may have been scheduled out,
+ *    so it sends a kick-cpu signal to make sure that the next node CPU
+ *    remains active.
+ * 3) The next node CPU will monitor its own active flag to see if it gets
+ *    cleared periodically. If it does not, the queue head CPU may have been
+ *    scheduled out. It will then send a kick-cpu signal to the queue head CPU.
+ */
+
+/*
+ * Loop thresholds
+ */
+#define	HEAD_SPIN_THRESHOLD	(1<<12)	/* Threshold to kick lock holder  */
+#define	CLEAR_ACTIVE_THRESHOLD	(1<<8)	/* Threshold for clearing active flag */
+#define CLEAR_ACTIVE_MASK	(CLEAR_ACTIVE_THRESHOLD - 1)
+
+/*
+ * PV macros
+ */
+#define PV_SET_VAR(type, var, val)	type var = val
+#define PV_VAR(var)			var
+#define	PV_GET_NXTCPU(node)		(node)->pv.nxtcpu_p1
+
+/*
+ * Additional fields to be added to the qnode structure
+ *
+ * Try to cram the PV fields into 32 bits so that they won't increase the
+ * qnode size on x86-64.
+ */
+#if CONFIG_NR_CPUS >= (1 << 16)
+#define _cpuid_t	u32
+#else
+#define _cpuid_t	u16
+#endif
+
+struct pv_qvars {
+	u8	 active;	/* Set if CPU active		*/
+	u8	 prehead;	/* Set if next to queue head	*/
+	_cpuid_t nxtcpu_p1;	/* CPU number of next node + 1	*/
+};
+
+/**
+ * pv_init_vars - initialize fields in struct pv_qvars
+ * @pv: pointer to struct pv_qvars
+ */
+static __always_inline void pv_init_vars(struct pv_qvars *pv)
+{
+	pv->active    = false;
+	pv->prehead   = false;
+	pv->nxtcpu_p1 = 0;
+}
+
+/**
+ * pv_head_spin_check - perform para-virtualization checks for the queue head
+ * @count : loop count
+ * @qcode : queue code of the supposed lock holder
+ * @nxtcpu: CPU number of next node + 1
+ * @next  : pointer to the next node
+ * @offset: offset of the pv_qvars within the qnode
+ *
+ * 4 checks will be done:
+ * 1) See if it is time to kick the lock holder
+ * 2) Set the prehead flag of the next node
+ * 3) Clear the active flag of the next node periodically
+ * 4) If the active flag is not set after a while, assume the CPU of the
+ *    next-in-line node is offline and kick it back up again.
+ */
+static __always_inline void
+pv_head_spin_check(int *count, u32 qcode, int nxtcpu, void *next, int offset)
+{
+	if (!static_key_false(&paravirt_spinlocks_enabled))
+		return;
+	if ((++(*count) == HEAD_SPIN_THRESHOLD) && qcode) {
+		/*
+		 * Get the CPU number of the lock holder & kick it.
+		 * The lock may have been stolen by another CPU
+		 * if PARAVIRT_UNFAIR_LOCKS is set, so the computed
+		 * CPU number may not be the actual lock holder's.
+		 */
+		int cpu = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+		__queue_kick_cpu(cpu, PV_KICK_LOCK_HOLDER);
+	}
+	if (next) {
+		struct pv_qvars *pv = (struct pv_qvars *)
+				      ((char *)next + offset);
+
+		if (!pv->prehead)
+			pv->prehead = true;
+		if ((*count & CLEAR_ACTIVE_MASK) == CLEAR_ACTIVE_MASK)
+			pv->active = false;
+		if (((*count & CLEAR_ACTIVE_MASK) == 0) &&
+			!pv->active && nxtcpu)
+			/*
+			 * The CPU of the next node doesn't seem to be
+			 * active, so kick it to make sure that it is
+			 * ready to take over as the queue head.
+			 */
+			__queue_kick_cpu(nxtcpu - 1, PV_KICK_NEXT_NODE);
+	}
+}
+
+/**
+ * pv_queue_spin_check - perform para-virtualization checks for a queue member
+ * @pv   : pointer to struct pv_qvars
+ * @count: loop count
+ * @qcode: queue code of the previous node (queue head if pv->prehead set)
+ *
+ * Set the active flag if it is next to the queue head
+ */
+static __always_inline void
+pv_queue_spin_check(struct pv_qvars *pv, int *count, u32 qcode)
+{
+	if (!static_key_false(&paravirt_spinlocks_enabled))
+		return;
+	if (ACCESS_ONCE(pv->prehead)) {
+		if (pv->active == false) {
+			*count = 0;	/* Reset counter */
+			pv->active = true;
+		}
+		if ((++(*count) >= 4 * CLEAR_ACTIVE_THRESHOLD) && qcode) {
+			/*
+			 * The queue head hasn't cleared the active flag
+			 * for too long. Need to kick it.
+			 */
+			int cpu = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+			__queue_kick_cpu(cpu, PV_KICK_QUEUE_HEAD);
+			*count = 0;
+		}
+	}
+}
+
+/**
+ * pv_set_cpu - set CPU # in the given pv_qvars structure
+ * @pv : pointer to struct pv_qvars to be set
+ * @cpu: cpu number to be set
+ */
+static __always_inline void pv_set_cpu(struct pv_qvars *pv, int cpu)
+{
+	pv->nxtcpu_p1 = cpu + 1;
+}
+
+#endif /* _ASM_X86_PVQSPINLOCK_H */
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 8c67cbe..30d76f5 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -11,9 +11,13 @@
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
+#ifdef CONFIG_QUEUE_SPINLOCK
+	.kick_cpu = paravirt_nop,
+#else
 	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
 	.unlock_kick = paravirt_nop,
 #endif
+#endif
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 22a63fa..f10446e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -58,6 +58,26 @@
  */
 
 /*
+ * Para-virtualized queue spinlock support
+ */
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#include <asm/pvqspinlock.h>
+#else
+
+#define PV_SET_VAR(type, var, val)
+#define PV_VAR(var)			0
+#define PV_GET_NXTCPU(node)		0
+
+struct pv_qvars {};
+static __always_inline void pv_init_vars(struct pv_qvars *pv)		{}
+static __always_inline void pv_head_spin_check(int *count, u32 qcode,
+				int nxtcpu, void *next, int offset)	{}
+static __always_inline void pv_queue_spin_check(struct pv_qvars *pv,
+				int *count, u32 qcode)			{}
+static __always_inline void pv_set_cpu(struct pv_qvars *pv, int cpu)	{}
+#endif
+
+/*
  * The 24-bit queue node code is divided into the following 2 fields:
  * Bits 0-1 : queue node index (4 nodes)
  * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
@@ -77,15 +97,13 @@
 
 /*
  * The queue node structure
- *
- * This structure is essentially the same as the mcs_spinlock structure
- * in mcs_spinlock.h file. This structure is retained for future extension
- * where new fields may be added.
  */
 struct qnode {
 	u32		 wait;		/* Waiting flag		*/
+	struct pv_qvars	 pv;		/* Para-virtualization  */
 	struct qnode	*next;		/* Next queue node addr */
 };
+#define PV_OFFSET	offsetof(struct qnode, pv)
 
 struct qnode_set {
 	struct qnode	nodes[MAX_QNODES];
@@ -441,6 +459,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 	unsigned int cpu_nr, qn_idx;
 	struct qnode *node, *next;
 	u32 prev_qcode, my_qcode;
+	PV_SET_VAR(int, hcnt, 0);
 
 	/*
 	 * Try the quick spinning code path
@@ -468,6 +487,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 	 */
 	node->wait = true;
 	node->next = NULL;
+	pv_init_vars(&node->pv);
 
 	/*
 	 * The lock may be available at this point, try again if no task was
@@ -522,13 +542,22 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 		 * and set up the "next" fields of that node.
 		 */
 		struct qnode *prev = xlate_qcode(prev_qcode);
+		PV_SET_VAR(int, qcnt, 0);
 
 		ACCESS_ONCE(prev->next) = node;
 		/*
+		 * Set current CPU number into the previous node
+		 */
+		pv_set_cpu(&prev->pv, cpu_nr);
+
+		/*
 		 * Wait until the waiting flag is off
 		 */
-		while (smp_load_acquire(&node->wait))
+		while (smp_load_acquire(&node->wait)) {
 			arch_mutex_cpu_relax();
+			pv_queue_spin_check(&node->pv, PV_VAR(&qcnt),
+					    prev_qcode);
+		}
 	}
 
 	/*
@@ -560,6 +589,8 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
 				goto release_node;
 		}
 		arch_mutex_cpu_relax();
+		pv_head_spin_check(PV_VAR(&hcnt), prev_qcode,
+				PV_GET_NXTCPU(node), node->next, PV_OFFSET);
 	}
 
 notify_next:
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH RFC v5 8/8] pvqspinlock, x86: Enable KVM to use qspinlock's PV support
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (14 preceding siblings ...)
  2014-02-26 15:14 ` [PATCH RFC v5 8/8] pvqspinlock, x86: Enable KVM to use qspinlock's PV support Waiman Long
@ 2014-02-26 15:14 ` Waiman Long
  2014-02-27  9:31   ` Paolo Bonzini
  2014-02-27  9:31   ` Paolo Bonzini
  2014-02-26 17:00 ` [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with " Konrad Rzeszutek Wilk
                   ` (3 subsequent siblings)
  19 siblings, 2 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-26 15:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Daniel J Blueman, Rusty Russell, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	Linus Torvalds, linux-ke

This patch enables KVM to use the queue spinlock's PV support code
when the PARAVIRT_SPINLOCKS kernel config option is set. However,
PV support for Xen is not ready yet, so the queue spinlock still
has to be disabled when the PARAVIRT_SPINLOCKS config option is
enabled together with Xen.
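
For orientation, the resulting kick path in a KVM guest looks roughly
like this (an editorial summary built only from names introduced in
patches 7 and 8 of this series, not code from the patch):

	/*
	 * Editorial sketch of the kick path once this patch is applied:
	 *
	 *   pv_head_spin_check() / pv_queue_spin_check()          -- patch 7
	 *     -> __queue_kick_cpu(cpu, type)
	 *       -> pv_lock_ops.kick_cpu == kvm_kick_cpu_type      -- this patch
	 *         -> kvm_kick_cpu(cpu)     -- KVM_HC_KICK_CPU hypercall
	 *         -> inc_kick_stats(type)  -- debugfs counters, if enabled
	 */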

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/kernel/kvm.c |   54 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/Kconfig.locks  |    2 +-
 2 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index f318e78..3ddc436 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -568,6 +568,7 @@ static void kvm_kick_cpu(int cpu)
 	kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 enum kvm_contention_stat {
 	TAKEN_SLOW,
 	TAKEN_SLOW_PICKUP,
@@ -795,6 +796,55 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 		}
 	}
 }
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_KVM_DEBUG_FS
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+static u32 lh_kick_stats;	/* Lock holder kick count */
+static u32 qh_kick_stats;	/* Queue head kick count  */
+static u32 nn_kick_stats;	/* Next node kick count   */
+
+static int __init kvm_spinlock_debugfs(void)
+{
+	d_kvm_debug = debugfs_create_dir("kvm-guest", NULL);
+	if (!d_kvm_debug) {
+		printk(KERN_WARNING
+		       "Could not create 'kvm' debugfs directory\n");
+		return -ENOMEM;
+	}
+	d_spin_debug = debugfs_create_dir("spinlocks", d_kvm_debug);
+
+	debugfs_create_u32("lh_kick_stats", 0644, d_spin_debug, &lh_kick_stats);
+	debugfs_create_u32("qh_kick_stats", 0644, d_spin_debug, &qh_kick_stats);
+	debugfs_create_u32("nn_kick_stats", 0644, d_spin_debug, &nn_kick_stats);
+
+	return 0;
+}
+
+static inline void inc_kick_stats(enum pv_kick_type type)
+{
+	if (type == PV_KICK_LOCK_HOLDER)
+		add_smp(&lh_kick_stats, 1);
+	else if (type == PV_KICK_QUEUE_HEAD)
+		add_smp(&qh_kick_stats, 1);
+	else
+		add_smp(&nn_kick_stats, 1);
+}
+fs_initcall(kvm_spinlock_debugfs);
+
+#else /* CONFIG_KVM_DEBUG_FS */
+static inline void inc_kick_stats(enum pv_kick_type type)
+{
+}
+#endif /* CONFIG_KVM_DEBUG_FS */
+
+static void kvm_kick_cpu_type(int cpu, enum pv_kick_type type)
+{
+	kvm_kick_cpu(cpu);
+	inc_kick_stats(type);
+}
+#endif /* !CONFIG_QUEUE_SPINLOCK */
 
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
@@ -807,8 +857,12 @@ void __init kvm_spinlock_init(void)
 	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
 		return;
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+	pv_lock_ops.kick_cpu = kvm_kick_cpu_type;
+#else
 	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
 	pv_lock_ops.unlock_kick = kvm_unlock_kick;
+#endif
 }
 
 static __init int kvm_spinlock_init_jump(void)
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index f185584..a70fdeb 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK
 
 config QUEUE_SPINLOCK
 	def_bool y if ARCH_USE_QUEUE_SPINLOCK
-	depends on SMP && !PARAVIRT_SPINLOCKS
+	depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-26 15:14 ` [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks Waiman Long
  2014-02-26 16:20   ` Peter Zijlstra
@ 2014-02-26 16:20   ` Peter Zijlstra
  2014-02-27 20:42     ` Waiman Long
  2014-02-27 20:42     ` Waiman Long
  1 sibling, 2 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-02-26 16:20 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner


You don't happen to have a proper state diagram for this thing do you?

I suppose I'm going to have to make one; this is all getting a bit
unwieldy, and those xchg() + fixup things are hard to read.

On Wed, Feb 26, 2014 at 10:14:23AM -0500, Waiman Long wrote:
> +static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
> +{
> +	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
> +	u16		     old;
> +
> +	/*
> +	 * Fall into the quick spinning code path only if no one is waiting
> +	 * or the lock is available.
> +	 */
> +	if (unlikely((qsval != _QSPINLOCK_LOCKED) &&
> +		     (qsval != _QSPINLOCK_WAITING)))
> +		return 0;
> +
> +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
> +
> +	if (old == 0) {
> +		/*
> +		 * Got the lock, can clear the waiting bit now
> +		 */
> +		smp_u8_store_release(&qlock->wait, 0);


So we just did an atomic op, and now you're trying to optimize this
write. Why do you need a whole byte for that?

Surely a cmpxchg loop with the right atomic op can't be _that_ much
slower? It's far more readable and likely avoids that steal fail below as
well.

> +		return 1;
> +	} else if (old == _QSPINLOCK_LOCKED) {
> +try_again:
> +		/*
> +		 * Wait until the lock byte is cleared to get the lock
> +		 */
> +		do {
> +			cpu_relax();
> +		} while (ACCESS_ONCE(qlock->lock));
> +		/*
> +		 * Set the lock bit & clear the waiting bit
> +		 */
> +		if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
> +			   _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
> +			return 1;
> +		/*
> +		 * Someone has steal the lock, so wait again
> +		 */
> +		goto try_again;

That's just a fail.. steals should not ever be allowed. It's a fair lock
after all.

> +	} else if (old == _QSPINLOCK_WAITING) {
> +		/*
> +		 * Another task is already waiting while it steals the lock.
> +		 * A bit of unfairness here won't change the big picture.
> +		 * So just take the lock and return.
> +		 */
> +		return 1;
> +	}
> +	/*
> +	 * Nothing need to be done if the old value is
> +	 * (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
> +	 */
> +	return 0;
> +}




> @@ -296,6 +478,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>  		return;
>  	}
>  
> +#ifdef queue_code_xchg
> +	prev_qcode = queue_code_xchg(lock, my_qcode);
> +#else
>  	/*
>  	 * Exchange current copy of the queue node code
>  	 */
> @@ -329,6 +514,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>  	} else
>  		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
>  	my_qcode &= ~_QSPINLOCK_LOCKED;
> +#endif /* queue_code_xchg */
>  
>  	if (prev_qcode) {
>  		/*

That's just horrible.. please just make the entire #else branch another
version of that same queue_code_xchg() function.
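
For reference, a generic cmpxchg retry loop has the following shape
(illustrative only; not code from this series, and not a drop-in
replacement for the quick path quoted above -- try_set_bits is a
hypothetical helper name):

	/* Generic compare-and-swap retry loop -- pattern only. */
	static inline int try_set_bits(u16 *word, u16 bits)
	{
		u16 val = ACCESS_ONCE(*word);

		for (;;) {
			u16 old;

			if (val & bits)
				return 0;	/* someone else set them first */
			old = cmpxchg(word, val, val | bits);
			if (old == val)
				return 1;	/* bits set atomically */
			val = old;		/* lost a race, retry */
		}
	}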

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-26 15:14 ` [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
  2014-02-26 16:22   ` Peter Zijlstra
@ 2014-02-26 16:22   ` Peter Zijlstra
  2014-02-27 20:25     ` Waiman Long
  2014-02-27 20:25     ` Waiman Long
  2014-02-26 16:24   ` Peter Zijlstra
  2014-02-26 16:24   ` Peter Zijlstra
  3 siblings, 2 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-02-26 16:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On Wed, Feb 26, 2014 at 10:14:21AM -0500, Waiman Long wrote:

> +struct qnode {
> +	u32		 wait;		/* Waiting flag		*/
> +	struct qnode	*next;		/* Next queue node addr */
> +};
> +
> +struct qnode_set {
> +	struct qnode	nodes[MAX_QNODES];
> +	int		node_idx;	/* Current node to use */
> +};
> +
> +/*
> + * Per-CPU queue node structures
> + */
> +static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { {{0}}, 0 };

So I've not yet wrapped my head around any of this; and I see a later
patch adds some paravirt gunk to this, but it does blow that you can't
keep it to a single cacheline for the sane case.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-26 15:14 ` [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
                     ` (2 preceding siblings ...)
  2014-02-26 16:24   ` Peter Zijlstra
@ 2014-02-26 16:24   ` Peter Zijlstra
  2014-02-27 20:25     ` Waiman Long
  2014-02-27 20:25     ` Waiman Long
  3 siblings, 2 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-02-26 16:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On Wed, Feb 26, 2014 at 10:14:21AM -0500, Waiman Long wrote:
> +static void put_qnode(void)
> +{
> +	struct qnode_set *qset = this_cpu_ptr(&qnset);
> +
> +	qset->node_idx--;
> +}

That very much wants to be: this_cpu_dec().
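
I.e. (sketch of the suggested one-liner; qnset is the per-cpu variable
from the earlier hunk):

static void put_qnode(void)
{
	/* drop back to the previous per-cpu queue node slot */
	this_cpu_dec(qnset.node_idx);
}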

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (15 preceding siblings ...)
  2014-02-26 15:14 ` Waiman Long
@ 2014-02-26 17:00 ` Konrad Rzeszutek Wilk
  2014-02-28 16:56   ` Waiman Long
  2014-02-28 16:56   ` Waiman Long
  2014-02-26 17:00 ` Konrad Rzeszutek Wilk
                   ` (2 subsequent siblings)
  19 siblings, 2 replies; 125+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-02-26 17:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Marcos Matsunaga, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Alok Kataria, linux-arch, x86, Ingo Molnar,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Rik van Riel, Arnd Bergmann, Daniel J Blueman, Oleg Nesterov,
	Steven Rostedt, Chris Wright, George Spelvin, Thomas Gleixner

On Wed, Feb 26, 2014 at 10:14:20AM -0500, Waiman Long wrote:
> v4->v5:
>  - Move the optimized 2-task contending code to the generic file to
>    enable more architectures to use it without code duplication.
>  - Address some of the style-related comments by PeterZ.
>  - Allow the use of unfair queue spinlock in a real para-virtualized
>    execution environment.
>  - Add para-virtualization support to the qspinlock code by ensuring
>    that the lock holder and queue head stay alive as much as possible.
> 
> v3->v4:
>  - Remove debugging code and fix a configuration error
>  - Simplify the qspinlock structure and streamline the code to make it
>    perform a bit better
>  - Add an x86 version of asm/qspinlock.h for holding x86 specific
>    optimization.
>  - Add an optimized x86 code path for 2 contending tasks to improve
>    low contention performance.
> 
> v2->v3:
>  - Simplify the code by using numerous mode only without an unfair option.
>  - Use the latest smp_load_acquire()/smp_store_release() barriers.
>  - Move the queue spinlock code to kernel/locking.
>  - Make the use of queue spinlock the default for x86-64 without user
>    configuration.
>  - Additional performance tuning.
> 
> v1->v2:
>  - Add some more comments to document what the code does.
>  - Add a numerous CPU mode to support >= 16K CPUs
>  - Add a configuration option to allow lock stealing which can further
>    improve performance in many cases.
>  - Enable wakeup of queue head CPU at unlock time for non-numerous
>    CPU mode.
> 
> This patch set has 3 different sections:
>  1) Patches 1-3: Introduces a queue-based spinlock implementation that
>     can replace the default ticket spinlock without increasing the
>     size of the spinlock data structure. As a result, critical kernel
>     data structures that embed spinlock won't increase in size and
>     breaking data alignments.
>  2) Patches 4 and 5: Enables the use of unfair queue spinlock in a
>     real para-virtualized execution environment. This can resolve
>     some of the locking related performance issues due to the fact
>     that the next CPU to get the lock may have been scheduled out
>     for a period of time.
>  3) Patches 6-8: Enable qspinlock para-virtualization support by making
>     sure that the lock holder and the queue head stay alive as long as
>     possible.
> 
> Patches 1-3 are fully tested and ready for production. Patches 4-8, on
> the other hands, are not fully tested. They have undergone compilation
> tests with various combinations of kernel config setting and boot-up
> tests in a non-virtualized setting. Further tests and performance
> characterization are still needed to be done in a KVM guest. So
> comments on them are welcomed. Suggestions or recommendations on how
> to add PV support in the Xen environment are also needed.

It should be fairly easy. You just need to implement the kick, right?
An IPI should be all that is needed - look in xen_unlock_kick. The
rest of the spinlock code is all generic and is shared between
KVM, Xen and baremetal.

You should be able to run all of this under an HVM guest as well - as
in, you don't need a pure PV guest to use the PV ticketlocks.

An easy way to install/run this is to install your latest distro,
do 'yum install xen' or 'apt-get install xen'. Reboot and you
are under Xen. Launch guests, etc with your favorite virtualization
toolstack.
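
As a rough sketch of that direction (hook and vector usage modeled on the
existing xen_unlock_kick()/xen_send_IPI_one() path; the function name is
made up and none of this is tested, so treat it as a guess):

static void xen_qspinlock_kick_cpu(int cpu, enum pv_kick_type type)
{
	/*
	 * Poke the target vCPU with the spinlock IPI so the hypervisor
	 * schedules it back in; the kicked vCPU re-reads the lock and
	 * queue state itself, so a spurious kick is harmless.
	 */
	xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
}

It would then be registered as pv_lock_ops.kick_cpu from the Xen spinlock
init code, much like the ticketlock variant registers xen_unlock_kick today.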
> 
> The queue spinlock has slightly better performance than the ticket
> spinlock in uncontended case. Its performance can be much better
> with moderate to heavy contention.  This patch has the potential of
> improving the performance of all the workloads that have moderate to
> heavy spinlock contention.
> 
> The queue spinlock is especially suitable for NUMA machines with at
> least 2 sockets, though noticeable performance benefit probably won't
> show up in machines with less than 4 sockets.
> 
> The purpose of this patch set is not to solve any particular spinlock
> contention problems. Those need to be solved by refactoring the code
> to make more efficient use of the lock or finer granularity ones. The
> main purpose is to make the lock contention problems more tolerable
> until someone can spend the time and effort to fix them.
> 
> Waiman Long (8):
>   qspinlock: Introducing a 4-byte queue spinlock implementation
>   qspinlock, x86: Enable x86-64 to use queue spinlock
>   qspinlock, x86: Add x86 specific optimization for 2 contending tasks
>   pvqspinlock, x86: Allow unfair spinlock in a real PV environment
>   pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest
>   pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
>   pvqspinlock, x86: Add qspinlock para-virtualization support
>   pvqspinlock, x86: Enable KVM to use qspinlock's PV support
> 
>  arch/x86/Kconfig                      |   12 +
>  arch/x86/include/asm/paravirt.h       |    9 +-
>  arch/x86/include/asm/paravirt_types.h |   12 +
>  arch/x86/include/asm/pvqspinlock.h    |  176 ++++++++++
>  arch/x86/include/asm/qspinlock.h      |  133 +++++++
>  arch/x86/include/asm/spinlock.h       |    9 +-
>  arch/x86/include/asm/spinlock_types.h |    4 +
>  arch/x86/kernel/Makefile              |    1 +
>  arch/x86/kernel/kvm.c                 |   73 ++++-
>  arch/x86/kernel/paravirt-spinlocks.c  |   15 +-
>  arch/x86/xen/spinlock.c               |    2 +-
>  include/asm-generic/qspinlock.h       |  122 +++++++
>  include/asm-generic/qspinlock_types.h |   61 ++++
>  kernel/Kconfig.locks                  |    7 +
>  kernel/locking/Makefile               |    1 +
>  kernel/locking/qspinlock.c            |  610 +++++++++++++++++++++++++++++++++
>  16 files changed, 1239 insertions(+), 8 deletions(-)
>  create mode 100644 arch/x86/include/asm/pvqspinlock.h
>  create mode 100644 arch/x86/include/asm/qspinlock.h
>  create mode 100644 include/asm-generic/qspinlock.h
>  create mode 100644 include/asm-generic/qspinlock_types.h
>  create mode 100644 kernel/locking/qspinlock.c
> 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  2014-02-26 15:14 ` Waiman Long
@ 2014-02-26 17:07   ` Konrad Rzeszutek Wilk
  2014-02-28 17:06     ` Waiman Long
  2014-02-28 17:06     ` Waiman Long
  2014-02-26 17:07   ` Konrad Rzeszutek Wilk
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 125+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-02-26 17:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Daniel J Blueman, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Cheg

On Wed, Feb 26, 2014 at 10:14:24AM -0500, Waiman Long wrote:
> Locking is always an issue in a virtualized environment as the virtual
> CPU that is waiting on a lock may get scheduled out and hence block
> any progress in lock acquisition even when the lock has been freed.
> 
> One solution to this problem is to allow unfair lock in a
> para-virtualized environment. In this case, a new lock acquirer can
> come and steal the lock if the next-in-line CPU to get the lock is
> scheduled out. Unfair lock in a native environment is generally not a

Hmm, how do you know if the 'next-in-line CPU' is scheduled out? As
in the hypervisor knows - but you as a guest might have no idea
of it.

> good idea as there is a possibility of lock starvation for a heavily
> contended lock.

Should this then detect whether it is running under virtualization
and only then activate itself? And stay disabled when run on baremetal?

> 
> This patch add a new configuration option for the x86
> architecture to enable the use of unfair queue spinlock
> (PARAVIRT_UNFAIR_LOCKS) in a real para-virtualized guest. A jump label
> (paravirt_unfairlocks_enabled) is used to switch between a fair and
> an unfair version of the spinlock code. This jump label will only be
> enabled in a real PV guest.

As opposed to fake PV guest :-) I think you can remove the 'real'.


> 
> Enabling this configuration feature decreases the performance of an
> uncontended lock-unlock operation by about 1-2%.

Presumably on baremetal, right?

> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  arch/x86/Kconfig                     |   11 +++++
>  arch/x86/include/asm/qspinlock.h     |   74 ++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/Makefile             |    1 +
>  arch/x86/kernel/paravirt-spinlocks.c |    7 +++
>  4 files changed, 93 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5bf70ab..8d7c941 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -645,6 +645,17 @@ config PARAVIRT_SPINLOCKS
>  
>  	  If you are unsure how to answer this question, answer Y.
>  
> +config PARAVIRT_UNFAIR_LOCKS
> +	bool "Enable unfair locks in a para-virtualized guest"
> +	depends on PARAVIRT && SMP && QUEUE_SPINLOCK
> +	depends on !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE
> +	---help---
> +	  This changes the kernel to use unfair locks in a real
> +	  para-virtualized guest system. This will help performance
> +	  in most cases. However, there is a possibility of lock
> +	  starvation on a heavily contended lock especially in a
> +	  large guest with many virtual CPUs.
> +
>  source "arch/x86/xen/Kconfig"
>  
>  config KVM_GUEST
> diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
> index 98db42e..c278aed 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -56,4 +56,78 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
>  
>  #include <asm-generic/qspinlock.h>
>  
> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
> +/**
> + * queue_spin_lock_unfair - acquire a queue spinlock unfairly
> + * @lock: Pointer to queue spinlock structure
> + */
> +static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
> +{
> +	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
> +
> +	if (likely(cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
> +		return;
> +	/*
> +	 * Since the lock is now unfair, there is no need to activate
> +	 * the 2-task quick spinning code path.
> +	 */
> +	queue_spin_lock_slowpath(lock, -1);
> +}
> +
> +/**
> + * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly
> + * @lock : Pointer to queue spinlock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
> +{
> +	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
> +
> +	if (!qlock->lock &&
> +	   (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
> +		return 1;
> +	return 0;
> +}
> +
> +/*
> + * Redefine arch_spin_lock and arch_spin_trylock as inline functions that will
> + * jump to the unfair versions if the static key paravirt_unfairlocks_enabled
> + * is true.
> + */
> +#undef arch_spin_lock
> +#undef arch_spin_trylock
> +#undef arch_spin_lock_flags
> +
> +extern struct static_key paravirt_unfairlocks_enabled;
> +
> +/**
> + * arch_spin_lock - acquire a queue spinlock
> + * @lock: Pointer to queue spinlock structure
> + */
> +static inline void arch_spin_lock(struct qspinlock *lock)
> +{
> +	if (static_key_false(&paravirt_unfairlocks_enabled)) {
> +		queue_spin_lock_unfair(lock);
> +		return;
> +	}
> +	queue_spin_lock(lock);

What happens when you are booting and you are in the middle of using a
ticketlock (say you are waiting for it and you are in the slow-path)
and suddenly unfairlocks_enabled is turned on?

All the other CPUs start using the unfair version while you are still
in the ticketlock unlocker (or worse, the locker, about to go to sleep).


> +}
> +
> +/**
> + * arch_spin_trylock - try to acquire the queue spinlock
> + * @lock : Pointer to queue spinlock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +static inline int arch_spin_trylock(struct qspinlock *lock)
> +{
> +	if (static_key_false(&paravirt_unfairlocks_enabled)) {
> +		return queue_spin_trylock_unfair(lock);
> +	}
> +	return queue_spin_trylock(lock);
> +}
> +
> +#define arch_spin_lock_flags(l, f)	arch_spin_lock(l)
> +
> +#endif /* CONFIG_PARAVIRT_UNFAIR_LOCKS */
> +
>  #endif /* _ASM_X86_QSPINLOCK_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index cb648c8..1107a20 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -88,6 +88,7 @@ obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
>  obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
>  obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
>  obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
> +obj-$(CONFIG_PARAVIRT_UNFAIR_LOCKS)+= paravirt-spinlocks.o
>  obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
>  
>  obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
> index bbb6c73..a50032a 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -8,6 +8,7 @@
>  
>  #include <asm/paravirt.h>
>  
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
>  struct pv_lock_ops pv_lock_ops = {
>  #ifdef CONFIG_SMP
>  	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
> @@ -18,3 +19,9 @@ EXPORT_SYMBOL(pv_lock_ops);
>  
>  struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
>  EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
> +#endif
> +
> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
> +struct static_key paravirt_unfairlocks_enabled = STATIC_KEY_INIT_FALSE;
> +EXPORT_SYMBOL(paravirt_unfairlocks_enabled);
> +#endif
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 5/8] pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest
  2014-02-26 15:14 ` Waiman Long
  2014-02-26 17:08   ` Konrad Rzeszutek Wilk
@ 2014-02-26 17:08   ` Konrad Rzeszutek Wilk
  2014-02-28 17:08     ` Waiman Long
  2014-02-28 17:08     ` Waiman Long
  2014-02-27  9:41   ` Paolo Bonzini
                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 125+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-02-26 17:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Daniel J Blueman, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Cheg

On Wed, Feb 26, 2014 at 10:14:25AM -0500, Waiman Long wrote:
> This patch adds a KVM init function to activate the unfair queue
> spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config
> option is selected.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  arch/x86/kernel/kvm.c |   17 +++++++++++++++++
>  1 files changed, 17 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 713f1b3..a489140 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void)
>  early_initcall(kvm_spinlock_init_jump);
>  
>  #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
> +
> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
> +/*
> + * Enable unfair lock if running in a real para-virtualized environment
> + */
> +static __init int kvm_unfair_locks_init_jump(void)
> +{
> +	if (!kvm_para_available())
> +		return 0;

I think you also need to check for !kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)?
Otherwise you might enable this, but the kicker function won't be
enabled.
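
I.e. roughly (sketch of the suggested extra guard on top of the quoted
function, using the existing KVM feature-query helpers):

	if (!kvm_para_available())
		return 0;
	/* suggested: skip this if the host cannot kick (unhalt) vCPUs */
	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
		return 0;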
> +
> +	static_key_slow_inc(&paravirt_unfairlocks_enabled);
> +	printk(KERN_INFO "KVM setup unfair spinlock\n");
> +
> +	return 0;
> +}
> +early_initcall(kvm_unfair_locks_init_jump);
> +#endif
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-26 15:14 ` [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support Waiman Long
  2014-02-26 17:54   ` Konrad Rzeszutek Wilk
@ 2014-02-26 17:54   ` Konrad Rzeszutek Wilk
  2014-02-27 12:11   ` David Vrabel
  2014-02-27 12:11   ` David Vrabel
  3 siblings, 0 replies; 125+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-02-26 17:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Daniel J Blueman, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Cheg

On Wed, Feb 26, 2014 at 10:14:27AM -0500, Waiman Long wrote:
> This patch adds para-virtualization support to the queue spinlock code
> by enabling the queue head to kick the lock holder CPU, if known,
> in when the lock isn't released for a certain amount of time. It
  ^^ - ?
> also enables the mutual monitoring of the queue head CPU and the
> following node CPU in the queue to make sure that their CPUs will
> stay scheduled in.

stay scheduled in? How are you influencing the hypervisor to schedule
them in next?  I see this patch "x86: Enable KVM to use qspinlock's PV support"
but that might not be the best choice.

What if the hypervisor has another CPU ready to go - which is also
a lock-holder? Wouldn't it be better to just provide a cpu mask of the
CPUs it could kick?
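
Something like this, say (purely illustrative, not an interface from this
series):

struct pv_lock_ops {
	/* kick whichever of these vCPUs the hypervisor can usefully run */
	void (*kick_cpus)(const struct cpumask *cpus, enum pv_kick_type type);
};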

> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  arch/x86/include/asm/paravirt.h       |    9 ++-
>  arch/x86/include/asm/paravirt_types.h |   12 +++
>  arch/x86/include/asm/pvqspinlock.h    |  176 +++++++++++++++++++++++++++++++++
>  arch/x86/kernel/paravirt-spinlocks.c  |    4 +
>  kernel/locking/qspinlock.c            |   41 +++++++-
>  5 files changed, 235 insertions(+), 7 deletions(-)
>  create mode 100644 arch/x86/include/asm/pvqspinlock.h
> 
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index cd6e161..06d3279 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -711,7 +711,12 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
>  }
>  
>  #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
> -
> +#ifdef CONFIG_QUEUE_SPINLOCK
> +static __always_inline void __queue_kick_cpu(int cpu, enum pv_kick_type type)
> +{
> +	PVOP_VCALL2(pv_lock_ops.kick_cpu, cpu, type);
> +}
> +#else
>  static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
>  							__ticket_t ticket)
>  {
> @@ -723,7 +728,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
>  {
>  	PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
>  }
> -
> +#endif
>  #endif
>  
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 7549b8b..87f8836 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -333,9 +333,21 @@ struct arch_spinlock;
>  typedef u16 __ticket_t;
>  #endif
>  
> +#ifdef CONFIG_QUEUE_SPINLOCK
> +enum pv_kick_type {
> +	PV_KICK_LOCK_HOLDER,
> +	PV_KICK_QUEUE_HEAD,
> +	PV_KICK_NEXT_NODE
> +};
> +#endif
> +
>  struct pv_lock_ops {
> +#ifdef CONFIG_QUEUE_SPINLOCK
> +	void (*kick_cpu)(int cpu, enum pv_kick_type);
> +#else
>  	struct paravirt_callee_save lock_spinning;
>  	void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
> +#endif
>  };
>  
>  /* This contains all the paravirt structures: we get a convenient
> diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h
> new file mode 100644
> index 0000000..45aae39
> --- /dev/null
> +++ b/arch/x86/include/asm/pvqspinlock.h
> @@ -0,0 +1,176 @@
> +#ifndef _ASM_X86_PVQSPINLOCK_H
> +#define _ASM_X86_PVQSPINLOCK_H
> +
> +/*
> + *	Queue Spinlock Para-Virtualization Support
> + *
> + *	+------+	    +-----+ nxtcpu_p1  +----+
> + *	| Lock |	    |Queue|----------->|Next|
> + *	|Holder|<-----------|Head |<-----------|Node|
> + *	+------+ prev_qcode +-----+ prev_qcode +----+
> + *
> + * As long as the current lock holder passes through the slowpath, the queue

Um, why would the lock holder pass through the slowpath? It already
has the lock, hasn't it? Or is this about when it acquired it (either via
fastpath or slowpath) and stashed this information somewhere?


> + * head CPU will have its CPU number stored in prev_qcode. The situation is
> + * the same for the node next to the queue head.
                       ^^^^^^^^         ^^^^^^^^^^

Do you mean to say next node's queue head?
> + *
> + * The next node, while setting up the next pointer in the queue head, can
> + * also store its CPU number in that node. With that change, the queue head

can or MUST?

> + * will have the CPU numbers of both its upstream and downstream neighbors.
> + *
> + * To make forward progress in lock acquisition and release, it is necessary
> + * that both the lock holder and the queue head virtual CPUs are present.
> + * The queue head can monitor the lock holder, but the lock holder can't
> + * monitor the queue head back. As a result, the next node is also brought
> + * into the picture to monitor the queue head. In the above diagram, all the
> + * 3 virtual CPUs should be present with the queue head and next node
> + * monitoring each other to make sure they are both present.

OK, that implies you must have those 3 VCPUs active right?
> + *
> + * Heartbeat counters are used to track if a neighbor is active. There are
> + * 3 different sets of heartbeat counter monitoring going on:
> + * 1) The queue head will wait until the number loop iteration exceeds a
> + *    certain threshold (HEAD_SPIN_THRESHOLD). In that case, it will send
> + *    a kick-cpu signal to the lock holder if it has the CPU number available.
> + *    The kick-cpu siginal will be sent only once as the real lock holder
> + *    may not be the same as what the queue head thinks it is.

Why would it not be the same?

Is there another patch I should read before asking these questions?

> + * 2) The queue head will periodically clear the active flag of the next node.
> + *    It will then check to see if the active flag remains cleared at the end
> + *    of the cycle. If it is, the next node CPU may be scheduled out. So it
> + *    send a kick-cpu signal to make sure that the next node CPU remain active.

So the next CPU can be scheduled out but you also kick it to make sure it is active
(aka scheduled in). Or maybe I am reading this wrong?

> + * 3) The next node CPU will monitor its own active flag to see if it gets
> + *    clear periodically. If it is not, the queue head CPU may be scheduled
         ^^^^ cleared                                             ^^^ have been?
> + *    out. It will then send the kick-cpu signal to the queue head CPU.
> + */
> +
> +/*
> + * Loop thresholds
> + */
> +#define	HEAD_SPIN_THRESHOLD	(1<<12)	/* Threshold to kick lock holder  */
> +#define	CLEAR_ACTIVE_THRESHOLD	(1<<8)	/* Threahold for clearing active flag */
> +#define CLEAR_ACTIVE_MASK	(CLEAR_ACTIVE_THRESHOLD - 1)

Something is off with the tabs here.
> +
> +/*
> + * PV macros
> + */
> +#define PV_SET_VAR(type, var, val)	type var = val
> +#define PV_VAR(var)			var
> +#define	PV_GET_NXTCPU(node)		(node)->pv.nxtcpu_p1

Ditto.
> +
> +/*
> + * Additional fields to be added to the qnode structure
> + *
> + * Try to cram the PV fields into a 32 bits so that it won't increase the
> + * qnode size in x86-64.
> + */
> +#if CONFIG_NR_CPUS >= (1 << 16)
> +#define _cpuid_t	u32
> +#else
> +#define _cpuid_t	u16
> +#endif
> +
> +struct pv_qvars {
> +	u8	 active;	/* Set if CPU active		*/
> +	u8	 prehead;	/* Set if next to queue head	*/
> +	_cpuid_t nxtcpu_p1;	/* CPU number of next node + 1	*/
> +};
> +
> +/**
> + * pv_init_vars - initialize fields in struct pv_qvars
> + * @pv: pointer to struct pv_qvars
> + */
> +static __always_inline void pv_init_vars(struct pv_qvars *pv)
> +{
> +	pv->active    = false;
> +	pv->prehead   = false;
> +	pv->nxtcpu_p1 = 0;
> +}
> +
> +/**
> + * head_spin_check - perform para-virtualization checks for queue head
> + * @count : loop count
> + * @qcode : queue code of the supposed lock holder
> + * @nxtcpu: CPU number of next node + 1
> + * @next  : pointer to the next node
> + * @offset: offset of the pv_qvars within the qnode
> + *
> + * 4 checks will be done:
> + * 1) See if it is time to kick the lock holder
> + * 2) Set the prehead flag of the next node
> + * 3) Clear the active flag of the next node periodically
> + * 4) If the active flag is not set after a while, assume the CPU of the
> + *    next-in-line node is offline and kick it back up again.
> + */
> +static __always_inline void
> +pv_head_spin_check(int *count, u32 qcode, int nxtcpu, void *next, int offset)
> +{
> +	if (!static_key_false(&paravirt_spinlocks_enabled))
> +		return;
> +	if ((++(*count) == HEAD_SPIN_THRESHOLD) && qcode) {
> +		/*
> +		 * Get the CPU number of the lock holder & kick it
> +		 * The lock may have been stealed by another CPU
                                          ^^^^^^ - stolen

> +		 * if PARAVIRT_UNFAIR_LOCKS is set, so the computed
> +		 * CPU number may not be the actual lock holder.
> +		 */
> +		int cpu = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
> +		__queue_kick_cpu(cpu, PV_KICK_LOCK_HOLDER);
> +	}
> +	if (next) {
> +		struct pv_qvars *pv = (struct pv_qvars *)
> +				      ((char *)next + offset);
> +
> +		if (!pv->prehead)
> +			pv->prehead = true;
> +		if ((*count & CLEAR_ACTIVE_MASK) == CLEAR_ACTIVE_MASK)
> +			pv->active = false;
> +		if (((*count & CLEAR_ACTIVE_MASK) == 0) &&
> +			!pv->active && nxtcpu)
> +			/*
> +			 * The CPU of the next node doesn't seem to be
> +			 * active, need to kick it to make sure that
> +			 * it is ready to be transitioned to queue head.
> +			 */
> +			__queue_kick_cpu(nxtcpu - 1, PV_KICK_NEXT_NODE);
> +	}
> +}
> +
> +/**
> + * head_spin_check - perform para-virtualization checks for queue member
> + * @pv   : pointer to struct pv_qvars
> + * @count: loop count
> + * @qcode: queue code of the previous node (queue head if pv->prehead set)
> + *
> + * Set the active flag if it is next to the queue head
> + */
> +static __always_inline void
> +pv_queue_spin_check(struct pv_qvars *pv, int *count, u32 qcode)
> +{
> +	if (!static_key_false(&paravirt_spinlocks_enabled))
> +		return;
> +	if (ACCESS_ONCE(pv->prehead)) {
> +		if (pv->active == false) {
> +			*count = 0;	/* Reset counter */
> +			pv->active = true;
> +		}
> +		if ((++(*count) >= 4 * CLEAR_ACTIVE_THRESHOLD) && qcode) {

This magic value could be wrapped in a macro.
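
For instance (illustrative name, not from the series):

#define PV_QHEAD_KICK_THRESHOLD	(4 * CLEAR_ACTIVE_THRESHOLD)

so the test above reads "++(*count) >= PV_QHEAD_KICK_THRESHOLD".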
> +			/*
> +			 * The queue head isn't clearing the active flag for
                                          ^^^^^^^^^^^^^ hadn't cleared
> +			 * too long. Need to kick it.
> +			 */
> +			int cpu = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
> +			__queue_kick_cpu(cpu, PV_KICK_QUEUE_HEAD);
> +			*count = 0;
> +		}
> +	}
> +}
> +
> +/**
> + * pv_set_cpu - set CPU # in the given pv_qvars structure
> + * @pv : pointer to struct pv_qvars to be set
> + * @cpu: cpu number to be set
> + */
> +static __always_inline void pv_set_cpu(struct pv_qvars *pv, int cpu)
> +{
> +	pv->nxtcpu_p1 = cpu + 1;
> +}
> +
> +#endif /* _ASM_X86_PVQSPINLOCK_H */
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
> index 8c67cbe..30d76f5 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -11,9 +11,13 @@
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  struct pv_lock_ops pv_lock_ops = {
>  #ifdef CONFIG_SMP
> +#ifdef CONFIG_QUEUE_SPINLOCK
> +	.kick_cpu = paravirt_nop,
> +#else
>  	.lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
>  	.unlock_kick = paravirt_nop,
>  #endif
> +#endif
>  };
>  EXPORT_SYMBOL(pv_lock_ops);
>  
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 22a63fa..f10446e 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -58,6 +58,26 @@
>   */
>  
>  /*
> + * Para-virtualized queue spinlock support
> + */
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +#include <asm/pvqspinlock.h>
> +#else
> +
> +#define PV_SET_VAR(type, var, val)
> +#define PV_VAR(var)			0
> +#define PV_GET_NXTCPU(node)		0
> +
> +struct pv_qvars {};
> +static __always_inline void pv_init_vars(struct pv_qvars *pv)		{}
> +static __always_inline void pv_head_spin_check(int *count, u32 qcode,
> +				int nxtcpu, void *next, int offset)	{}
> +static __always_inline void pv_queue_spin_check(struct pv_qvars *pv,
> +				int *count, u32 qcode)			{}
> +static __always_inline void pv_set_cpu(struct pv_qvars *pv, int cpu)	{}
> +#endif
> +
> +/*
>   * The 24-bit queue node code is divided into the following 2 fields:
>   * Bits 0-1 : queue node index (4 nodes)
>   * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
> @@ -77,15 +97,13 @@
>  
>  /*
>   * The queue node structure
> - *
> - * This structure is essentially the same as the mcs_spinlock structure
> - * in mcs_spinlock.h file. This structure is retained for future extension
> - * where new fields may be added.

How come you are deleting this? Should that be a part of another patch?

>   */
>  struct qnode {
>  	u32		 wait;		/* Waiting flag		*/
> +	struct pv_qvars	 pv;		/* Para-virtualization  */
>  	struct qnode	*next;		/* Next queue node addr */
>  };
> +#define PV_OFFSET	offsetof(struct qnode, pv)
>  
>  struct qnode_set {
>  	struct qnode	nodes[MAX_QNODES];
> @@ -441,6 +459,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>  	unsigned int cpu_nr, qn_idx;
>  	struct qnode *node, *next;
>  	u32 prev_qcode, my_qcode;
> +	PV_SET_VAR(int, hcnt, 0);
>  
>  	/*
>  	 * Try the quick spinning code path
> @@ -468,6 +487,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>  	 */
>  	node->wait = true;
>  	node->next = NULL;
> +	pv_init_vars(&node->pv);
>  
>  	/*
>  	 * The lock may be available at this point, try again if no task was
> @@ -522,13 +542,22 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>  		 * and set up the "next" fields of the that node.
>  		 */
>  		struct qnode *prev = xlate_qcode(prev_qcode);
> +		PV_SET_VAR(int, qcnt, 0);
>  
>  		ACCESS_ONCE(prev->next) = node;
>  		/*
> +		 * Set current CPU number into the previous node
> +		 */
> +		pv_set_cpu(&prev->pv, cpu_nr);
> +
> +		/*
>  		 * Wait until the waiting flag is off
>  		 */
> -		while (smp_load_acquire(&node->wait))
> +		while (smp_load_acquire(&node->wait)) {
>  			arch_mutex_cpu_relax();
> +			pv_queue_spin_check(&node->pv, PV_VAR(&qcnt),
> +					    prev_qcode);
> +		}
>  	}
>  
>  	/*
> @@ -560,6 +589,8 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>  				goto release_node;
>  		}
>  		arch_mutex_cpu_relax();
> +		pv_head_spin_check(PV_VAR(&hcnt), prev_qcode,
> +				PV_GET_NXTCPU(node), node->next, PV_OFFSET);
>  	}
>  
>  notify_next:
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support
  2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
                   ` (17 preceding siblings ...)
  2014-02-26 17:00 ` Konrad Rzeszutek Wilk
@ 2014-02-26 22:26 ` Paul E. McKenney
  2014-02-26 22:26 ` Paul E. McKenney
  19 siblings, 0 replies; 125+ messages in thread
From: Paul E. McKenney @ 2014-02-26 22:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, x86, Peter Zijlstra, virtualization,
	Andi Kleen, H. Peter Anvin, Michel Lespinasse, Alok Kataria,
	linux-arch, Raghavendra K T, Ingo Molnar, Scott J Norton,
	xen-devel, Alexander Fyodorov, Arnd Bergmann, Daniel J Blueman,
	Rusty Russell, Oleg Nesterov, Steven Rostedt, Chris Wright,
	George Spelvin, Thomas Gleixner, Aswin Chandramouleeswaran,
	Chegu Vinod, Boris

On Wed, Feb 26, 2014 at 10:14:20AM -0500, Waiman Long wrote:

This series passes a short locktorture test when based on top of current
tip/core/locking.  This is for both the first three patches and for the
full set, though in the latter case it took me an embarrassingly large
number of tries to get PARAVIRT_UNFAIR_LOCKS set properly.

Again, don't read too much into this.  This was in an 8-CPU KVM guest
on x86 (though with an interfering kernel build running on the host),
and as noted earlier, locktorture is still a bit on the lame side.

							Thanx, Paul

> v4->v5:
>  - Move the optimized 2-task contending code to the generic file to
>    enable more architectures to use it without code duplication.
>  - Address some of the style-related comments by PeterZ.
>  - Allow the use of unfair queue spinlock in a real para-virtualized
>    execution environment.
>  - Add para-virtualization support to the qspinlock code by ensuring
>    that the lock holder and queue head stay alive as much as possible.
> 
> v3->v4:
>  - Remove debugging code and fix a configuration error
>  - Simplify the qspinlock structure and streamline the code to make it
>    perform a bit better
>  - Add an x86 version of asm/qspinlock.h for holding x86 specific
>    optimization.
>  - Add an optimized x86 code path for 2 contending tasks to improve
>    low contention performance.
> 
> v2->v3:
>  - Simplify the code by using numerous mode only without an unfair option.
>  - Use the latest smp_load_acquire()/smp_store_release() barriers.
>  - Move the queue spinlock code to kernel/locking.
>  - Make the use of queue spinlock the default for x86-64 without user
>    configuration.
>  - Additional performance tuning.
> 
> v1->v2:
>  - Add some more comments to document what the code does.
>  - Add a numerous CPU mode to support >= 16K CPUs
>  - Add a configuration option to allow lock stealing which can further
>    improve performance in many cases.
>  - Enable wakeup of queue head CPU at unlock time for non-numerous
>    CPU mode.
> 
> This patch set has 3 different sections:
>  1) Patches 1-3: Introduces a queue-based spinlock implementation that
>     can replace the default ticket spinlock without increasing the
>     size of the spinlock data structure. As a result, critical kernel
>     data structures that embed spinlock won't increase in size and
>     breaking data alignments.
>  2) Patches 4 and 5: Enables the use of unfair queue spinlock in a
>     real para-virtualized execution environment. This can resolve
>     some of the locking related performance issues due to the fact
>     that the next CPU to get the lock may have been scheduled out
>     for a period of time.
>  3) Patches 6-8: Enable qspinlock para-virtualization support by making
>     sure that the lock holder and the queue head stay alive as long as
>     possible.
> 
> Patches 1-3 are fully tested and ready for production. Patches 4-8, on
> the other hands, are not fully tested. They have undergone compilation
> tests with various combinations of kernel config setting and boot-up
> tests in a non-virtualized setting. Further tests and performance
> characterization are still needed to be done in a KVM guest. So
> comments on them are welcomed. Suggestions or recommendations on how
> to add PV support in the Xen environment are also needed.
> 
> The queue spinlock has slightly better performance than the ticket
> spinlock in uncontended case. Its performance can be much better
> with moderate to heavy contention.  This patch has the potential of
> improving the performance of all the workloads that have moderate to
> heavy spinlock contention.
> 
> The queue spinlock is especially suitable for NUMA machines with at
> least 2 sockets, though noticeable performance benefit probably won't
> show up in machines with less than 4 sockets.
> 
> The purpose of this patch set is not to solve any particular spinlock
> contention problems. Those need to be solved by refactoring the code
> to make more efficient use of the lock or finer granularity ones. The
> main purpose is to make the lock contention problems more tolerable
> until someone can spend the time and effort to fix them.
> 
> Waiman Long (8):
>   qspinlock: Introducing a 4-byte queue spinlock implementation
>   qspinlock, x86: Enable x86-64 to use queue spinlock
>   qspinlock, x86: Add x86 specific optimization for 2 contending tasks
>   pvqspinlock, x86: Allow unfair spinlock in a real PV environment
>   pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest
>   pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
>   pvqspinlock, x86: Add qspinlock para-virtualization support
>   pvqspinlock, x86: Enable KVM to use qspinlock's PV support
> 
>  arch/x86/Kconfig                      |   12 +
>  arch/x86/include/asm/paravirt.h       |    9 +-
>  arch/x86/include/asm/paravirt_types.h |   12 +
>  arch/x86/include/asm/pvqspinlock.h    |  176 ++++++++++
>  arch/x86/include/asm/qspinlock.h      |  133 +++++++
>  arch/x86/include/asm/spinlock.h       |    9 +-
>  arch/x86/include/asm/spinlock_types.h |    4 +
>  arch/x86/kernel/Makefile              |    1 +
>  arch/x86/kernel/kvm.c                 |   73 ++++-
>  arch/x86/kernel/paravirt-spinlocks.c  |   15 +-
>  arch/x86/xen/spinlock.c               |    2 +-
>  include/asm-generic/qspinlock.h       |  122 +++++++
>  include/asm-generic/qspinlock_types.h |   61 ++++
>  kernel/Kconfig.locks                  |    7 +
>  kernel/locking/Makefile               |    1 +
>  kernel/locking/qspinlock.c            |  610 +++++++++++++++++++++++++++++++++
>  16 files changed, 1239 insertions(+), 8 deletions(-)
>  create mode 100644 arch/x86/include/asm/pvqspinlock.h
>  create mode 100644 arch/x86/include/asm/qspinlock.h
>  create mode 100644 include/asm-generic/qspinlock.h
>  create mode 100644 include/asm-generic/qspinlock_types.h
>  create mode 100644 kernel/locking/qspinlock.c
> 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 8/8] pvqspinlock, x86: Enable KVM to use qspinlock's PV support
  2014-02-26 15:14 ` Waiman Long
  2014-02-27  9:31   ` Paolo Bonzini
@ 2014-02-27  9:31   ` Paolo Bonzini
  2014-02-27 18:36     ` Waiman Long
  2014-02-27 18:36     ` Waiman Long
  1 sibling, 2 replies; 125+ messages in thread
From: Paolo Bonzini @ 2014-02-27  9:31 UTC (permalink / raw)
  To: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Arnd Bergmann, Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Rik van Riel, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Alok Kataria, Aswin Chandramouleeswaran, Chegu Vinod,
	Linus Torvalds

Il 26/02/2014 16:14, Waiman Long ha scritto:
> This patch enables KVM to use the queue spinlock's PV support code
> when the PARAVIRT_SPINLOCKS kernel config option is set. However,
> PV support for Xen is not ready yet and so the queue spinlock will
> still have to be disabled when PARAVIRT_SPINLOCKS config option is
> on with Xen.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  arch/x86/kernel/kvm.c |   54 +++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/Kconfig.locks  |    2 +-
>  2 files changed, 55 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index f318e78..3ddc436 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -568,6 +568,7 @@ static void kvm_kick_cpu(int cpu)
>  	kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
>  }
>  
> +#ifndef CONFIG_QUEUE_SPINLOCK
>  enum kvm_contention_stat {
>  	TAKEN_SLOW,
>  	TAKEN_SLOW_PICKUP,
> @@ -795,6 +796,55 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
>  		}
>  	}
>  }
> +#else /* !CONFIG_QUEUE_SPINLOCK */
> +
> +#ifdef CONFIG_KVM_DEBUG_FS
> +static struct dentry *d_spin_debug;
> +static struct dentry *d_kvm_debug;
> +static u32 lh_kick_stats;	/* Lock holder kick count */
> +static u32 qh_kick_stats;	/* Queue head kick count  */
> +static u32 nn_kick_stats;	/* Next node kick count   */
> +
> +static int __init kvm_spinlock_debugfs(void)
> +{
> +	d_kvm_debug = debugfs_create_dir("kvm-guest", NULL);
> +	if (!d_kvm_debug) {
> +		printk(KERN_WARNING
> +		       "Could not create 'kvm' debugfs directory\n");
> +		return -ENOMEM;
> +	}
> +	d_spin_debug = debugfs_create_dir("spinlocks", d_kvm_debug);
> +
> +	debugfs_create_u32("lh_kick_stats", 0644, d_spin_debug, &lh_kick_stats);
> +	debugfs_create_u32("qh_kick_stats", 0644, d_spin_debug, &qh_kick_stats);
> +	debugfs_create_u32("nn_kick_stats", 0644, d_spin_debug, &nn_kick_stats);
> +
> +	return 0;
> +}
> +
> +static inline void inc_kick_stats(enum pv_kick_type type)
> +{
> +	if (type == PV_KICK_LOCK_HOLDER)
> +		add_smp(&lh_kick_stats, 1);
> +	else if (type == PV_KICK_QUEUE_HEAD)
> +		add_smp(&qh_kick_stats, 1);
> +	else
> +		add_smp(&nn_kick_stats, 1);
> +}
> +fs_initcall(kvm_spinlock_debugfs);
> +
> +#else /* CONFIG_KVM_DEBUG_FS */
> +static inline void inc_kick_stats(enum pv_kick_type type)
> +{
> +}
> +#endif /* CONFIG_KVM_DEBUG_FS */
> +
> +static void kvm_kick_cpu_type(int cpu, enum pv_kick_type type)
> +{
> +	kvm_kick_cpu(cpu);
> +	inc_kick_stats(type);
> +}
> +#endif /* !CONFIG_QUEUE_SPINLOCK */
>  
>  /*
>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
> @@ -807,8 +857,12 @@ void __init kvm_spinlock_init(void)
>  	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
>  		return;
>  
> +#ifdef CONFIG_QUEUE_SPINLOCK
> +	pv_lock_ops.kick_cpu = kvm_kick_cpu_type;
> +#else
>  	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
>  	pv_lock_ops.unlock_kick = kvm_unlock_kick;
> +#endif
>  }
>  
>  static __init int kvm_spinlock_init_jump(void)
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index f185584..a70fdeb 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK
>  
>  config QUEUE_SPINLOCK
>  	def_bool y if ARCH_USE_QUEUE_SPINLOCK
> -	depends on SMP && !PARAVIRT_SPINLOCKS
> +	depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
> 

Should this rather be

    def_bool y if ARCH_USE_QUEUE_SPINLOCK && (!PARAVIRT_SPINLOCKS || !XEN)

?

PARAVIRT_SPINLOCKS + XEN + QUEUE_SPINLOCK + PARAVIRT_UNFAIR_LOCKS is a
valid combination, but it's impossible to choose PARAVIRT_UNFAIR_LOCKS
if QUEUE_SPINLOCK is unavailable.

Paolo

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 5/8] pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest
  2014-02-26 15:14 ` Waiman Long
  2014-02-26 17:08   ` Konrad Rzeszutek Wilk
  2014-02-26 17:08   ` Konrad Rzeszutek Wilk
@ 2014-02-27  9:41   ` Paolo Bonzini
  2014-02-27 19:05     ` Waiman Long
  2014-02-27 19:05     ` Waiman Long
  2014-02-27  9:41   ` Paolo Bonzini
                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 125+ messages in thread
From: Paolo Bonzini @ 2014-02-27  9:41 UTC (permalink / raw)
  To: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Arnd Bergmann, Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, virtualization, Andi Kleen,
	Michel Lespinasse, Boris Ostrovsky, linux-arch, x86,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Rik van Riel, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Alok Kataria, Aswin Chandramouleeswaran, Chegu Vinod,
	Linus Torvalds

Il 26/02/2014 16:14, Waiman Long ha scritto:
> This patch adds a KVM init function to activate the unfair queue
> spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config
> option is selected.
>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  arch/x86/kernel/kvm.c |   17 +++++++++++++++++
>  1 files changed, 17 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 713f1b3..a489140 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void)
>  early_initcall(kvm_spinlock_init_jump);
>
>  #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
> +
> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
> +/*
> + * Enable unfair lock if running in a real para-virtualized environment
> + */
> +static __init int kvm_unfair_locks_init_jump(void)
> +{
> +	if (!kvm_para_available())
> +		return 0;
> +
> +	static_key_slow_inc(&paravirt_unfairlocks_enabled);
> +	printk(KERN_INFO "KVM setup unfair spinlock\n");
> +
> +	return 0;
> +}
> +early_initcall(kvm_unfair_locks_init_jump);
> +#endif
>

I think this should apply to all paravirt implementations, unless 
pv_lock_ops.kick_cpu != NULL.
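
Something along these lines, perhaps (untested sketch, not part of the
posted series; it assumes kick_cpu is left NULL rather than set to
paravirt_nop when no kick hook is available, and the function name is
made up):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/jump_label.h>
#include <asm/paravirt.h>

/*
 * Enable the unfair lock for any paravirtualized guest that cannot
 * provide a working kick_cpu hook, instead of keying the decision on
 * KVM alone.  paravirt_unfairlocks_enabled is the static key added
 * earlier in this series.
 */
static __init int pv_unfair_locks_init(void)
{
        if (pv_lock_ops.kick_cpu)
                return 0;       /* PV kick available, keep the fair PV path */

        static_key_slow_inc(&paravirt_unfairlocks_enabled);
        printk(KERN_INFO "paravirt: setup unfair spinlock\n");

        return 0;
}
early_initcall(pv_unfair_locks_init);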

Paolo

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 5/8] pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest
  2014-02-26 15:14 ` Waiman Long
                     ` (4 preceding siblings ...)
  2014-02-27 10:40   ` Raghavendra K T
@ 2014-02-27 10:40   ` Raghavendra K T
  2014-02-27 19:12     ` Waiman Long
  2014-02-27 19:12     ` Waiman Long
  5 siblings, 2 replies; 125+ messages in thread
From: Raghavendra K T @ 2014-02-27 10:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Peter Zijlstra, virtualization, Andi Kleen,
	H. Peter Anvin, Michel Lespinasse, Alok Kataria, linux-arch, x86,
	Ingo Molnar, Scott J Norton, xen-devel, Paul E. McKenney,
	Alexander Fyodorov, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Daniel J Blueman, Oleg Nesterov,
	Steven Rostedt, Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu

On 02/26/2014 08:44 PM, Waiman Long wrote:
> This patch adds a KVM init function to activate the unfair queue
> spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config
> option is selected.
>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>   arch/x86/kernel/kvm.c |   17 +++++++++++++++++
>   1 files changed, 17 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 713f1b3..a489140 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void)
>   early_initcall(kvm_spinlock_init_jump);
>
>   #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
> +
> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
> +/*
> + * Enable unfair lock if running in a real para-virtualized environment
> + */
> +static __init int kvm_unfair_locks_init_jump(void)
> +{
> +	if (!kvm_para_available())
> +		return 0;
> +

kvm_kick_cpu_type() in patch 8 assumes that the host supports the kick
hypercall (KVM_HC_KICK_CPU).

I think that needs an explicit check for
kvm_para_has_feature(KVM_FEATURE_PV_UNHALT); otherwise things may break
in the (unlikely) case of running a new guest on an old host?

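i.e. something like this (untested sketch, using the same helpers as
patches 5 and 8):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/jump_label.h>
#include <asm/kvm_para.h>

static __init int kvm_unfair_locks_init_jump(void)
{
        if (!kvm_para_available())
                return 0;
        /* Old host without KVM_HC_KICK_CPU: don't rely on PV kicks. */
        if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
                return 0;

        static_key_slow_inc(&paravirt_unfairlocks_enabled);
        printk(KERN_INFO "KVM setup unfair spinlock\n");

        return 0;
}
early_initcall(kvm_unfair_locks_init_jump);
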

> +	static_key_slow_inc(&paravirt_unfairlocks_enabled);
> +	printk(KERN_INFO "KVM setup unfair spinlock\n");
> +
> +	return 0;
> +}
> +early_initcall(kvm_unfair_locks_init_jump);
> +#endif
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-26 15:14 ` [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support Waiman Long
  2014-02-26 17:54   ` Konrad Rzeszutek Wilk
  2014-02-26 17:54   ` Konrad Rzeszutek Wilk
@ 2014-02-27 12:11   ` David Vrabel
  2014-02-27 13:11     ` Paolo Bonzini
  2014-02-27 13:11     ` Paolo Bonzini
  2014-02-27 12:11   ` David Vrabel
  3 siblings, 2 replies; 125+ messages in thread
From: David Vrabel @ 2014-02-27 12:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 26/02/14 15:14, Waiman Long wrote:
> This patch adds para-virtualization support to the queue spinlock code
> by enabling the queue head to kick the lock holder CPU, if known,
> in when the lock isn't released for a certain amount of time. It
> also enables the mutual monitoring of the queue head CPU and the
> following node CPU in the queue to make sure that their CPUs will
> stay scheduled in.

I'm not really understanding how this is supposed to work.  There
appears to be an assumption that a guest can keep one of its VCPUs
running by repeatedly kicking it?  This is not possible under Xen and I
doubt it's possible under KVM or any other hypervisor.

David

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  2014-02-26 15:14 ` Waiman Long
  2014-02-26 17:07   ` Konrad Rzeszutek Wilk
  2014-02-26 17:07   ` Konrad Rzeszutek Wilk
@ 2014-02-27 12:28   ` David Vrabel
  2014-02-27 19:40     ` Waiman Long
  2014-02-27 19:40     ` Waiman Long
  2014-02-27 12:28   ` David Vrabel
  3 siblings, 2 replies; 125+ messages in thread
From: David Vrabel @ 2014-02-27 12:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 26/02/14 15:14, Waiman Long wrote:
> Locking is always an issue in a virtualized environment as the virtual
> CPU that is waiting on a lock may get scheduled out and hence block
> any progress in lock acquisition even when the lock has been freed.
> 
> One solution to this problem is to allow unfair lock in a
> para-virtualized environment. In this case, a new lock acquirer can
> come and steal the lock if the next-in-line CPU to get the lock is
> scheduled out. Unfair lock in a native environment is generally not a
> good idea as there is a possibility of lock starvation for a heavily
> contended lock.

I'm not sure I'm keen on losing the fairness in a PV environment.  I'm
concerned that on an over-committed host, the lock starvation problem
will be particularly bad.

But I'll have to revisit this once a non-broken PV qspinlock
implementation exists (or someone explains how the proposed one works).

David

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-27 12:11   ` David Vrabel
@ 2014-02-27 13:11     ` Paolo Bonzini
  2014-02-27 14:18       ` David Vrabel
  2014-02-27 14:18       ` David Vrabel
  2014-02-27 13:11     ` Paolo Bonzini
  1 sibling, 2 replies; 125+ messages in thread
From: Paolo Bonzini @ 2014-02-27 13:11 UTC (permalink / raw)
  To: David Vrabel, Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

Il 27/02/2014 13:11, David Vrabel ha scritto:
>> > This patch adds para-virtualization support to the queue spinlock code
>> > by enabling the queue head to kick the lock holder CPU, if known,
>> > in when the lock isn't released for a certain amount of time. It
>> > also enables the mutual monitoring of the queue head CPU and the
>> > following node CPU in the queue to make sure that their CPUs will
>> > stay scheduled in.
> I'm not really understanding how this is supposed to work.  There
> appears to be an assumption that a guest can keep one of its VCPUs
> running by repeatedly kicking it?  This is not possible under Xen and I
> doubt it's possible under KVM or any other hypervisor.

KVM allows any VCPU to wake up a currently halted VCPU of its choice, 
see Documentation/virtual/kvm/hypercalls.txt.

   5. KVM_HC_KICK_CPU
   ------------------------
   Architecture: x86
   Status: active
   Purpose: Hypercall used to wakeup a vcpu from HLT state
   Usage example : A vcpu of a paravirtualized guest that is busywaiting
   in guest kernel mode for an event to occur (ex: a spinlock to become
   available) can execute HLT instruction once it has busy-waited for more
   than a threshold time-interval. Execution of HLT instruction would cause
   the hypervisor to put the vcpu to sleep until occurrence of an appropriate
   event. Another vcpu of the same guest can wakeup the sleeping vcpu by
   issuing KVM_HC_KICK_CPU hypercall, specifying APIC ID (a1) of the vcpu
   to be woken up. An additional argument (a0) is used in the hypercall for
   future use.

This is the same as a dummy IPI, but cheaper (about 2000 clock cycles 
wasted on the source VCPU, and the latency on the destination is about 
half; an IPI costs roughly the same on the source and much more on the 
destination).
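
For reference, the guest side of such a kick boils down to something
like this (sketch modelled on the existing kvm_kick_cpu() in
arch/x86/kernel/kvm.c; the name kvm_kick_vcpu is just a placeholder):

#include <linux/percpu.h>
#include <asm/kvm_para.h>
#include <asm/smp.h>

static void kvm_kick_vcpu(int cpu)
{
        unsigned long flags = 0;        /* a0: reserved for future use */
        int apicid = per_cpu(x86_cpu_to_apicid, cpu);

        /* Wake the halted target VCPU, identified by its APIC ID. */
        kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
}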

It looks like Xen could use an event channel.

Paolo

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-27 13:11     ` Paolo Bonzini
  2014-02-27 14:18       ` David Vrabel
@ 2014-02-27 14:18       ` David Vrabel
  2014-02-27 14:45         ` Paolo Bonzini
  2014-02-27 14:45         ` Paolo Bonzini
  1 sibling, 2 replies; 125+ messages in thread
From: David Vrabel @ 2014-02-27 14:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, Boris Ostrovsky, x86, Ingo Molnar,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Rik van Riel, Arnd Bergmann, Konrad Rzeszutek Wilk,
	Daniel J Blueman, Oleg Nesterov, Steven Rostedt, Chris Wright,
	George Spelvin

On 27/02/14 13:11, Paolo Bonzini wrote:
> Il 27/02/2014 13:11, David Vrabel ha scritto:
>>> > This patch adds para-virtualization support to the queue spinlock code
>>> > by enabling the queue head to kick the lock holder CPU, if known,
>>> > in when the lock isn't released for a certain amount of time. It
>>> > also enables the mutual monitoring of the queue head CPU and the
>>> > following node CPU in the queue to make sure that their CPUs will
>>> > stay scheduled in.
>> I'm not really understanding how this is supposed to work.  There
>> appears to be an assumption that a guest can keep one of its VCPUs
>> running by repeatedly kicking it?  This is not possible under Xen and I
>> doubt it's possible under KVM or any other hypervisor.
> 
> KVM allows any VCPU to wake up a currently halted VCPU of its choice,
> see Documentation/virtual/kvm/hypercalls.txt.

But neither of the VCPUs being kicked here are halted -- they're either
running or runnable (descheduled by the hypervisor).

David

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-27 14:18       ` David Vrabel
@ 2014-02-27 14:45         ` Paolo Bonzini
  2014-02-27 15:22           ` Raghavendra K T
                             ` (3 more replies)
  2014-02-27 14:45         ` Paolo Bonzini
  1 sibling, 4 replies; 125+ messages in thread
From: Paolo Bonzini @ 2014-02-27 14:45 UTC (permalink / raw)
  To: David Vrabel
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, Boris Ostrovsky, x86, Ingo Molnar,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Rik van Riel, Arnd Bergmann, Konrad Rzeszutek Wilk,
	Daniel J Blueman, Oleg Nesterov, Steven Rostedt, Chris Wright,
	George Spelvin

Il 27/02/2014 15:18, David Vrabel ha scritto:
> On 27/02/14 13:11, Paolo Bonzini wrote:
>> Il 27/02/2014 13:11, David Vrabel ha scritto:
>>>>> This patch adds para-virtualization support to the queue spinlock code
>>>>> by enabling the queue head to kick the lock holder CPU, if known,
>>>>> in when the lock isn't released for a certain amount of time. It
>>>>> also enables the mutual monitoring of the queue head CPU and the
>>>>> following node CPU in the queue to make sure that their CPUs will
>>>>> stay scheduled in.
>>> I'm not really understanding how this is supposed to work.  There
>>> appears to be an assumption that a guest can keep one of its VCPUs
>>> running by repeatedly kicking it?  This is not possible under Xen and I
>>> doubt it's possible under KVM or any other hypervisor.
>>
>> KVM allows any VCPU to wake up a currently halted VCPU of its choice,
>> see Documentation/virtual/kvm/hypercalls.txt.
>
> But neither of the VCPUs being kicked here are halted -- they're either
> running or runnable (descheduled by the hypervisor).

/me actually looks at Waiman's code...

Right, this is really different from pvticketlocks, where the *unlock* 
primitive wakes up a sleeping VCPU.  It is more similar to PLE 
(pause-loop exiting).

Paolo

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-27 14:45         ` Paolo Bonzini
@ 2014-02-27 15:22           ` Raghavendra K T
  2014-02-27 15:50             ` Paolo Bonzini
                               ` (3 more replies)
  2014-02-27 15:22           ` Raghavendra K T
                             ` (2 subsequent siblings)
  3 siblings, 4 replies; 125+ messages in thread
From: Raghavendra K T @ 2014-02-27 15:22 UTC (permalink / raw)
  To: Paolo Bonzini, David Vrabel, Waiman Long
  Cc: Jeremy Fitzhardinge, Peter Zijlstra, virtualization, Andi Kleen,
	H. Peter Anvin, Michel Lespinasse, Alok Kataria, linux-arch, x86,
	Ingo Molnar, Scott J Norton, xen-devel, Paul E. McKenney,
	Alexander Fyodorov, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Daniel J Blueman, Oleg Nesterov,
	Steven Rostedt, Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu

On 02/27/2014 08:15 PM, Paolo Bonzini wrote:
[...]
>> But neither of the VCPUs being kicked here are halted -- they're either
>> running or runnable (descheduled by the hypervisor).
>
> /me actually looks at Waiman's code...
>
> Right, this is really different from pvticketlocks, where the *unlock*
> primitive wakes up a sleeping VCPU.  It is more similar to PLE
> (pause-loop exiting).

Adding to the discussion, I see two possibilities here, given that in
the undercommit case we should not exceed HEAD_SPIN_THRESHOLD:

1. the looping vcpu in pv_head_spin_check() should do halt(),
considering that we have already spun for more than a typical
lock-hold time and hence are likely in an overcommit situation.

2. multiplex kick_cpu to do a directed yield in the qspinlock case.
But this may result in some ping-ponging?

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-27 15:22           ` Raghavendra K T
  2014-02-27 15:50             ` Paolo Bonzini
@ 2014-02-27 15:50             ` Paolo Bonzini
  2014-03-03 11:06               ` [Xen-devel] " David Vrabel
  2014-03-03 11:06               ` David Vrabel
  2014-02-27 20:50             ` Waiman Long
  2014-02-27 20:50             ` Waiman Long
  3 siblings, 2 replies; 125+ messages in thread
From: Paolo Bonzini @ 2014-02-27 15:50 UTC (permalink / raw)
  To: Raghavendra K T, David Vrabel, Waiman Long
  Cc: Jeremy Fitzhardinge, Peter Zijlstra, virtualization, Andi Kleen,
	H. Peter Anvin, Michel Lespinasse, Alok Kataria, linux-arch, x86,
	Ingo Molnar, Scott J Norton, xen-devel, Paul E. McKenney,
	Alexander Fyodorov, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Daniel J Blueman, Oleg Nesterov,
	Steven Rostedt, Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu

Il 27/02/2014 16:22, Raghavendra K T ha scritto:
> On 02/27/2014 08:15 PM, Paolo Bonzini wrote:
> [...]
>>> But neither of the VCPUs being kicked here are halted -- they're either
>>> running or runnable (descheduled by the hypervisor).
>>
>> /me actually looks at Waiman's code...
>>
>> Right, this is really different from pvticketlocks, where the *unlock*
>> primitive wakes up a sleeping VCPU.  It is more similar to PLE
>> (pause-loop exiting).
>
> Adding to the discussion, I see there are two possibilities here,
> considering that in undercommit cases we should not exceed
> HEAD_SPIN_THRESHOLD,
>
> 1. the looping vcpu in pv_head_spin_check() should do halt()
> considering that we have done enough spinning (more than typical
> lock-hold time), and hence we are in potential overcommit.
>
> 2. multiplex kick_cpu to do directed yield in qspinlock case.
> But this may result in some ping ponging?

Actually, I think the qspinlock can work roughly the same as the 
pvticketlock, using the same lock_spinning and unlock_lock hooks.

The x86-specific codepath can use bit 1 in the ->wait byte as "I have 
halted, please kick me".

	value = _QSPINLOCK_WAITING;
	i = 0;
	do {
		cpu_relax();
	} while (ACCESS_ONCE(slock->lock) && i++ < BUSY_WAIT);

	if (ACCESS_ONCE(slock->lock)) {
		/* Advertise "I have halted, please kick me" before sleeping */
		value |= _QSPINLOCK_HALTED;
		xchg(&slock->wait, value >> 8);
		if (ACCESS_ONCE(slock->lock)) {
			... call lock_spinning hook ...
		}
	}

	/*
	 * Set the lock bit & clear the halted+waiting bits
	 */
	if (cmpxchg(&slock->lock_wait, value,
		    _QSPINLOCK_LOCKED) == value)
		return -1;	/* Got the lock */
	__atomic_and(&slock->lock_wait, ~_QSPINLOCK_HALTED);

The lock_spinning/unlock_lock code can probably be much simpler, because 
you do not need to keep a list of all spinning locks.  Unlock_lock can 
just use the CPU number to wake up the right CPU.
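
To make the last point concrete, the unlock hook could take roughly
this shape (purely a sketch; queue_head_cpu(), kick_cpu() and the
HALTED test are placeholder assumptions, not code from the series):

	static void pv_queue_unlock_kick(struct qspinlock *lock)
	{
		union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
		int cpu;

		/* Nothing to do unless the queue head advertised it has halted */
		if (!(ACCESS_ONCE(qlock->wait) & (_QSPINLOCK_HALTED >> 8)))
			return;

		cpu = queue_head_cpu(lock);	/* decoded from the queue code */
		kick_cpu(cpu);			/* KVM_HC_KICK_CPU, or a Xen evtchn */
	}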

Paolo

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 8/8] pvqspinlock, x86: Enable KVM to use qspinlock's PV support
  2014-02-27  9:31   ` Paolo Bonzini
  2014-02-27 18:36     ` Waiman Long
@ 2014-02-27 18:36     ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27 18:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 02/27/2014 04:31 AM, Paolo Bonzini wrote:
>   static __init int kvm_spinlock_init_jump(void)
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index f185584..a70fdeb 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK
>
>   config QUEUE_SPINLOCK
>   	def_bool y if ARCH_USE_QUEUE_SPINLOCK
> -	depends on SMP&&  !PARAVIRT_SPINLOCKS
> +	depends on SMP&&  (!PARAVIRT_SPINLOCKS || !XEN)
>
> Should this rather be
>
>      def_bool y if ARCH_USE_QUEUE_SPINLOCK&&  (!PARAVIRT_SPINLOCKS || !XEN)
>
> ?
>
> PARAVIRT_SPINLOCKS + XEN + QUEUE_SPINLOCK + PARAVIRT_UNFAIR_LOCKS is a
> valid combination, but it's impossible to choose PARAVIRT_UNFAIR_LOCKS
> if QUEUE_SPINLOCK is unavailable.
>
> Paolo

The PV ticketlock code assumes the presence of the ticket spinlock, so
it will cause a compilation error if QUEUE_SPINLOCK is enabled. My
patches 7 and 8 modified the KVM code so that the PV ticketlock code
can coexist with the queue spinlock. As I haven't figured out the
proper way to modify the Xen code, I need to disable the queue spinlock
code when PARAVIRT_SPINLOCKS and XEN are both enabled. However, by
disabling PARAVIRT_SPINLOCKS, we can still use PARAVIRT_UNFAIR_LOCKS
with XEN.

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 5/8] pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest
  2014-02-27  9:41   ` Paolo Bonzini
  2014-02-27 19:05     ` Waiman Long
@ 2014-02-27 19:05     ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27 19:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 02/27/2014 04:41 AM, Paolo Bonzini wrote:
> Il 26/02/2014 16:14, Waiman Long ha scritto:
>> This patch adds a KVM init function to activate the unfair queue
>> spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config
>> option is selected.
>>
>> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
>> ---
>>  arch/x86/kernel/kvm.c |   17 +++++++++++++++++
>>  1 files changed, 17 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 713f1b3..a489140 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void)
>>  early_initcall(kvm_spinlock_init_jump);
>>
>>  #endif    /* CONFIG_PARAVIRT_SPINLOCKS */
>> +
>> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
>> +/*
>> + * Enable unfair lock if running in a real para-virtualized environment
>> + */
>> +static __init int kvm_unfair_locks_init_jump(void)
>> +{
>> +    if (!kvm_para_available())
>> +        return 0;
>> +
>> +    static_key_slow_inc(&paravirt_unfairlocks_enabled);
>> +    printk(KERN_INFO "KVM setup unfair spinlock\n");
>> +
>> +    return 0;
>> +}
>> +early_initcall(kvm_unfair_locks_init_jump);
>> +#endif
>>
>
> I think this should apply to all paravirt implementations, unless 
> pv_lock_ops.kick_cpu != NULL.
>
> Paolo

The unfair lock is currently implemented as an independent config
option that can be turned on or off irrespective of the other PV
settings. There are concerns about lock starvation if there is a large
number of virtual CPUs. So one idea that I have is to disable this
feature if more than a certain number of virtual CPUs are available. I
will investigate this idea when I have time.
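
A minimal sketch of that idea (the threshold value and its name are
made up for illustration; num_possible_cpus() is the real interface):

	#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
	/* Illustrative cutoff, not from the patch series */
	#define UNFAIR_LOCKS_MAX_VCPUS	32

	static __init int kvm_unfair_locks_init_jump(void)
	{
		if (!kvm_para_available())
			return 0;

		/* Too many vCPUs: the starvation risk outweighs the benefit */
		if (num_possible_cpus() > UNFAIR_LOCKS_MAX_VCPUS)
			return 0;

		static_key_slow_inc(&paravirt_unfairlocks_enabled);
		printk(KERN_INFO "KVM setup unfair spinlock\n");
		return 0;
	}
	early_initcall(kvm_unfair_locks_init_jump);
	#endif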

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 5/8] pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest
  2014-02-27 10:40   ` Raghavendra K T
  2014-02-27 19:12     ` Waiman Long
@ 2014-02-27 19:12     ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27 19:12 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Jeremy Fitzhardinge, Peter Zijlstra, virtualization, Andi Kleen,
	H. Peter Anvin, Michel Lespinasse, Alok Kataria, linux-arch, x86,
	Ingo Molnar, Scott J Norton, xen-devel, Paul E. McKenney,
	Alexander Fyodorov, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Daniel J Blueman, Oleg Nesterov,
	Steven Rostedt, Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu

On 02/27/2014 05:40 AM, Raghavendra K T wrote:
> On 02/26/2014 08:44 PM, Waiman Long wrote:
>> This patch adds a KVM init function to activate the unfair queue
>> spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config
>> option is selected.
>>
>> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
>> ---
>>   arch/x86/kernel/kvm.c |   17 +++++++++++++++++
>>   1 files changed, 17 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 713f1b3..a489140 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void)
>>   early_initcall(kvm_spinlock_init_jump);
>>
>>   #endif    /* CONFIG_PARAVIRT_SPINLOCKS */
>> +
>> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
>> +/*
>> + * Enable unfair lock if running in a real para-virtualized environment
>> + */
>> +static __init int kvm_unfair_locks_init_jump(void)
>> +{
>> +    if (!kvm_para_available())
>> +        return 0;
>> +
>
> kvm_kick_cpu_type() in patch 8 assumes that host has support for kick
> hypercall (KVM_HC_KICK_CPU).
>
> I think for that we need explicit check of this 
> kvm_para_has_feature(KVM_FEATURE_PV_UNHALT).
>
> otherwise things may break for unlikely case of running a new guest on
> a old host?
>

The unfair lock is a separate config option that does not need to do
any CPU kick. The kvm_para_available() check is just to make sure that
the kernel is running in a real PV environment, as opposed to running
on bare metal with CONFIG_PARAVIRT enabled.
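
For the kick path that patch 8 does add, the guard being asked for
would look roughly like this (a sketch; the function name is
illustrative, while kvm_para_has_feature() and KVM_FEATURE_PV_UNHALT
are the existing KVM guest interfaces):

	static void __init kvm_pv_qspinlock_init(void)
	{
		if (!kvm_para_available())
			return;		/* not a KVM guest at all */

		/* Old host without KVM_HC_KICK_CPU: stay on the non-PV path */
		if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
			return;

		/* ... safe to wire up the kick_cpu hook from patch 8 here ... */
	}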

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  2014-02-27 12:28   ` David Vrabel
@ 2014-02-27 19:40     ` Waiman Long
  2014-02-27 19:40     ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27 19:40 UTC (permalink / raw)
  To: David Vrabel
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 02/27/2014 07:28 AM, David Vrabel wrote:
> On 26/02/14 15:14, Waiman Long wrote:
>> Locking is always an issue in a virtualized environment as the virtual
>> CPU that is waiting on a lock may get scheduled out and hence block
>> any progress in lock acquisition even when the lock has been freed.
>>
>> One solution to this problem is to allow unfair lock in a
>> para-virtualized environment. In this case, a new lock acquirer can
>> come and steal the lock if the next-in-line CPU to get the lock is
>> scheduled out. Unfair lock in a native environment is generally not a
>> good idea as there is a possibility of lock starvation for a heavily
>> contended lock.
> I'm not sure I'm keen on losing the fairness in PV environment.  I'm
> concerned that on an over-committed host, the lock starvation problem
> will be particularly bad.
>
> But I'll have to revist this once a non-broken PV qspinlock
> implementation exists (or someone explains how the proposed one works).
>
> David

On second thought, the unfair qspinlock may not be as bad as other
unfair locks. Basically, a task gets one chance to steal the lock. If
it can't, it goes to the back of the queue and waits for its turn. So
unless a single CPU can monopolize the lock by acquiring it again
immediately after release, all the tasks queuing up will eventually get
their chance at the lock.
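
The acquire path being described is roughly the following (an
illustrative sketch of the behaviour, not the code from patches 4/5;
the qlcode field name is an assumption):

	static inline void unfair_spin_lock(struct qspinlock *lock)
	{
		/* One chance to grab the lock out of turn */
		if (likely(queue_spin_trylock(lock)))
			return;

		/* Didn't get it: join the back of the queue and wait our turn */
		queue_spin_lock_slowpath(lock, atomic_read(&lock->qlcode));
	}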

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-27 14:45         ` Paolo Bonzini
                             ` (2 preceding siblings ...)
  2014-02-27 19:42           ` Waiman Long
@ 2014-02-27 19:42           ` Waiman Long
  3 siblings, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27 19:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, x86, Peter Zijlstra, virtualization,
	Andi Kleen, H. Peter Anvin, Michel Lespinasse, Alok Kataria,
	linux-arch, Raghavendra K T, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 02/27/2014 09:45 AM, Paolo Bonzini wrote:
> Il 27/02/2014 15:18, David Vrabel ha scritto:
>> On 27/02/14 13:11, Paolo Bonzini wrote:
>>> Il 27/02/2014 13:11, David Vrabel ha scritto:
>>>>>> This patch adds para-virtualization support to the queue spinlock 
>>>>>> code
>>>>>> by enabling the queue head to kick the lock holder CPU, if known,
>>>>>> in when the lock isn't released for a certain amount of time. It
>>>>>> also enables the mutual monitoring of the queue head CPU and the
>>>>>> following node CPU in the queue to make sure that their CPUs will
>>>>>> stay scheduled in.
>>>> I'm not really understanding how this is supposed to work.  There
>>>> appears to be an assumption that a guest can keep one of its VCPUs
>>>> running by repeatedly kicking it?  This is not possible under Xen 
>>>> and I
>>>> doubt it's possible under KVM or any other hypervisor.
>>>
>>> KVM allows any VCPU to wake up a currently halted VCPU of its choice,
>>> see Documentation/virtual/kvm/hypercalls.txt.
>>
>> But neither of the VCPUs being kicked here are halted -- they're either
>> running or runnable (descheduled by the hypervisor).
>
> /me actually looks at Waiman's code...
>
> Right, this is really different from pvticketlocks, where the *unlock* 
> primitive wakes up a sleeping VCPU.  It is more similar to PLE 
> (pause-loop exiting).
>
> Paolo

Yes, it is mostly to deal with vCPUs that are not running because of PLE.

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-26 16:22   ` Peter Zijlstra
@ 2014-02-27 20:25     ` Waiman Long
  2014-02-27 20:25     ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27 20:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 02/26/2014 11:22 AM, Peter Zijlstra wrote:
> On Wed, Feb 26, 2014 at 10:14:21AM -0500, Waiman Long wrote:
>
>> +struct qnode {
>> +	u32		 wait;		/* Waiting flag		*/
>> +	struct qnode	*next;		/* Next queue node addr */
>> +};
>> +
>> +struct qnode_set {
>> +	struct qnode	nodes[MAX_QNODES];
>> +	int		node_idx;	/* Current node to use */
>> +};
>> +
>> +/*
>> + * Per-CPU queue node structures
>> + */
>> +static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { {{0}}, 0 };
> So I've not yet wrapped my head around any of this; and I see a later
> patch adds some paravirt gunk to this, but it does blow you can't keep
> it a single cacheline for the sane case.

There is a 4-byte hole in the qnode structure on x86-64. I did try to
make the additional PV fields use only 4 bytes so that there is no
increase in the size of the qnode structure unless we need to support
16K CPUs or more.
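
Concretely, on x86-64 the layout looks like this (the PV field name
below is illustrative):

	struct qnode {
		u32		wait;		/* 4 bytes */
	#ifdef CONFIG_PARAVIRT_SPINLOCKS
		u32		pv_state;	/* fills the 4-byte padding hole */
	#endif					/* before the 8-byte-aligned pointer */
		struct qnode	*next;		/* 8 bytes */
	};					/* sizeof() == 16 either way */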

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-26 16:24   ` Peter Zijlstra
  2014-02-27 20:25     ` Waiman Long
@ 2014-02-27 20:25     ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27 20:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 02/26/2014 11:24 AM, Peter Zijlstra wrote:
> On Wed, Feb 26, 2014 at 10:14:21AM -0500, Waiman Long wrote:
>> +static void put_qnode(void)
>> +{
>> +	struct qnode_set *qset = this_cpu_ptr(&qnset);
>> +
>> +	qset->node_idx--;
>> +}
> That very much wants to be: this_cpu_dec().

Yes, I will change it to use this_cpu_dec().
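
I.e. something like:

	static void put_qnode(void)
	{
		/* Single per-cpu decrement, no need to compute the pointer first */
		this_cpu_dec(qnset.node_idx);
	}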

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-26 16:20   ` Peter Zijlstra
@ 2014-02-27 20:42     ` Waiman Long
  2014-02-28  9:29       ` Peter Zijlstra
  2014-02-28  9:29       ` Peter Zijlstra
  2014-02-27 20:42     ` Waiman Long
  1 sibling, 2 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27 20:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 02/26/2014 11:20 AM, Peter Zijlstra wrote:
> You don't happen to have a proper state diagram for this thing do you?
>
> I suppose I'm going to have to make one; this is all getting a bit
> unwieldy, and those xchg() + fixup things are hard to read.

I don't have a state diagram on hand, but I will add more comments to 
describe the 4 possible cases and how to handle them.

>
> On Wed, Feb 26, 2014 at 10:14:23AM -0500, Waiman Long wrote:
>> +static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
>> +{
>> +	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
>> +	u16		     old;
>> +
>> +	/*
>> +	 * Fall into the quick spinning code path only if no one is waiting
>> +	 * or the lock is available.
>> +	 */
>> +	if (unlikely((qsval != _QSPINLOCK_LOCKED)&&
>> +		     (qsval != _QSPINLOCK_WAITING)))
>> +		return 0;
>> +
>> +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
>> +
>> +	if (old == 0) {
>> +		/*
>> +		 * Got the lock, can clear the waiting bit now
>> +		 */
>> +		smp_u8_store_release(&qlock->wait, 0);
>
> So we just did an atomic op, and now you're trying to optimize this
> write. Why do you need a whole byte for that?
>
> Surely a cmpxchg loop with the right atomic op can't be _that_ much
> slower? Its far more readable and likely avoids that steal fail below as
> well.

At low contention levels, atomic operations that require a lock prefix
are the major contributor to the total execution time. I have seen
estimates online that executing a lock-prefixed instruction can easily
take 50X longer than a regular instruction that can be pipelined. That
is why I try to do it with as few lock-prefixed instructions as
possible. If I have to do an atomic cmpxchg, it probably won't be
faster than the regular qspinlock slowpath.

Given that speed at low contention, which is the common case, is
important for getting this patch accepted, I have to do what I can to
make it run as fast as possible in this 2-contending-task case.
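
For reference, the cmpxchg-based alternative being weighed here would
look something like this sketch; it trades the plain byte store for one
more lock-prefixed instruction, which is exactly the cost in question:

	static inline void clear_waiting_cmpxchg(union arch_qspinlock *qlock)
	{
		u16 old, val;

		do {
			old = ACCESS_ONCE(qlock->lock_wait);
			val = old & ~_QSPINLOCK_WAITING;
		} while (cmpxchg(&qlock->lock_wait, old, val) != old);
	}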

>> +		return 1;
>> +	} else if (old == _QSPINLOCK_LOCKED) {
>> +try_again:
>> +		/*
>> +		 * Wait until the lock byte is cleared to get the lock
>> +		 */
>> +		do {
>> +			cpu_relax();
>> +		} while (ACCESS_ONCE(qlock->lock));
>> +		/*
>> +		 * Set the lock bit&  clear the waiting bit
>> +		 */
>> +		if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
>> +			   _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
>> +			return 1;
>> +		/*
>> +		 * Someone has steal the lock, so wait again
>> +		 */
>> +		goto try_again;
> That's just a fail.. steals should not ever be allowed. It's a fair lock
> after all.

The code is unfair, but this unfairness helps it run faster than the
ticket spinlock in this particular case, while the regular qspinlock
slowpath remains fair. A little bit of unfairness in this particular
case helps its speed.

>> +	} else if (old == _QSPINLOCK_WAITING) {
>> +		/*
>> +		 * Another task is already waiting while it steals the lock.
>> +		 * A bit of unfairness here won't change the big picture.
>> +		 * So just take the lock and return.
>> +		 */
>> +		return 1;
>> +	}
>> +	/*
>> +	 * Nothing need to be done if the old value is
>> +	 * (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
>> +	 */
>> +	return 0;
>> +}
>
>
>
>> @@ -296,6 +478,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>>   		return;
>>   	}
>>
>> +#ifdef queue_code_xchg
>> +	prev_qcode = queue_code_xchg(lock, my_qcode);
>> +#else
>>   	/*
>>   	 * Exchange current copy of the queue node code
>>   	 */
>> @@ -329,6 +514,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>>   	} else
>>   		prev_qcode&= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
>>   	my_qcode&= ~_QSPINLOCK_LOCKED;
>> +#endif /* queue_code_xchg */
>>
>>   	if (prev_qcode) {
>>   		/*
> That's just horrible.. please just make the entire #else branch another
> version of that same queue_code_xchg() function.

OK, I will wrap it in another function.

Regards,
Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-27 15:22           ` Raghavendra K T
                               ` (2 preceding siblings ...)
  2014-02-27 20:50             ` Waiman Long
@ 2014-02-27 20:50             ` Waiman Long
  3 siblings, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27 20:50 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Jeremy Fitzhardinge, Peter Zijlstra, virtualization, Andi Kleen,
	H. Peter Anvin, Michel Lespinasse, Alok Kataria, linux-arch, x86,
	Ingo Molnar, Scott J Norton, xen-devel, Paul E. McKenney,
	Alexander Fyodorov, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Daniel J Blueman, Oleg Nesterov,
	Steven Rostedt, Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu

On 02/27/2014 10:22 AM, Raghavendra K T wrote:
> On 02/27/2014 08:15 PM, Paolo Bonzini wrote:
> [...]
>>> But neither of the VCPUs being kicked here are halted -- they're either
>>> running or runnable (descheduled by the hypervisor).
>>
>> /me actually looks at Waiman's code...
>>
>> Right, this is really different from pvticketlocks, where the *unlock*
>> primitive wakes up a sleeping VCPU.  It is more similar to PLE
>> (pause-loop exiting).
>
> Adding to the discussion, I see there are two possibilities here,
> considering that in undercommit cases we should not exceed
> HEAD_SPIN_THRESHOLD,
>
> 1. the looping vcpu in pv_head_spin_check() should do halt()
> considering that we have done enough spinning (more than typical
> lock-hold time), and hence we are in potential overcommit.
>
> 2. multiplex kick_cpu to do directed yield in qspinlock case.
> But this may result in some ping ponging?
>
>
>

In the current code, the lock holder can't easily locate the CPU number 
of the queue head when in the unlock path. That is why I try to keep the 
queue head alive as long as possible so that it can take over when the 
lock is freed. I am trying out new code that lets the waiting CPUs other 
than the first two go into halt, to see if that will help the overcommit 
case.

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-27 20:42     ` Waiman Long
  2014-02-28  9:29       ` Peter Zijlstra
@ 2014-02-28  9:29       ` Peter Zijlstra
  2014-02-28 16:25         ` Linus Torvalds
                           ` (3 more replies)
  1 sibling, 4 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-02-28  9:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote:
> >>+	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
> >>+
> >>+	if (old == 0) {
> >>+		/*
> >>+		 * Got the lock, can clear the waiting bit now
> >>+		 */
> >>+		smp_u8_store_release(&qlock->wait, 0);
> >
> >So we just did an atomic op, and now you're trying to optimize this
> >write. Why do you need a whole byte for that?
> >
> >Surely a cmpxchg loop with the right atomic op can't be _that_ much
> >slower? Its far more readable and likely avoids that steal fail below as
> >well.
> 
> At low contention level, atomic operations that requires a lock prefix are
> the major contributor to the total execution times. I saw estimate online
> that the time to execute a lock prefix instruction can easily be 50X longer
> than a regular instruction that can be pipelined. That is why I try to do it
> with as few lock prefix instructions as possible. If I have to do an atomic
> cmpxchg, it probably won't be faster than the regular qspinlock slowpath.

At low contention the cmpxchg won't have to be retried (much) so using
it won't be a problem and you get to have arbitrary atomic ops.

> Given that speed at low contention level which is the common case is
> important to get this patch accepted, I have to do what I can to make it run
> as far as possible for this 2 contending task case.

What I'm saying is that you can do the whole thing with a single
cmpxchg. No extra ops needed. And at that point you don't need a whole
byte, you can use a single bit.

that removes the whole NR_CPUS dependent logic.
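
Roughly what I mean, as a userspace sketch with GCC builtins and made-up 
constants (so not a drop-in for your patch): every transition is a single 
cmpxchg, pending is one bit, and a failed cmpxchg leaves nothing to fix 
up afterwards.

#include <stdint.h>

#define QL_LOCKED	0x01U
#define QL_PENDING	0x02U	/* a bit, not a byte */

/* returns 1 if we own the lock, 2 if we are the single pending waiter */
static inline int trylock_or_pend(uint32_t *lock)
{
	uint32_t old = 0;

	/* 0 -> LOCKED : uncontended acquire */
	if (__atomic_compare_exchange_n(lock, &old, QL_LOCKED, 0,
					__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return 1;

	/* LOCKED -> LOCKED|PENDING : become the one spinning waiter */
	if (old == QL_LOCKED &&
	    __atomic_compare_exchange_n(lock, &old, QL_LOCKED | QL_PENDING, 0,
					__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return 2;

	return 0;	/* lock and pending both taken: go queue up */
}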

> >>+		/*
> >>+		 * Someone has steal the lock, so wait again
> >>+		 */
> >>+		goto try_again;

> >That's just a fail.. steals should not ever be allowed. It's a fair lock
> >after all.
> 
> The code is unfair, but this unfairness help it to run faster than ticket
> spinlock in this particular case. And the regular qspinlock slowpath is
> fair. A little bit of unfairness in this particular case helps its speed.

*groan*, no, unfairness not cool. ticket lock is absolutely fair; we
should preserve this.

BTW; can you share your benchmark thingy? 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-28  9:29       ` Peter Zijlstra
@ 2014-02-28 16:25         ` Linus Torvalds
  2014-02-28 17:37           ` Peter Zijlstra
  2014-02-28 17:37           ` Peter Zijlstra
  2014-02-28 16:25         ` Linus Torvalds
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 125+ messages in thread
From: Linus Torvalds @ 2014-02-28 16:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Aswin Chandramouleeswaran,
	Andi Kleen, Peter Anvin, Michel Lespinasse, Alok Kataria,
	linux-arch, the arch/x86 maintainers, Ingo Molnar,
	Daniel J Blueman, xen-devel, Paul McKenney, Alexander Fyodorov,
	Rik van Riel, Arnd Bergmann, Konrad Rzeszutek Wilk,
	Scott J Norton, Steven Rostedt, Chris Wright, George Spelvin,
	Boris Ostrovsky, virtualization


[-- Attachment #1.1: Type: text/plain, Size: 402 bytes --]

On Feb 28, 2014 1:30 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>
> At low contention the cmpxchg won't have to be retried (much) so using
> it won't be a problem and you get to have arbitrary atomic ops.

Peter, the difference between an atomic op and *no* atomic op is huge.

And Waiman posted numbers for the optimization. Why do you argue with
handwaving and against numbers?

       Linus

[-- Attachment #1.2: Type: text/html, Size: 579 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-28  9:29       ` Peter Zijlstra
                           ` (2 preceding siblings ...)
  2014-02-28 16:38         ` Waiman Long
@ 2014-02-28 16:38         ` Waiman Long
  2014-02-28 17:56           ` Peter Zijlstra
                             ` (3 more replies)
  3 siblings, 4 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-28 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 3245 bytes --]

On 02/28/2014 04:29 AM, Peter Zijlstra wrote:
> On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote:
>>>> +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
>>>> +
>>>> +	if (old == 0) {
>>>> +		/*
>>>> +		 * Got the lock, can clear the waiting bit now
>>>> +		 */
>>>> +		smp_u8_store_release(&qlock->wait, 0);
>>> So we just did an atomic op, and now you're trying to optimize this
>>> write. Why do you need a whole byte for that?
>>>
>>> Surely a cmpxchg loop with the right atomic op can't be _that_ much
>>> slower? Its far more readable and likely avoids that steal fail below as
>>> well.
>> At low contention level, atomic operations that requires a lock prefix are
>> the major contributor to the total execution times. I saw estimate online
>> that the time to execute a lock prefix instruction can easily be 50X longer
>> than a regular instruction that can be pipelined. That is why I try to do it
>> with as few lock prefix instructions as possible. If I have to do an atomic
>> cmpxchg, it probably won't be faster than the regular qspinlock slowpath.
> At low contention the cmpxchg won't have to be retried (much) so using
> it won't be a problem and you get to have arbitrary atomic ops.
>
>> Given that speed at low contention level which is the common case is
>> important to get this patch accepted, I have to do what I can to make it run
>> as far as possible for this 2 contending task case.
> What I'm saying is that you can do the whole thing with a single
> cmpxchg. No extra ops needed. And at that point you don't need a whole
> byte, you can use a single bit.
>
> that removes the whole NR_CPUS dependent logic.

After modifying it to do a deterministic cmpxchg, the test run time of 2 
contending tasks jumped up from 600ms (best case) to about 1700ms, which 
is worse than the original qspinlock's 1300-1500ms. It is the 
opportunistic nature of the xchg() code, which can combine multiple 
steps of the deterministic atomic sequence, that saves time. Without 
that, I would prefer going back to the basic qspinlock queuing sequence 
for 2 contending tasks.

Please take a look at the performance data in my patch 3 to see if the 
slowdown at 2 and 3 contending tasks are acceptable or not.

The reason I need a whole byte for the lock bit is the simplicity of the 
unlock code: the lock holder just assigns 0 to the lock byte. Utilizing 
other bits in the low byte for other purposes would complicate the 
unlock path and slow down the no-contention case.
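
To illustrate the point, here is a userspace sketch with GCC builtins; 
the layout below is just the idea, not the exact structure in the patch:

#include <stdint.h>

struct qsl_sketch {
	union {
		struct {
			uint8_t  lock;	/* the lock owns the whole low byte */
			uint8_t  wait;
			uint16_t qcode;
		};
		uint32_t val;
	};
};

/* byte-wide lock field: unlock is one plain release store, no LOCK prefix */
static inline void unlock_byte(struct qsl_sketch *l)
{
	__atomic_store_n(&l->lock, 0, __ATOMIC_RELEASE);
}

/* if other state shared that byte, unlock would need a LOCKed read-modify-write */
static inline void unlock_shared_byte(struct qsl_sketch *l)
{
	__atomic_fetch_and(&l->val, ~(uint32_t)0x01, __ATOMIC_RELEASE);
}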

>>>> +		/*
>>>> +		 * Someone has steal the lock, so wait again
>>>> +		 */
>>>> +		goto try_again;
>>> That's just a fail.. steals should not ever be allowed. It's a fair lock
>>> after all.
>> The code is unfair, but this unfairness help it to run faster than ticket
>> spinlock in this particular case. And the regular qspinlock slowpath is
>> fair. A little bit of unfairness in this particular case helps its speed.
> *groan*, no, unfairness not cool. ticket lock is absolutely fair; we
> should preserve this.

We can preserve that by removing patch 3.

> BTW; can you share your benchmark thingy?

I have attached the test program that I used to generate the timing data 
for patch 3.

-Longman



[-- Attachment #2: locktest.tar.gz --]
[-- Type: application/x-gzip, Size: 5244 bytes --]

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support
  2014-02-26 17:00 ` [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with " Konrad Rzeszutek Wilk
@ 2014-02-28 16:56   ` Waiman Long
  2014-02-28 16:56   ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-28 16:56 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Marcos Matsunaga, Andi Kleen, H. Peter Anvin,
	Michel Lespinasse, Alok Kataria, linux-arch, x86, Ingo Molnar,
	Scott J Norton, xen-devel, Paul E. McKenney, Alexander Fyodorov,
	Rik van Riel, Arnd Bergmann, Daniel J Blueman, Oleg Nesterov,
	Steven Rostedt, Chris Wright, George Spelvin, Thomas Gleixner

On 02/26/2014 12:00 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Feb 26, 2014 at 10:14:20AM -0500, Waiman Long wrote:
> It should be fairly easy. You just need to implement the kick right?
> An IPI should be all that is needed - look in xen_unlock_kick. The
> rest of the spinlock code is all generic that is shared between
> KVM, Xen and baremetal.
>
> You should be able to run all of this under an HVM guests as well - as
> in you don't need a pure PV guest to use the PV ticketlocks.
>
> An easy way to install/run this is to install your latest distro,
> do 'yum install xen' or 'apt-get install xen'. Reboot and you
> are under Xen. Launch guests, etc with your favorite virtualization
> toolstack.

Thanks for the tip. I will try that out.

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  2014-02-26 17:07   ` Konrad Rzeszutek Wilk
  2014-02-28 17:06     ` Waiman Long
@ 2014-02-28 17:06     ` Waiman Long
  2014-03-03 10:55       ` Paolo Bonzini
  2014-03-03 10:55       ` Paolo Bonzini
  1 sibling, 2 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-28 17:06 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Daniel J Blueman, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Cheg

On 02/26/2014 12:07 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Feb 26, 2014 at 10:14:24AM -0500, Waiman Long wrote:
>> Locking is always an issue in a virtualized environment as the virtual
>> CPU that is waiting on a lock may get scheduled out and hence block
>> any progress in lock acquisition even when the lock has been freed.
>>
>> One solution to this problem is to allow unfair lock in a
>> para-virtualized environment. In this case, a new lock acquirer can
>> come and steal the lock if the next-in-line CPU to get the lock is
>> scheduled out. Unfair lock in a native environment is generally not a
> Hmm, how do you know if the 'next-in-line CPU' is scheduled out? As
> in the hypervisor knows - but you as a guest might have no idea
> of it.

I use a heart-beat counter to see if the other side responds within a 
certain time limit. If not, I assume it has been scheduled out, probably 
due to PLE.
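
Roughly like this (an illustrative sketch only -- the threshold, the 
helper name and where the counter lives are simplifications of what the 
patch actually does):

#include <stdint.h>

#define HEARTBEAT_SPIN_THRESHOLD	(1U << 12)	/* assumed value */

/* the queue head increments *heartbeat while it is spinning */
static int next_in_line_looks_preempted(uint32_t *heartbeat)
{
	uint32_t seen = __atomic_load_n(heartbeat, __ATOMIC_RELAXED);
	uint32_t spins;

	for (spins = 0; spins < HEARTBEAT_SPIN_THRESHOLD; spins++) {
		__builtin_ia32_pause();		/* cpu_relax() stand-in, x86 only */
		if (__atomic_load_n(heartbeat, __ATOMIC_RELAXED) != seen)
			return 0;	/* it responded within the limit */
	}
	return 1;	/* no heart-beat observed: assume it was scheduled out */
}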

>> good idea as there is a possibility of lock starvation for a heavily
>> contended lock.
> Should this then detect whether it is running under a virtualization
> and only then activate itself? And when run under baremetal don't enable?

Yes, the unfair lock should only be enabled when running as a 
para-virtualized guest. A jump label (static key) is used for this 
purpose and will be enabled by the appropriate KVM or Xen code.

>> This patch add a new configuration option for the x86
>> architecture to enable the use of unfair queue spinlock
>> (PARAVIRT_UNFAIR_LOCKS) in a real para-virtualized guest. A jump label
>> (paravirt_unfairlocks_enabled) is used to switch between a fair and
>> an unfair version of the spinlock code. This jump label will only be
>> enabled in a real PV guest.
> As opposed to fake PV guest :-) I think you can remove the 'real'.

Yes, you are right. I will remove that in the next series.

>
>> Enabling this configuration feature decreases the performance of an
>> uncontended lock-unlock operation by about 1-2%.
> Presumarily on baremetal right?

Enabling the unfair lock adds additional code which has a slight 
performance penalty of 1-2% on both bare metal and virtualized guests.

>> +/**
>> + * arch_spin_lock - acquire a queue spinlock
>> + * @lock: Pointer to queue spinlock structure
>> + */
>> +static inline void arch_spin_lock(struct qspinlock *lock)
>> +{
>> +	if (static_key_false(&paravirt_unfairlocks_enabled)) {
>> +		queue_spin_lock_unfair(lock);
>> +		return;
>> +	}
>> +	queue_spin_lock(lock);
> What happens when you are booting and you are in the middle of using a
> ticketlock (say you are waiting for it and your are in the slow-path)
>   and suddenly the unfairlocks_enabled is turned on.

The static key will only be changed in the early boot period, which 
presumably doesn't need to use spinlocks. This static key is initialized 
in the same way as the PV ticketlock's static key, which has the same 
problem that you mentioned.

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 5/8] pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest
  2014-02-26 17:08   ` Konrad Rzeszutek Wilk
  2014-02-28 17:08     ` Waiman Long
@ 2014-02-28 17:08     ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-28 17:08 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Daniel J Blueman, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Cheg

On 02/26/2014 12:08 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Feb 26, 2014 at 10:14:25AM -0500, Waiman Long wrote:
>> This patch adds a KVM init function to activate the unfair queue
>> spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config
>> option is selected.
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
>> ---
>>   arch/x86/kernel/kvm.c |   17 +++++++++++++++++
>>   1 files changed, 17 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 713f1b3..a489140 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void)
>>   early_initcall(kvm_spinlock_init_jump);
>>
>>   #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
>> +
>> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
>> +/*
>> + * Enable unfair lock if running in a real para-virtualized environment
>> + */
>> +static __init int kvm_unfair_locks_init_jump(void)
>> +{
>> +	if (!kvm_para_available())
>> +		return 0;
> I think you also need to check for !kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)?
> Otherwise you might enable this, but the kicker function won't be
> enabled.

The unfair lock code doesn't need to use the CPU kicker function. That 
is why the KVM_FEATURE_PV_UNHALT feature is not checked.

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-28 16:25         ` Linus Torvalds
  2014-02-28 17:37           ` Peter Zijlstra
@ 2014-02-28 17:37           ` Peter Zijlstra
  1 sibling, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-02-28 17:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Aswin Chandramouleeswaran,
	Andi Kleen, Peter Anvin, Michel Lespinasse, Alok Kataria,
	linux-arch, the arch/x86 maintainers, Ingo Molnar,
	Daniel J Blueman, xen-devel, Paul McKenney, Alexander Fyodorov,
	Rik van Riel, Arnd Bergmann, Konrad Rzeszutek Wilk,
	Scott J Norton, Steven Rostedt, Chris Wright, George Spelvin,
	Boris Ostrovsky, virtualization

On Fri, Feb 28, 2014 at 08:25:24AM -0800, Linus Torvalds wrote:
> On Feb 28, 2014 1:30 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
> >
> > At low contention the cmpxchg won't have to be retried (much) so using
> > it won't be a problem and you get to have arbitrary atomic ops.
> 
> Peter, the difference between an atomic op and *no* atomic op is huge.

I know, I'm just asking what the difference is between the xchg() - an
atomic op - and a cmpxchg(), also an atomic op.

The xchg() makes the entire thing somewhat difficult, needing to fix up
all kinds of states if we guessed wrong about what was in the variables.

> And Waiman posted numbers for the optimization. Why do you argue with
> handwaving and against numbers?

I've asked for his benchmark.. 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-28 16:38         ` Waiman Long
@ 2014-02-28 17:56           ` Peter Zijlstra
  2014-02-28 17:56           ` Peter Zijlstra
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-02-28 17:56 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner



> After modifying it to do a deterministic cmpxchg, the test run time of 2
> contending tasks jumps up from 600ms (best case) to about 1700ms which was
> worse than the original qspinlock's 1300-1500ms. It is the opportunistic
> nature of the xchg() code that can potentially combine multiple steps in the
> deterministic atomic sequence which can saves time. Without that, I would
> rather prefer going back to the basic qspinlock queuing sequence for 2
> contending tasks.
> 
> Please take a look at the performance data in my patch 3 to see if the
> slowdown at 2 and 3 contending tasks are acceptable or not.

Right; so I've gone back to a simple version (~200 lines) that's fairly
easy to comprehend (to me, since I wrote it). And will now try to see if
I can find the same state transitions in your code.

I find your code somewhat hard to follow; mostly due to those xchg() +
fixup thingies. But I'll get there.

> The reason why I need a whole byte for the lock bit is because of the simple
> unlock code of assigning 0 to the lock byte by the lock holder. Utilizing
> other bits in the low byte for other purpose will complicate the unlock path
> and slow down the no-contention case.

Yeah, I get why you need a whole byte for the lock part; I was asking if
we really need another whole byte for the pending part.

So in my version I think I see an optimization case where this is indeed
useful and I can trade an atomic op for a write barrier, which should be
a big win.

It just wasn't at all clear (to me) from your code.

(I also think the optimization isn't x86 specific)
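
The case I mean looks roughly like this (assumed layout, sketch only): 
with pending in its own byte, the pending waiter can take the lock over 
with plain stores ordered by a release, where a pending *bit* sharing 
the word would force a LOCKed read-modify-write:

#include <stdint.h>

struct qsl_sketch {
	union {
		struct {
			uint8_t  locked;
			uint8_t  pending;	/* pending as a byte */
			uint16_t tail;
		};
		uint32_t val;
	};
};

/* pending-byte case: two plain stores, the release orders them */
static inline void takeover_pending_byte(struct qsl_sketch *l)
{
	__atomic_store_n(&l->locked, 1, __ATOMIC_RELAXED);
	__atomic_store_n(&l->pending, 0, __ATOMIC_RELEASE);
}

/* pending-bit case: clear pending and set locked in one atomic op */
static inline void takeover_pending_bit(uint32_t *val)
{
	/* assumed encoding: locked = 0x01, pending = 0x02 in the same word */
	__atomic_fetch_xor(val, 0x01U | 0x02U, __ATOMIC_ACQUIRE);
}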

> >>The code is unfair, but this unfairness help it to run faster than ticket
> >>spinlock in this particular case. And the regular qspinlock slowpath is
> >>fair. A little bit of unfairness in this particular case helps its speed.

> >*groan*, no, unfairness not cool. ticket lock is absolutely fair; we
> >should preserve this.
> 
> We can preserve that by removing patch 3.

I've got a version that does the pending thing and still is entirely
fair.

I don't think the concept of the extra spinner is incompatible with
fairness.

> >BTW; can you share your benchmark thingy?
> 
> I have attached the test program that I used to generate the timing data for
> patch 3.

Thanks, I'll have a play.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  2014-02-28 17:06     ` Waiman Long
@ 2014-03-03 10:55       ` Paolo Bonzini
  2014-03-04 15:15         ` Waiman Long
  2014-03-04 15:15         ` Waiman Long
  2014-03-03 10:55       ` Paolo Bonzini
  1 sibling, 2 replies; 125+ messages in thread
From: Paolo Bonzini @ 2014-03-03 10:55 UTC (permalink / raw)
  To: Waiman Long, Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Daniel J Blueman, Oleg Nesterov, Steven Rostedt,
	Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Cheg

Il 28/02/2014 18:06, Waiman Long ha scritto:
> On 02/26/2014 12:07 PM, Konrad Rzeszutek Wilk wrote:
>> On Wed, Feb 26, 2014 at 10:14:24AM -0500, Waiman Long wrote:
>>> Locking is always an issue in a virtualized environment as the virtual
>>> CPU that is waiting on a lock may get scheduled out and hence block
>>> any progress in lock acquisition even when the lock has been freed.
>>>
>>> One solution to this problem is to allow unfair lock in a
>>> para-virtualized environment. In this case, a new lock acquirer can
>>> come and steal the lock if the next-in-line CPU to get the lock is
>>> scheduled out. Unfair lock in a native environment is generally not a
>> Hmm, how do you know if the 'next-in-line CPU' is scheduled out? As
>> in the hypervisor knows - but you as a guest might have no idea
>> of it.
>
> I use a heart-beat counter to see if the other side responses within a
> certain time limit. If not, I assume it has been scheduled out probably
> due to PLE.

PLE is unnecessary if you have "true" pv spinlocks where the 
next-in-line schedules itself out with a hypercall (Xen) or hlt 
instruction (KVM).  Set a bit in the qspinlock before going to sleep, 
and the lock owner will know that it needs to kick the next-in-line.
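
In pseudo-C, the protocol would look roughly like this (the helper names 
are stand-ins, not the existing pv hook API, and the sleep bit is an 
assumption):

#include <stdint.h>

#define QL_LOCKED	0x01U
#define QL_SLEEPING	0x02U	/* assumed bit, illustration only */

/* stand-ins for the real mechanisms: hlt/hypercall and a wakeup kick */
static void hv_halt(int cpu) { (void)cpu; /* e.g. hlt or SCHEDOP_block */ }
static void hv_kick(int cpu) { (void)cpu; /* e.g. wakeup IPI */ }

/* next-in-line: publish that we are going to sleep, then halt until kicked */
static void pv_wait_for_lock(uint32_t *lock, int me)
{
	while (__atomic_load_n(lock, __ATOMIC_ACQUIRE) & QL_LOCKED) {
		__atomic_fetch_or(lock, QL_SLEEPING, __ATOMIC_ACQUIRE);
		/*
		 * Recheck so an unlock between the two steps isn't lost; a
		 * kick arriving before the halt must make the halt return
		 * immediately, which is what the PV unhalt machinery gives us.
		 */
		if (!(__atomic_load_n(lock, __ATOMIC_ACQUIRE) & QL_LOCKED))
			break;
		hv_halt(me);
	}
}

/* unlocker: only pay for the kick when someone actually went to sleep */
static void pv_unlock_kick(uint32_t *lock, int next)
{
	uint32_t old = __atomic_fetch_and(lock, ~(QL_LOCKED | QL_SLEEPING),
					  __ATOMIC_RELEASE);

	if (old & QL_SLEEPING)
		hv_kick(next);
}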

I think there is no need for the unfair lock bits.  1-2% is a pretty 
large hit.

Paolo

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Xen-devel] [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support
  2014-02-27 15:50             ` Paolo Bonzini
@ 2014-03-03 11:06               ` David Vrabel
  2014-03-03 11:06               ` David Vrabel
  1 sibling, 0 replies; 125+ messages in thread
From: David Vrabel @ 2014-03-03 11:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Boris Ostrovsky, x86, Ingo Molnar,
	Daniel J Blueman, xen-devel, Paul E. McKenney,
	Alexander Fyodorov, Arnd Bergmann, Scott J Norton,
	Steven Rostedt, Chris Wright, George Spelvin, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod

On 27/02/14 15:50, Paolo Bonzini wrote:
> Il 27/02/2014 16:22, Raghavendra K T ha scritto:
>> On 02/27/2014 08:15 PM, Paolo Bonzini wrote:
>> [...]
>>>> But neither of the VCPUs being kicked here are halted -- they're either
>>>> running or runnable (descheduled by the hypervisor).
>>>
>>> /me actually looks at Waiman's code...
>>>
>>> Right, this is really different from pvticketlocks, where the *unlock*
>>> primitive wakes up a sleeping VCPU.  It is more similar to PLE
>>> (pause-loop exiting).
>>
>> Adding to the discussion, I see there are two possibilities here,
>> considering that in undercommit cases we should not exceed
>> HEAD_SPIN_THRESHOLD,
>>
>> 1. the looping vcpu in pv_head_spin_check() should do halt()
>> considering that we have done enough spinning (more than typical
>> lock-hold time), and hence we are in potential overcommit.
>>
>> 2. multiplex kick_cpu to do directed yield in qspinlock case.
>> But this may result in some ping ponging?
> 
> Actually, I think the qspinlock can work roughly the same as the
> pvticketlock, using the same lock_spinning and unlock_lock hooks.

This is the approach I would like to see.  This would also work for Xen
PV guests.

The current implementation depends on hardware PLE, which Xen PV guests
do not support, and I'm not sure whether other architectures have
something similar (e.g., do ARM's virtualization extensions have PLE?).

David

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-02-28 16:38         ` Waiman Long
                             ` (2 preceding siblings ...)
  2014-03-03 17:43           ` Peter Zijlstra
@ 2014-03-03 17:43           ` Peter Zijlstra
  2014-03-04 15:27             ` Waiman Long
                               ` (5 more replies)
  3 siblings, 6 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-03-03 17:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 7758 bytes --]

Hi,

Here are some numbers for my version -- also attached is the test code.
I found that booting big machines is tediously slow so I lifted the
whole lot to userspace.

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
(2 socket) Intel Westmere-EP.

AMD (ticket)		AMD (qspinlock + pending + opt)

Local:                  Local:

1:    324.425530        1:    324.102142
2:  17141.324050        2:    620.185930
3:  52212.232343        3:  25242.574661
4:  93136.458314        4:  47982.037866
6: 167967.455965        6:  95345.011864
8: 245402.534869        8: 142412.451438

2 - nodes:              2 - nodes:

2:  12763.640956        2:   1879.460823
4:  94423.027123        4:  48278.719130
6: 167903.698361        6:  96747.767310
8: 257243.508294        8: 144672.846317

4 - nodes:              4 - nodes:

 4:  82408.853603        4:  49820.323075
 8: 260492.952355        8: 143538.264724
16: 630099.031148       16: 337796.553795



Intel (ticket)		Intel (qspinlock + pending + opt)

Local:                  Local:

1:    19.002249         1:    29.002844
2:  5093.275530         2:  1282.209519
3: 22300.859761         3: 22127.477388
4: 44929.922325         4: 44493.881832
6: 86338.755247         6: 86360.083940

2 - nodes:              2 - nodes:

2:   1509.193824        2:   1209.090219
4:  48154.495998        4:  48547.242379
8: 137946.787244        8: 141381.498125

---

There are a few curious facts I found (assuming my test code is sane).

 - Intel seems to be an order of magnitude faster on uncontended LOCKed
   ops compared to AMD

 - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
   than the uncontended ticket xadd -- although both are plenty fast
   when compared to AMD.

 - In general, replacing cmpxchg loops with unconditional atomic ops
   doesn't seem to matter a whole lot when the thing is contended.

Below is the (rather messy) qspinlock slow path code (the only thing
that really differs between our versions).

I'll try and slot your version in tomorrow.

---


/*
 * Exactly fills one cacheline on 64bit.
 */
static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);

static inline u32 encode_tail(int cpu, int idx)
{
	u32 code;

	code  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
	code |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */

	return code;
}

static inline struct mcs_spinlock *decode_tail(u32 code)
{
	int cpu = (code >> _Q_TAIL_CPU_OFFSET) - 1;
	int idx = (code >> _Q_TAIL_IDX_OFFSET) & _Q_TAIL_IDX_MASK;

	return per_cpu_ptr(&mcs_nodes[idx], cpu);
}

#define _QSPINLOCK_PENDING	(1U << _Q_PENDING_OFFSET)
#define _QSPINLOCK_MASK		(_QSPINLOCK_LOCKED | _QSPINLOCK_PENDING)

// PENDING - enables the pending bit logic
// OPT     - removes one atomic op at the cost of making pending a byte
// OPT2    - replaces some cmpxchg loops with unconditional atomic ops
//
// PENDING looks to be a win, even with 2 atomic ops on Intel, and a loss on AMD
// OPT is a full win
// OPT2 somehow doesn't seem to make much difference !?
//

/**
 * queue_spin_lock_slowpath - acquire the queue spinlock
 * @lock: Pointer to queue spinlock structure
 *
 *              fast      :    slow                                  :    unlock
 *                        :                                          :
 * uncontended  (0,0,0) --:--> (0,0,1) ------------------------------:--> (*,*,0)
 *                        :       | ^--------.------.             /  :
 *                        :       v           \      \            |  :
 * pending                :    (0,1,1) +--> (0,1,0)   \           |  :
 *                        :       | ^--'              |           |  :
 *                        :       v                   |           |  :
 * uncontended            :    (n,x,y) +--> (n,0,0) --'           |  :
 *   queue                :       | ^--'                          |  :
 *                        :       v                               |  :
 * contended              :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
 *   queue                :                                          :
 *
 */
void queue_spin_lock_slowpath(struct qspinlock *lock)
{
	struct mcs_spinlock *prev, *next, *node;
	u32 val, new, old, code;
	int idx;

#if PENDING
	/*
	 * trylock || pending
	 *
	 * 0,0,0 -> 0,0,1 ; trylock
	 * 0,0,1 -> 0,1,1 ; pending
	 */
	val = atomic_read(&lock->val);
#if !OPT2
	for (;;) {
		/*
		 * If we observe any contention; queue.
		 */
		if (val & ~_Q_LOCKED_MASK)
			goto queue;

		new = _QSPINLOCK_LOCKED;
		if (val == new)
			new |= _QSPINLOCK_PENDING;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}

	/*
	 * we won the trylock
	 */
	if (new == _QSPINLOCK_LOCKED)
		return;

#else
	/*
	 * we can ignore the (unlikely) trylock case and have a fall-through on
	 * the wait below.
	 */
	if (val & ~_Q_LOCKED_MASK)
		goto queue;

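	/*
	 * Set the pending byte (byte 1 of the lock word, per the OPT
	 * layout); a non-zero return means someone else already set it,
	 * so fall back to queueing.
	 */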
	if (xchg(&(((u8 *)lock)[1]), 1))
		goto queue;

// could not observe a significant difference
// between the one (xchg) and the other (bts) unconditional
// LOCKed op
//
//	if (atomic_test_and_set_bit(_Q_PENDING_OFFSET, &lock->val))
//		goto queue;
#endif

	/*
	 * we're pending, wait for the owner to go away.
	 */
	while ((val = atomic_read(&lock->val)) & _QSPINLOCK_LOCKED)
		cpu_relax();

	/*
	 * take ownership and clear the pending bit.
	 */
#if !OPT
	for (;;) {
		new = (val & ~_QSPINLOCK_PENDING) | _QSPINLOCK_LOCKED;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}
#else
	((u8 *)lock)[0] = 1; /* locked */
	smp_wmb();
	((u8 *)lock)[1] = 0; /* pending */

// there is a big difference between an atomic and
// no atomic op.
//
//	smp_mb__before_atomic_inc();
//	atomic_clear_bit(_Q_PENDING_OFFSET, &lock->val);
#endif

	return;

queue:
#endif
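	/*
	 * Grab this cpu's MCS node; ->count tracks lock nesting on this
	 * cpu and selects one of the four per-cpu slots (e.g. task,
	 * softirq, hardirq, nmi contexts).
	 */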
	node = this_cpu_ptr(&mcs_nodes[0]);
	idx = node->count++;
	code = encode_tail(smp_processor_id(), idx);

	node += idx;
	node->locked = 0;
	node->next = NULL;

	/*
	 * we already touched the queueing cacheline; don't bother with pending
	 * stuff.
	 *
	 * trylock || xchg(lock, node)
	 *
	 * 0,0,0 -> 0,0,1 ; trylock
	 * p,y,x -> n,y,x ; prev = xchg(lock, node)
	 */
	val = atomic_read(&lock->val);
#if !OPT2
	for (;;) {
		new = _QSPINLOCK_LOCKED;
		if (val)
			new = code | (val & _QSPINLOCK_MASK);

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}

	/*
	 * we won the trylock; forget about queueing.
	 */
	if (new == _QSPINLOCK_LOCKED)
		goto release;
#else
	/*
	 * Like with the pending case; we can ignore the unlikely trylock case
	 * and have a fall-through on the wait.
	 */
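	/* xchg only the upper 16-bit tail code; locked/pending bytes untouched */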
	old = xchg(&((u16 *)lock)[1], code >> 16) << 16;
#endif

	/*
	 * if there was a previous node; link it and wait.
	 */
	if (old & ~_QSPINLOCK_MASK) {
		prev = decode_tail(old);
		ACCESS_ONCE(prev->next) = node;

		arch_mcs_spin_lock_contended(&node->locked);
	}

	/*
	 * we're at the head of the waitqueue, wait for the owner & pending to
	 * go away.
	 *
	 * *,x,y -> *,0,0
	 */
	while ((val = atomic_read(&lock->val)) & _QSPINLOCK_MASK)
		cpu_relax();

	/*
	 * claim the lock:
	 *
	 * n,0,0 -> 0,0,1 : lock, uncontended
	 * *,0,0 -> *,0,1 : lock, contended
	 */
	for (;;) {
		new = _QSPINLOCK_LOCKED;
		if (val != code)
			new |= val;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}

	/*
	 * contended path; wait for next, release.
	 */
	if (new != _QSPINLOCK_LOCKED) {
		while (!(next = ACCESS_ONCE(node->next)))
			arch_mutex_cpu_relax();

		arch_mcs_spin_unlock_contended(&next->locked);
	}

release:
	/*
	 * release the node
	 */
	this_cpu_ptr(&mcs_nodes[0])->count--;
//	this_cpu_dec(mcs_nodes[0].count);
}
EXPORT_SYMBOL(queue_spin_lock_slowpath);
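
For context, here is a minimal sketch of the fast path that would invoke the
slow path above -- my own illustration under the same little-endian
(16-bit tail, pending byte, locked byte) layout, not part of the attached
test code:

static __always_inline void queue_spin_lock(struct qspinlock *lock)
{
	/* uncontended: (0,0,0) -> (0,0,1) */
	if (likely(atomic_cmpxchg(&lock->val, 0, _QSPINLOCK_LOCKED) == 0))
		return;
	queue_spin_lock_slowpath(lock);
}

static __always_inline void queue_spin_unlock(struct qspinlock *lock)
{
	/* (*,*,1) -> (*,*,0): release by clearing only the locked byte */
	smp_store_release((u8 *)lock, 0);
}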

[-- Attachment #2: spinlocks.tar.bz2 --]
[-- Type: application/octet-stream, Size: 10164 bytes --]

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  2014-03-03 10:55       ` Paolo Bonzini
  2014-03-04 15:15         ` Waiman Long
@ 2014-03-04 15:15         ` Waiman Long
  2014-03-04 15:23           ` Paolo Bonzini
                             ` (5 more replies)
  1 sibling, 6 replies; 125+ messages in thread
From: Waiman Long @ 2014-03-04 15:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 03/03/2014 05:55 AM, Paolo Bonzini wrote:
> Il 28/02/2014 18:06, Waiman Long ha scritto:
>> On 02/26/2014 12:07 PM, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Feb 26, 2014 at 10:14:24AM -0500, Waiman Long wrote:
>>>> Locking is always an issue in a virtualized environment as the virtual
>>>> CPU that is waiting on a lock may get scheduled out and hence block
>>>> any progress in lock acquisition even when the lock has been freed.
>>>>
>>>> One solution to this problem is to allow unfair lock in a
>>>> para-virtualized environment. In this case, a new lock acquirer can
>>>> come and steal the lock if the next-in-line CPU to get the lock is
>>>> scheduled out. Unfair lock in a native environment is generally not a
>>> Hmm, how do you know if the 'next-in-line CPU' is scheduled out? As
>>> in the hypervisor knows - but you as a guest might have no idea
>>> of it.
>>
>> I use a heart-beat counter to see if the other side responses within a
>> certain time limit. If not, I assume it has been scheduled out probably
>> due to PLE.
>
> PLE is unnecessary if you have "true" pv spinlocks where the 
> next-in-line schedules itself out with a hypercall (Xen) or hlt 
> instruction (KVM).  Set a bit in the qspinlock before going to sleep, 
> and the lock owner will know that it needs to kick the next-in-line.
>
> I think there is no need for the unfair lock bits.  1-2% is a pretty 
> large hit.
>
> Paolo

I don't think that PLE is something that can be controlled by software. 
It is done in hardware. I may be wrong. Anyway, I plan to add code to 
schedule out the CPUs waiting in the queue except the first 2 in the 
next version of the patch.
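
As a rough sketch of that direction (illustrative only: pv_wait()/pv_kick()
are hypothetical deschedule/wakeup primitives, SPIN_THRESHOLD is an assumed
tuning constant, and the kick-before-wait race that real code must handle is
glossed over), a deep queue waiter could halt itself after spinning too long:

static void pv_queue_wait(struct mcs_spinlock *node)
{
	int loop = SPIN_THRESHOLD;

	while (!ACCESS_ONCE(node->locked)) {
		if (--loop == 0) {
			pv_wait(smp_processor_id());	/* deschedule this vCPU */
			loop = SPIN_THRESHOLD;		/* re-arm after wakeup */
		}
		arch_mutex_cpu_relax();
	}
}

The predecessor would then pv_kick() a halted successor when it sets
next->locked during lock hand-off.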

The PV code in the v5 patch did seem to improve benchmark performance 
with moderate to heavy spinlock contention. However, I didn't see much 
CPU kicking going on. My theory is that the additional PV code 
complicates the pause loop timing so that the hardware PLE didn't kick 
in, whereas the original pause loop is pretty simple, causing PLE to 
happen fairly frequently.

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  2014-03-04 15:15         ` Waiman Long
  2014-03-04 15:23           ` Paolo Bonzini
@ 2014-03-04 15:23           ` Paolo Bonzini
  2014-03-04 15:39           ` David Vrabel
                             ` (3 subsequent siblings)
  5 siblings, 0 replies; 125+ messages in thread
From: Paolo Bonzini @ 2014-03-04 15:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

Il 04/03/2014 16:15, Waiman Long ha scritto:
>>
>> PLE is unnecessary if you have "true" pv spinlocks where the
>> next-in-line schedules itself out with a hypercall (Xen) or hlt
>> instruction (KVM).  Set a bit in the qspinlock before going to sleep,
>> and the lock owner will know that it needs to kick the next-in-line.
>>
>> I think there is no need for the unfair lock bits.  1-2% is a pretty
>> large hit.
>
> I don't think that PLE is something that can be controlled by software.
> It is done in hardware.

Yes, but the hypervisor decides *what* to do when the processor detects 
a pause-loop.

But my point is that if you have pv spinlocks, the processor in the end 
will never or almost never do a pause-loop exit.  PLE is mostly for 
legacy guests that don't have pv spinlocks.

Paolo

> I maybe wrong. Anyway, I plan to add code to
> schedule out the CPUs waiting in the queue except the first 2 in the
> next version of the patch.
>
> The PV code in the v5 patch did seem to improve benchmark performance
> with moderate to heavy spinlock contention. However, I didn't see much
> CPU kicking going on. My theory is that the additional PV code
> complicates the pause loop timing so that the hardware PLE didn't kick
> in, whereas the original pause loop is pretty simple causing PLE to
> happen fairly frequently.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-03-03 17:43           ` Peter Zijlstra
  2014-03-04 15:27             ` Waiman Long
@ 2014-03-04 15:27             ` Waiman Long
  2014-03-04 16:58             ` Peter Zijlstra
                               ` (3 subsequent siblings)
  5 siblings, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-03-04 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 03/03/2014 12:43 PM, Peter Zijlstra wrote:
> Hi,
>
> Here are some numbers for my version -- also attached is the test code.
> I found that booting big machines is tediously slow so I lifted the
> whole lot to userspace.
>
> I measure the cycles spend in arch_spin_lock() + arch_spin_unlock().
>
> The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
> (2 socket) Intel Westmere-EP.
>
> AMD (ticket)		AMD (qspinlock + pending + opt)
>
> Local:                  Local:
>
> 1:    324.425530        1:    324.102142
> 2:  17141.324050        2:    620.185930
> 3:  52212.232343        3:  25242.574661
> 4:  93136.458314        4:  47982.037866
> 6: 167967.455965        6:  95345.011864
> 8: 245402.534869        8: 142412.451438
>
> 2 - nodes:              2 - nodes:
>
> 2:  12763.640956        2:   1879.460823
> 4:  94423.027123        4:  48278.719130
> 6: 167903.698361        6:  96747.767310
> 8: 257243.508294        8: 144672.846317
>
> 4 - nodes:              4 - nodes:
>
>   4:  82408.853603        4:  49820.323075
>   8: 260492.952355        8: 143538.264724
> 16: 630099.031148       16: 337796.553795
>
>
>
> Intel (ticket)		Intel (qspinlock + pending + opt)
>
> Local:                  Local:
>
> 1:    19.002249         1:    29.002844
> 2:  5093.275530         2:  1282.209519
> 3: 22300.859761         3: 22127.477388
> 4: 44929.922325         4: 44493.881832
> 6: 86338.755247         6: 86360.083940
>
> 2 - nodes:              2 - nodes:
>
> 2:   1509.193824        2:   1209.090219
> 4:  48154.495998        4:  48547.242379
> 8: 137946.787244        8: 141381.498125
>
> ---
>
> There a few curious facts I found (assuming my test code is sane).
>
>   - Intel seems to be an order of magnitude faster on uncontended LOCKed
>     ops compared to AMD
>
>   - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
>     than the uncontended ticket xadd -- although both are plenty fast
>     when compared to AMD.
>
>   - In general, replacing cmpxchg loops with unconditional atomic ops
>     doesn't seem to matter a whole lot when the thing is contended.
>
> Below is the (rather messy) qspinlock slow path code (the only thing
> that really differs between our versions.
>
> I'll try and slot your version in tomorrow.
>
> ---
>

It is curious to see that the qspinlock code offers a big benefit on AMD 
machines, but not so much on Intel. Anyway, I am working on a revised 
version of the patch that includes some of your comments. I will also 
try to see if I can get an AMD machine to run tests on.

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  2014-03-04 15:15         ` Waiman Long
  2014-03-04 15:23           ` Paolo Bonzini
  2014-03-04 15:23           ` Paolo Bonzini
@ 2014-03-04 15:39           ` David Vrabel
  2014-03-04 15:39           ` David Vrabel
                             ` (2 subsequent siblings)
  5 siblings, 0 replies; 125+ messages in thread
From: David Vrabel @ 2014-03-04 15:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On 04/03/14 15:15, Waiman Long wrote:
> On 03/03/2014 05:55 AM, Paolo Bonzini wrote:
>> Il 28/02/2014 18:06, Waiman Long ha scritto:
>>> On 02/26/2014 12:07 PM, Konrad Rzeszutek Wilk wrote:
>>>> On Wed, Feb 26, 2014 at 10:14:24AM -0500, Waiman Long wrote:
>>>>> Locking is always an issue in a virtualized environment as the virtual
>>>>> CPU that is waiting on a lock may get scheduled out and hence block
>>>>> any progress in lock acquisition even when the lock has been freed.
>>>>>
>>>>> One solution to this problem is to allow unfair lock in a
>>>>> para-virtualized environment. In this case, a new lock acquirer can
>>>>> come and steal the lock if the next-in-line CPU to get the lock is
>>>>> scheduled out. Unfair lock in a native environment is generally not a
>>>> Hmm, how do you know if the 'next-in-line CPU' is scheduled out? As
>>>> in the hypervisor knows - but you as a guest might have no idea
>>>> of it.
>>>
>>> I use a heart-beat counter to see if the other side responses within a
>>> certain time limit. If not, I assume it has been scheduled out probably
>>> due to PLE.
>>
>> PLE is unnecessary if you have "true" pv spinlocks where the
>> next-in-line schedules itself out with a hypercall (Xen) or hlt
>> instruction (KVM).  Set a bit in the qspinlock before going to sleep,
>> and the lock owner will know that it needs to kick the next-in-line.
>>
>> I think there is no need for the unfair lock bits.  1-2% is a pretty
>> large hit.
>>
>> Paolo
> 
> I don't think that PLE is something that can be controlled by software.

You can avoid PLE by not issuing PAUSE instructions when spinning.  You
may want to consider this if you have a lock that explicitly deschedules
the VCPU while waiting (or just deschedule before PLE would trigger).
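
For instance (illustrative only, and ignoring SMT friendliness), dropping the
PAUSE hint from the head-of-queue spin keeps the hardware pause-loop detector
from ever firing:

static void head_spin_no_pause(struct qspinlock *lock)
{
	while (atomic_read(&lock->val) & _QSPINLOCK_LOCKED)
		barrier();	/* compiler barrier only; deliberately no PAUSE */
}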

> It is done in hardware. I maybe wrong. Anyway, I plan to add code to
> schedule out the CPUs waiting in the queue except the first 2 in the
> next version of the patch.

I think you should deschedule all waiters.

> The PV code in the v5 patch did seem to improve benchmark performance
> with moderate to heavy spinlock contention.

The goal of PV aware locks is to improve performance when locks are
contended /and/ VCPUs are over-committed.  Is this something you're
actually measuring?  It's not clear to me.

David

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-03-03 17:43           ` Peter Zijlstra
  2014-03-04 15:27             ` Waiman Long
  2014-03-04 15:27             ` Waiman Long
@ 2014-03-04 16:58             ` Peter Zijlstra
  2014-03-04 18:09               ` Peter Zijlstra
  2014-03-04 18:09               ` Peter Zijlstra
  2014-03-04 16:58             ` Peter Zijlstra
                               ` (2 subsequent siblings)
  5 siblings, 2 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-03-04 16:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 2871 bytes --]


Updated version; this includes numbers for my SNB desktop and Waiman's
variant.

Curiously, Waiman's version seems consistently slower on 2 cross-node
CPUs, whereas my version seems to have a problem on SNB with 2 CPUs.

There's something weird with the ticket lock numbers; when I compile
the code with:

  gcc (Debian 4.7.2-5) 4.7.2

I get the first set; when I compile with:

  gcc (Ubuntu/Linaro 4.7.3-2ubuntu4) 4.7.3

I get the second set; afaict the other locks don't seem to have this
problem, but I only just noticed.

---

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, a 2 node
(2 socket) Intel Westmere-EP and my i7-2600K (SNB) desktop.


(ticket)		(qspinlock + all)	(waiman)


AMD Interlagos

Local:

 1:    324.425530        1:    324.102142        1:    323.857834
 2:  17141.324050        2:    620.185930        2:    618.737681
 3:  52212.232343        3:  25242.574661        3:  24888.154161
 4:  93136.458314        4:  47982.037866        4:  48227.610976
 6: 167967.455965        6:  95345.011864        6:  94372.276116
 8: 245402.534869        8: 142412.451438        8: 140516.525791

 1: 324.071515
 2: 981.778516
 3: 24414.144262
 4: 50868.376667
 6: 99536.890504
 8: 151616.395779

2 - nodes:

 2:  12763.640956        2:   1879.460823        2:   2023.594014
 4:  94423.027123        4:  48278.719130        4:  48621.724929
 6: 167903.698361        6:  96747.767310        6:  95815.242461
 8: 257243.508294        8: 144672.846317        8: 143282.222038

 2:   1875.637053
 4:  50082.796058
 6: 107780.361523
 8: 163166.728218

4 - nodes:

 4:  82408.853603        4:  49820.323075	 4:  50566.036473
 8: 260492.952355        8: 143538.264724        8: 143485.584624
16: 630099.031148       16: 337796.553795       16: 333419.421338

 4: 55409.896671
 8: 167340.905593
16: 443195.057052



Intel WSM-EP

Local:

1:    19.002249         1:    29.002844		1:    28.979917
2:  5093.275530         2:  1282.209519         2:  1236.624785
3: 22300.859761         3: 22127.477388         3: 22336.241120
4: 44929.922325         4: 44493.881832         4: 44832.450598
6: 86338.755247         6: 86360.083940         6: 85808.603491

1: 20.009974
2: 1206.419074
3: 22071.535000
4: 44606.831373
6: 86498.760774

2 - nodes:

2:   1527.466159	2:   1227.051993	2:   1434.204666
4:  46004.232179        4:  46450.787234        4:  46999.356429
6:  89226.472057        6:  90124.984324        6:  90110.423115
8: 137225.472406        8: 137909.184358        8: 137988.290401


Intel SNB

Local:

1:    15.276759		1:    25.336807		1:    25.353041
2:   714.621152         2:   843.240641         2:   711.281211
3: 11339.104267         3: 11751.159167         3: 11684.286334
4: 22648.387376         4: 23454.798068         4: 22903.498910

[-- Attachment #2: spinlocks.tar.bz2 --]
[-- Type: application/octet-stream, Size: 14659 bytes --]

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-03-03 17:43           ` Peter Zijlstra
                               ` (4 preceding siblings ...)
  2014-03-04 17:48             ` Waiman Long
@ 2014-03-04 17:48             ` Waiman Long
  2014-03-04 22:40               ` Peter Zijlstra
  2014-03-04 22:40               ` Peter Zijlstra
  5 siblings, 2 replies; 125+ messages in thread
From: Waiman Long @ 2014-03-04 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

Peter,

I was trying to implement the generic queue code exchange code using
cmpxchg as suggested by you. However, when I gathered the performance
data, the code performed worse than I expected at a higher contention
level. Below are the execution times of the benchmark tool that I sent
you:

   # of tasks    Ticket lock    Queue lock     Queue lock
                                  [xchg]        [cmpxchg]
   ----------    -----------    ----------     ----------
        1             135           135            135
        2             732          1315           1102
        3            1827          2372           2681
        4            2689          2934           5392
        5            3736          3658           7696
        6            4942          4434           9876
        7            6304          5176          11901
        8            7736          5955          14551

Below is the code that I used:

static inline u32 queue_code_xchg(struct qspinlock *lock, u32 *ocode,
                                  u32 ncode)
{
        while (true) {
                u32 qlcode = atomic_read(&lock->qlcode);

                if (qlcode == 0) {
                        /*
                         * Try to get the lock
                         */
                        if (atomic_cmpxchg(&lock->qlcode, 0,
                                           _QSPINLOCK_LOCKED) == 0)
                                return 1;
                } else if (qlcode & _QSPINLOCK_LOCKED) {
                        *ocode = atomic_cmpxchg(&lock->qlcode, qlcode,
                                                ncode | _QSPINLOCK_LOCKED);
                        if (*ocode == qlcode) {
                                /* Clear lock bit before return */
                                *ocode &= ~_QSPINLOCK_LOCKED;
                                return 0;
                        }
                }
                /*
                 * Wait if atomic_cmpxchg() fails or the lock is
                 * temporarily free.
                 */
                arch_mutex_cpu_relax();
        }
}

My cmpxchg code is not optimal, and I can probably tune the code to
make it perform better. Given the trend that I was seeing, however,
I think I will keep the current xchg code, but I will package it in
an inline function.
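
For illustration only (this is not the actual patch code), one way such an
inline helper could look if it follows the halfword-exchange trick from the
OPT2 variant earlier in the thread -- assuming the queue code lives in the
upper 16 bits of qlcode, so the lock byte is never touched and there is no
embedded trylock:

static inline u32 queue_code_xchg(struct qspinlock *lock, u32 *ocode,
                                  u32 ncode)
{
        /* exchange only the 16-bit queue code; the lock bit is left alone */
        *ocode = (u32)xchg(&((u16 *)&lock->qlcode)[1], (u16)(ncode >> 16)) << 16;
        return 0;       /* never acquires the lock here; caller always queues */
}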

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment
  2014-03-04 15:15         ` Waiman Long
                             ` (4 preceding siblings ...)
  2014-03-04 17:50           ` Raghavendra K T
@ 2014-03-04 17:50           ` Raghavendra K T
  5 siblings, 0 replies; 125+ messages in thread
From: Raghavendra K T @ 2014-03-04 17:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Peter Zijlstra, virtualization, Andi Kleen,
	H. Peter Anvin, Michel Lespinasse, Alok Kataria, linux-arch, x86,
	Ingo Molnar, Scott J Norton, xen-devel, Paul E. McKenney,
	Alexander Fyodorov, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Daniel J Blueman, Oleg Nesterov,
	Steven Rostedt, Chris Wright, George Spelvin, Thomas Gleixner,
	Aswin Chandramouleeswaran, Chegu

> The PV code in the v5 patch did seem to improve benchmark performance
> with moderate to heavy spinlock contention. However, I didn't see much
> CPU kicking going on. My theory is that the additional PV code
> complicates the pause loop timing so that the hardware PLE didn't kick
> in, whereas the original pause loop is pretty simple causing PLE to
> happen fairly frequently.

You could play with the ple_gap parameter to make it work for bigger
spin loops in such cases.

^ permalink raw reply	[flat|nested] 125+ messages in thread


* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-03-04 16:58             ` Peter Zijlstra
  2014-03-04 18:09               ` Peter Zijlstra
@ 2014-03-04 18:09               ` Peter Zijlstra
  1 sibling, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-03-04 18:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On Tue, Mar 04, 2014 at 05:58:00PM +0100, Peter Zijlstra wrote:
>  2:  17141.324050        2:    620.185930        2:    618.737681

So I forgot that AMD has compute units that share L2:

root@interlagos:~/spinlocks# export LOCK=./ticket ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
982.938839
1325.702905
root@interlagos:~/spinlocks# export LOCK=./qspinlock-pending-opt2 ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
630.684313
999.119087
root@interlagos:~/spinlocks# export LOCK=./waiman ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
620.562791
1644.700639


Doing the same for Intel SMT, which shares L1:


root@westmere:~/spinlocks# export LOCK=./ticket ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
45.765302
1292.721827
root@westmere:~/spinlocks# export LOCK=./qspinlock-pending-opt2 ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
54.536890
1260.467527
root@westmere:~/spinlocks# export LOCK=./waiman ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
65.944794
1230.522895

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-03-04 17:48             ` Waiman Long
@ 2014-03-04 22:40               ` Peter Zijlstra
  2014-03-05 20:59                 ` Peter Zijlstra
  2014-03-05 20:59                 ` Peter Zijlstra
  2014-03-04 22:40               ` Peter Zijlstra
  1 sibling, 2 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-03-04 22:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote:
> Peter,
> 
> I was trying to implement the generic queue code exchange code using
> cmpxchg as suggested by you. However, when I gathered the performance
> data, the code performed worse than I expected at a higher contention
> level. Below were the execution time of the benchmark tool that I sent
> you:
> 
>                 [xchg]        [cmpxchg]
>   # of tasks    Ticket lock     Queue lock      Queue Lock
>   ----------    -----------     -----------     ----------
>        1          135            135              135
>        2          732           1315            1102
>        3         1827           2372            2681
>        4         2689           2934             5392
>        5         3736           3658             7696
>        6         4942           4434            9876
>        7         6304           5176           11901
>        8         7736           5955           14551
> 

I'm just not seeing that; with test-4 modified to take the AMD compute
units into account:

root@interlagos:~/spinlocks# LOCK=./qspinlock-pending-opt ./test-4.sh ; LOCK=./qspinlock-pending-opt2 ./test-4.sh
 4: 50783.509653
 8: 146295.875715
16: 332942.964709
 4: 51033.341441
 8: 146320.656285
16: 332586.355194

And the difference between opt and opt2 is that opt2 replaces 2 cmpxchg
loops with unconditional ops (xchg8 and xchg16).
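
(To illustrate the idea only, not the actual opt2 code: with a 16-bit
queue/tail code in the upper half of the little-endian 32-bit lock word,
a cmpxchg retry loop that publishes a new tail can collapse into a single
unconditional 16-bit exchange. The layout and names below are assumptions.)

static inline u16 xchg_tail16(struct qspinlock *lock, u16 ntail)
{
	/*
	 * One xchg16 instead of a cmpxchg loop; the low 16 bits
	 * (lock/pending bytes) are left untouched.
	 */
	return xchg((u16 *)&lock->qlcode + 1, ntail);
}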

And I'd think that 4 CPUs x 4 Nodes would be heavy contention.

I'll have another poke tomorrow, including verifying the asm; need to go
sleep now.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
  2014-03-04 22:40               ` Peter Zijlstra
  2014-03-05 20:59                 ` Peter Zijlstra
@ 2014-03-05 20:59                 ` Peter Zijlstra
  1 sibling, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2014-03-05 20:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, Boris Ostrovsky,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Alok Kataria, linux-arch, x86, Ingo Molnar, Scott J Norton,
	xen-devel, Paul E. McKenney, Alexander Fyodorov, Rik van Riel,
	Arnd Bergmann, Konrad Rzeszutek Wilk, Daniel J Blueman,
	Oleg Nesterov, Steven Rostedt, Chris Wright, George Spelvin,
	Thomas Gleixner

On Tue, Mar 04, 2014 at 11:40:43PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote:
> > Peter,
> > 
> > I was trying to implement the generic queue code exchange code using
> > cmpxchg as suggested by you. However, when I gathered the performance
> > data, the code performed worse than I expected at a higher contention
> > level. Below were the execution time of the benchmark tool that I sent
> > you:
> 
> I'm just not seeing that; with test-4 modified to take the AMD compute
> units into account:

OK; I tried on a few larger machines and I can indeed see it there.

That said, our code doesn't differ that much. I see why you're not doing
too well on the 2-CPU contention case: you've got one atomic op too many
in that path. But given that you see a benefit even with 2 atomic ops (I
had mixed results on that), we can do the pending/waiter thing
unconditionally for NR_CPUS>16k.

I also think I can do your full xchg thing without allowing lock steals.

I'll try and do a full series tomorrow that starts with simple code and
builds on that, doing each optimization one by one.
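
Roughly, the "pending/waiter" idea looks like the sketch below; the
constant names and layout are purely illustrative (an assumption, not the
actual series). A single pending bit lets the first contender spin on the
lock word itself, so the 2-task case never touches an MCS node:

#define _Q_LOCKED_VAL	1U
#define _Q_PENDING_VAL	2U

static inline bool queue_spin_trylock_pending(struct qspinlock *lock)
{
	u32 val = atomic_read(&lock->qlcode);

	/* Someone is already pending or queued: fall back to the queue. */
	if (val & ~_Q_LOCKED_VAL)
		return false;
	if (atomic_cmpxchg(&lock->qlcode, val, val | _Q_PENDING_VAL) != val)
		return false;
	/* We are the one and only pending waiter; wait for the owner. */
	while (atomic_read(&lock->qlcode) & _Q_LOCKED_VAL)
		arch_mutex_cpu_relax();
	/* Owner is gone; convert our pending bit into the lock bit. */
	atomic_sub(_Q_PENDING_VAL - _Q_LOCKED_VAL, &lock->qlcode);
	return true;
}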

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-03-02 13:31   ` Oleg Nesterov
@ 2014-03-04 14:58       ` Waiman Long
  2014-03-04 14:58     ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-03-04 14:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Alok Kataria, Aswin Chandramouleeswaran,
	Chegu Vinod

On 03/02/2014 08:31 AM, Oleg Nesterov wrote:
> Forgot to ask...
>
> On 02/26, Waiman Long wrote:
>> +notify_next:
>> +	/*
>> +	 * Wait, if needed, until the next one in queue set up the next field
>> +	 */
>> +	while (!(next = ACCESS_ONCE(node->next)))
>> +		arch_mutex_cpu_relax();
>> +	/*
>> +	 * The next one in queue is now at the head
>> +	 */
>> +	smp_store_release(&next->wait, false);
> Do we really need smp_store_release()? It seems that we can rely on the
> control dependency here. And afaics there is no need to serialise this
> store with other changes in *lock, plus they all have mb's anyway.
>
> Oleg.
>

I am just following the current logic in the mcs_spin_unlock function. 
It is probably true that we don't need the release semantic in this 
particular case.
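
For reference, the alternative being discussed would be something like the
sketch below, relying on the control dependency from the ACCESS_ONCE()
load instead of a release store (illustrative only, not committed code):

	while (!(next = ACCESS_ONCE(node->next)))
		arch_mutex_cpu_relax();
	/* Plain store; ordering comes from the control dependency above. */
	ACCESS_ONCE(next->wait) = false;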

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-03-02 13:12   ` Oleg Nesterov
@ 2014-03-04 14:46       ` Waiman Long
  2014-03-04 14:46       ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-03-04 14:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Rik van Riel, Arnd Bergmann,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Alok Kataria, Aswin Chandramouleeswaran,
	Chegu Vinod

On 03/02/2014 08:12 AM, Oleg Nesterov wrote:
> On 02/26, Waiman Long wrote:
>> +void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>> +{
>> +	unsigned int cpu_nr, qn_idx;
>> +	struct qnode *node, *next;
>> +	u32 prev_qcode, my_qcode;
>> +
>> +	/*
>> +	 * Get the queue node
>> +	 */
>> +	cpu_nr = smp_processor_id();
>> +	node   = get_qnode(&qn_idx);
>> +
>> +	/*
>> +	 * It should never happen that all the queue nodes are being used.
>> +	 */
>> +	BUG_ON(!node);
>> +
>> +	/*
>> +	 * Set up the new cpu code to be exchanged
>> +	 */
>> +	my_qcode = queue_encode_qcode(cpu_nr, qn_idx);
>> +
>> +	/*
>> +	 * Initialize the queue node
>> +	 */
>> +	node->wait = true;
>> +	node->next = NULL;
>> +
>> +	/*
>> +	 * The lock may be available at this point, try again if no task was
>> +	 * waiting in the queue.
>> +	 */
>> +	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
>> +		put_qnode();
>> +		return;
>> +	}
> Cosmetic, but probably "goto release_node" would be more consistent.

Yes, that is true.

> And I am wondering how much this "qsval >> _QCODE_OFFSET" check can help.
> Note that this is the only usage of this arg, perhaps it would be better
> to simply remove it and shrink the caller's code a bit? It is also used
> in 3/8, but we can read the "fresh" value of ->qlcode (trylock does this
> anyway), and perhaps it can actually help if it is already unlocked.

First of all, there is no shrinkage in the caller code even if the qsval 
argument is removed, at least for x86. The caller simply arranges for the 
return register of the cmpxchg instruction to be the 2nd function 
parameter register.

When the lock is lightly contended, there isn't much difference between 
checking qsval and reading a fresh copy of qlcode. However, when the lock 
is heavily contended, every additional read or write will contribute to 
the cacheline bouncing traffic. The code was written to minimize those 
optional read requests.

>> +	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
>> +	/*
>> +	 * It is possible that we may accidentally steal the lock. If this is
>> +	 * the case, we need to either release it if not the head of the queue
>> +	 * or get the lock and be done with it.
>> +	 */
>> +	if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
>> +		if (prev_qcode == 0) {
>> +			/*
>> +			 * Got the lock since it is at the head of the queue
>> +			 * Now try to atomically clear the queue code.
>> +			 */
>> +			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
>> +					  _QSPINLOCK_LOCKED) == my_qcode)
>> +				goto release_node;
>> +			/*
>> +			 * The cmpxchg fails only if one or more tasks
>> +			 * are added to the queue. In this case, we need to
>> +			 * notify the next one to be the head of the queue.
>> +			 */
>> +			goto notify_next;
>> +		}
>> +		/*
>> +		 * Accidentally steal the lock, release the lock and
>> +		 * let the queue head get it.
>> +		 */
>> +		queue_spin_unlock(lock);
>> +	} else
>> +		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
> You know, actually I started this email because I thought that "goto notify_next"
> is wrong, I misread the patch as if this "goto" can happen even if prev_qcode != 0.
>
> So feel free to ignore, all my comments are cosmetic/subjective, but to me it
> would be more clean/clear to rewrite the code above as
>
> 	if (prev_qcode == 0) {
> 		if (atomic_cmpxchg(..., _QSPINLOCK_LOCKED) == my_qcode)
> 			goto release_node;
> 		goto notify_next;
> 	}
>
> 	if (prev_qcode & _QSPINLOCK_LOCKED)
> 		prev_qcode &= ~_QSPINLOCK_LOCKED;
> 	else
> 		queue_spin_unlock(lock);
>

This part of the code causes confusion and makes it harder to read. I am 
planning to rewrite it to use cmpxchg to make sure that it won't 
accidentally steal the lock. That should make the code easier to 
understand and make it possible to write better-optimized code in other 
parts of the function.

>> +	while (true) {
>> +		u32 qcode;
>> +		int retval;
>> +
>> +		retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
>> +		if (retval > 0)
>> +			;	/* Lock not available yet */
>> +		else if (retval < 0)
>> +			/* Lock taken, can release the node & return */
>> +			goto release_node;
> I guess this is for 3/8 which adds the optimized version of
> queue_get_lock_qcode(), so perhaps this "retval < 0" block can go into 3/8
> as well.
>

Yes, that is true.

>> +		else if (qcode != my_qcode) {
>> +			/*
>> +			 * Just get the lock with other spinners waiting
>> +			 * in the queue.
>> +			 */
>> +			if (queue_spin_setlock(lock))
>> +				goto notify_next;
> OTOH, at least the generic (non-optimized) version of queue_spin_setlock()
> could probably accept "qcode" and avoid atomic_read() + _QSPINLOCK_LOCKED
> check.
>

Will do so.

Thanks for the comments.

-Longman

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-27  4:32 ` Waiman Long
  2014-03-02 13:12   ` Oleg Nesterov
  2014-03-02 13:12   ` Oleg Nesterov
@ 2014-03-02 13:31   ` Oleg Nesterov
  2014-03-04 14:58       ` Waiman Long
  2014-03-04 14:58     ` Waiman Long
  2014-03-02 13:31   ` Oleg Nesterov
  3 siblings, 2 replies; 125+ messages in thread
From: Oleg Nesterov @ 2014-03-02 13:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Arnd Bergmann, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Boris Ostrovsky,
	linux-kernel

Forgot to ask...

On 02/26, Waiman Long wrote:
>
> +notify_next:
> +	/*
> +	 * Wait, if needed, until the next one in queue set up the next field
> +	 */
> +	while (!(next = ACCESS_ONCE(node->next)))
> +		arch_mutex_cpu_relax();
> +	/*
> +	 * The next one in queue is now at the head
> +	 */
> +	smp_store_release(&next->wait, false);

Do we really need smp_store_release()? It seems that we can rely on the
control dependency here. And afaics there is no need to serialise this
store with other changes in *lock, plus they all have mb's anyway.

Oleg.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-27  4:32 ` Waiman Long
@ 2014-03-02 13:12   ` Oleg Nesterov
  2014-03-04 14:46     ` Waiman Long
  2014-03-04 14:46       ` Waiman Long
  2014-03-02 13:12   ` Oleg Nesterov
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 125+ messages in thread
From: Oleg Nesterov @ 2014-03-02 13:12 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, Peter Zijlstra,
	virtualization, Andi Kleen, H. Peter Anvin, Michel Lespinasse,
	Thomas Gleixner, linux-arch, Gleb Natapov, x86, Ingo Molnar,
	xen-devel, Paul E. McKenney, Arnd Bergmann, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Alok Kataria,
	Aswin Chandramouleeswaran, Chegu Vinod, Boris Ostrovsky,
	linux-kernel

On 02/26, Waiman Long wrote:
>
> +void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
> +{
> +	unsigned int cpu_nr, qn_idx;
> +	struct qnode *node, *next;
> +	u32 prev_qcode, my_qcode;
> +
> +	/*
> +	 * Get the queue node
> +	 */
> +	cpu_nr = smp_processor_id();
> +	node   = get_qnode(&qn_idx);
> +
> +	/*
> +	 * It should never happen that all the queue nodes are being used.
> +	 */
> +	BUG_ON(!node);
> +
> +	/*
> +	 * Set up the new cpu code to be exchanged
> +	 */
> +	my_qcode = queue_encode_qcode(cpu_nr, qn_idx);
> +
> +	/*
> +	 * Initialize the queue node
> +	 */
> +	node->wait = true;
> +	node->next = NULL;
> +
> +	/*
> +	 * The lock may be available at this point, try again if no task was
> +	 * waiting in the queue.
> +	 */
> +	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
> +		put_qnode();
> +		return;
> +	}

Cosmetic, but probably "goto release_node" would be more consistent.

And I am wondering how much this "qsval >> _QCODE_OFFSET" check can help.
Note that this is the only usage of this arg, perhaps it would be better
to simply remove it and shrink the caller's code a bit? It is also used
in 3/8, but we can read the "fresh" value of ->qlcode (trylock does this
anyway), and perhaps it can actually help if it is already unlocked.

> +	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
> +	/*
> +	 * It is possible that we may accidentally steal the lock. If this is
> +	 * the case, we need to either release it if not the head of the queue
> +	 * or get the lock and be done with it.
> +	 */
> +	if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
> +		if (prev_qcode == 0) {
> +			/*
> +			 * Got the lock since it is at the head of the queue
> +			 * Now try to atomically clear the queue code.
> +			 */
> +			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
> +					  _QSPINLOCK_LOCKED) == my_qcode)
> +				goto release_node;
> +			/*
> +			 * The cmpxchg fails only if one or more tasks
> +			 * are added to the queue. In this case, we need to
> +			 * notify the next one to be the head of the queue.
> +			 */
> +			goto notify_next;
> +		}
> +		/*
> +		 * Accidentally steal the lock, release the lock and
> +		 * let the queue head get it.
> +		 */
> +		queue_spin_unlock(lock);
> +	} else
> +		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */

You know, actually I started this email because I thought that "goto notify_next"
is wrong, I misread the patch as if this "goto" can happen even if prev_qcode != 0.

So feel free to ignore, all my comments are cosmetic/subjective, but to me it
would be more clean/clear to rewrite the code above as

	if (prev_qcode == 0) {
		if (atomic_cmpxchg(..., _QSPINLOCK_LOCKED) == my_qcode)
			goto release_node;
		goto notify_next;
	}

	if (prev_qcode & _QSPINLOCK_LOCKED)
		prev_qcode &= ~_QSPINLOCK_LOCKED;
	else
		queue_spin_unlock(lock);


> +	while (true) {
> +		u32 qcode;
> +		int retval;
> +
> +		retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
> +		if (retval > 0)
> +			;	/* Lock not available yet */
> +		else if (retval < 0)
> +			/* Lock taken, can release the node & return */
> +			goto release_node;

I guess this is for 3/8 which adds the optimized version of
queue_get_lock_qcode(), so perhaps this "retval < 0" block can go into 3/8
as well.

> +		else if (qcode != my_qcode) {
> +			/*
> +			 * Just get the lock with other spinners waiting
> +			 * in the queue.
> +			 */
> +			if (queue_spin_setlock(lock))
> +				goto notify_next;

OTOH, at least the generic (non-optimized) version of queue_spin_setlock()
could probably accept "qcode" and avoid atomic_read() + _QSPINLOCK_LOCKED
check.
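
Something along these lines, for illustration (a sketch that assumes the
caller already holds the current qcode with the lock bit clear; not
proposed code):

static inline int queue_spin_setlock(struct qspinlock *lock, u32 qcode)
{
	/* qcode is the current lock word with the lock bit known clear. */
	return atomic_cmpxchg(&lock->qlcode, qcode,
			      qcode | _QSPINLOCK_LOCKED) == qcode;
}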

But once again, please feel free to ignore.

Oleg.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-27  4:32 Waiman Long
  2014-02-27  4:32 ` [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
@ 2014-02-27  4:32 ` Waiman Long
  2014-03-02 13:12   ` Oleg Nesterov
                     ` (3 more replies)
  1 sibling, 4 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27  4:32 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Rik van Riel,
	Konrad Rzeszutek Wilk, Scott J Norton, Steven Rostedt,
	Chris Wright, Oleg Nesterov, Boris Ostrovsky,
	Aswin Chandramouleeswaran, Chegu Vinod, Waiman Long,
	linux-kernel, David Vrabel, Andrew

This patch introduces a new queue spinlock implementation that can
serve as an alternative to the default ticket spinlock. Compared with
the ticket spinlock, this queue spinlock should be almost as fair as
the ticket spinlock. It has about the same speed in the single-threaded
case and can be much faster in high-contention situations. Only in light to
moderate contention where the average queue depth is around 1-3 will
this queue spinlock be potentially a bit slower due to the higher
slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores, as the chance of spinlock contention is much higher
on those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
in one spinlock at any one time. Of course, an interrupt handler can try
to acquire one spinlock while the interrupted user process is in the
process of getting another spinlock. By allocating a set of per-cpu
queue nodes and using them to form a waiting queue, we can encode the
queue node address into a much smaller 16-bit size. Together with
the 1-byte lock bit, this queue spinlock implementation will only
need 4 bytes to hold all the information that it needs.

The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7  : the locked byte
Bits 8-9  : queue node index in the per-cpu array (4 entries)
Bits 10-31: cpu number + 1 (max cpus = 4M -1)

In the extremely unlikely case that all the queue node entries are
used up, the current code will fall back to busy spinning without
waiting in a queue, with a warning message.
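
For illustration, the encoding above could be implemented roughly as
follows (a sketch only; the actual helper in the patch may differ in
detail):

#define _QCODE_OFFSET	8

static inline u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx)
{
	/* Bits 8-9: per-cpu node index, bits 10-31: cpu number + 1 */
	return ((cpu_nr + 1) << (_QCODE_OFFSET + 2)) |
	       (qn_idx << _QCODE_OFFSET);
}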

For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4Ghz Westmere x86-64 CPU.  The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

  Lock Type			Time (ns)
  ---------			---------
  Ticket spinlock		  14.1
  Queue spinlock		   8.8

So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly
due to the fact that the ticket spinlock does a read-modify-write (add)
instruction in unlock, whereas the queue spinlock only does a simple write
in unlock, which can be much faster on a pipelined CPU.
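
As an aside, the simple-write unlock referred to here is provided by the
architecture-specific (x86) header rather than the generic atomic_sub()
version further below; a minimal sketch, assuming the lock byte is the
least significant byte of the word:

static inline void queue_spin_unlock(struct qspinlock *lock)
{
	barrier();
	ACCESS_ONCE(*(u8 *)&lock->qlcode) = 0;	/* clear the lock byte */
	barrier();
}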

The AIM7 benchmark was run on a 8-socket 80-core DL980 with Westmere
x86-64 CPUs with XFS filesystem on a ramdisk and HT off to evaluate
the performance impact of this patch on a 3.13 kernel.

  +------------+----------+-----------------+---------+
  | Kernel     | 3.13 JPM |    3.13 with    | %Change |
  |            |          | qspinlock patch |	      |
  +------------+----------+-----------------+---------+
  |		      10-100 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   357459 |      363109     |  +1.58% |
  |dbase       |   496847 |      498801	    |  +0.39% |
  |disk        |  2925312 |     2771387     |  -5.26% |
  |five_sec    |   166612 |      169215     |  +1.56% |
  |fserver     |   382129 |      383279     |  +0.30% |
  |high_systime|    16356 |       16380     |  +0.15% |
  |short       |  4521978 |     4257363     |  -5.85% |
  +------------+----------+-----------------+---------+
  |		     200-1000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   449070 |      447711     |  -0.30% |
  |dbase       |   845029 |      853362	    |  +0.99% |
  |disk        |  2725249 |     4892907     | +79.54% |
  |five_sec    |   169410 |      170638     |  +0.72% |
  |fserver     |   489662 |      491828     |  +0.44% |
  |high_systime|   142823 |      143790     |  +0.68% |
  |short       |  7435288 |     9016171     | +21.26% |
  +------------+----------+-----------------+---------+
  |		     1100-2000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   432470 |      432570     |  +0.02% |
  |dbase       |   889289 |      890026	    |  +0.08% |
  |disk        |  2565138 |     5008732     | +95.26% |
  |five_sec    |   169141 |      170034     |  +0.53% |
  |fserver     |   498569 |      500701     |  +0.43% |
  |high_systime|   229913 |      245866     |  +6.94% |
  |short       |  8496794 |     8281918     |  -2.53% |
  +------------+----------+-----------------+---------+

The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:

 26.19%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--47.28%-- evict
              |--46.87%-- inode_sb_list_add
              |--1.24%-- xlog_cil_insert_items
              |--0.68%-- __remove_inode_hash
              |--0.67%-- inode_wait_for_writeback
               --3.26%-- [...]
 22.96%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  5.56%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  4.87%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.04%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.30%    reaim  [kernel.kallsyms]  [k] memcpy
  1.08%    reaim  [unknown]          [.] 0x0000003c52009447

There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.

With the patch, the perf profile at 1500 users became:

 26.82%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  4.66%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  3.97%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.40%    reaim  [kernel.kallsyms]  [k] queue_spin_lock_slowpath
              |--88.31%-- _raw_spin_lock
              |          |--36.02%-- inode_sb_list_add
              |          |--35.09%-- evict
              |          |--16.89%-- xlog_cil_insert_items
              |          |--6.30%-- try_to_wake_up
              |          |--2.20%-- _xfs_buf_find
              |          |--0.75%-- __remove_inode_hash
              |          |--0.72%-- __mutex_lock_slowpath
              |          |--0.53%-- load_balance
              |--6.02%-- _raw_spin_lock_irqsave
              |          |--74.75%-- down_trylock
              |          |--9.69%-- rcu_check_quiescent_state
              |          |--7.47%-- down
              |          |--3.57%-- up
              |          |--1.67%-- rwsem_wake
              |          |--1.00%-- remove_wait_queue
              |          |--0.56%-- pagevec_lru_move_fn
              |--5.39%-- _raw_spin_lock_irq
              |          |--82.05%-- rwsem_down_read_failed
              |          |--10.48%-- rwsem_down_write_failed
              |          |--4.24%-- __down
              |          |--2.74%-- __schedule
               --0.28%-- [...]
  2.20%    reaim  [kernel.kallsyms]  [k] memcpy
  1.84%    reaim  [unknown]          [.] 0x000000000041517b
  1.77%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--21.08%-- xlog_cil_insert_items
              |--10.14%-- xfs_icsb_modify_counters
              |--7.20%-- xfs_iget_cache_hit
              |--6.56%-- inode_sb_list_add
              |--5.49%-- _xfs_buf_find
              |--5.25%-- evict
              |--5.03%-- __remove_inode_hash
              |--4.64%-- __mutex_lock_slowpath
              |--3.78%-- selinux_inode_free_security
              |--2.95%-- xfs_inode_is_filestream
              |--2.35%-- try_to_wake_up
              |--2.07%-- xfs_inode_set_reclaim_tag
              |--1.52%-- list_lru_add
              |--1.16%-- xfs_inode_clear_eofblocks_tag
		  :
  1.30%    reaim  [kernel.kallsyms]  [k] effective_load
  1.27%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.10%    reaim  [kernel.kallsyms]  [k] security_compute_sid

On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/asm-generic/qspinlock.h       |  122 ++++++++++
 include/asm-generic/qspinlock_types.h |   55 +++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  393 +++++++++++++++++++++++++++++++++
 5 files changed, 578 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..08da60f
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & _QSPINLOCK_LOCKED;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !(atomic_read(&lock.qlcode) & _QSPINLOCK_LOCKED);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & ~_QSPINLOCK_LOCK_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->qlcode) &&
+	   (atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	int qsval;
+
+	/*
+	 * To reduce memory access to only once for the cold cache case,
+	 * a direct cmpxchg() is performed in the fastpath to optimize the
+	 * uncontended case. The contended performance, however, may suffer
+	 * a bit because of that.
+	 */
+	qsval = atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED);
+	if (likely(qsval == 0))
+		return;
+	queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	/*
+	 * Use an atomic subtraction to clear the lock bit.
+	 */
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QSPINLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializer
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queue_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queue_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queue_spin_lock(l)
+#define arch_spin_trylock(l)		queue_spin_trylock(l)
+#define arch_spin_unlock(l)		queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)	queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..df981d0
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,55 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file inclusion via paravirt_types.h. A workaround is
+ * to include paravirt_types.h here in this case.
+ */
+#ifdef CONFIG_PARAVIRT
+# include <asm/paravirt_types.h>
+#else
+# include <linux/types.h>
+# include <linux/atomic.h>
+#endif
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * For NR_CPUS >= 16K, the bit assignment is:
+ *   Bit  0   : Set if locked
+ *   Bits 1-7 : Not used
+ *   Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bit assignment is:
+ *   Bit   0   : Set if locked
+ *   Bits  1-7 : Not used
+ *   Bits  8-15: Reserved for architecture specific optimization
+ *   Bits 16-31: Queue code
+ */
+typedef struct qspinlock {
+	atomic_t	qlcode;	/* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET		8
+#define _QSPINLOCK_LOCKED	1U
+#define	_QSPINLOCK_LOCK_MASK	0xff
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..f185584 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+	bool
+
+config QUEUE_SPINLOCK
+	def_bool y if ARCH_USE_QUEUE_SPINLOCK
+	depends on SMP && !PARAVIRT_SPINLOCKS
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..e3b3293 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
 obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
 obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..ed5efa7
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,393 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
+ *
+ * The 16-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-15: CPU number + 1   (16K - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 cannot be used as a valid CPU number, we need to add
+ * 1 to it before putting it into the queue code.
+ */
+#define MAX_QNODES		4
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET	_QCODE_OFFSET
+#endif
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. This structure is retained for future extension
+ * where new fields may be added.
+ */
+struct qnode {
+	u32		 wait;		/* Waiting flag		*/
+	struct qnode	*next;		/* Next queue node addr */
+};
+
+struct qnode_set {
+	struct qnode	nodes[MAX_QNODES];
+	int		node_idx;	/* Current node to use */
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { {{0}}, 0 };
+
+/*
+ ************************************************************************
+ * The following optimized code is for architectures that support:	*
+ *  1) Atomic byte and short data write					*
+ *  2) Byte and short data exchange and compare-exchange instructions	*
+ *									*
+ * For those architectures, their asm/qspinlock.h header file should	*
+ * define the following in order to use the optimized code.		*
+ *  1) The _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS macro			*
+ *  2) A smp_u8_store_release() macro for byte size store operation	*
+ *  3) A "union arch_qspinlock" structure that includes the individual	*
+ *     fields of the qspinlock structure, including:			*
+ *      o slock - the qspinlock structure				*
+ *      o lock  - the lock byte						*
+ *									*
+ ************************************************************************
+ */
+#ifdef _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
+/**
+ * queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	if (!ACCESS_ONCE(qlock->lock) &&
+	   (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+#else /*  _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS  */
+/*
+ * Generic functions for architectures that do not support atomic
+ * byte or short data types.
+ */
+/**
+ * queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+			return 1;
+	return 0;
+}
+#endif /* _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS */
+
+/*
+ ************************************************************************
+ * Inline functions used by the queue_spin_lock_slowpath() function	*
+ * that may get superseded by a more optimized version.			*
+ ************************************************************************
+ */
+
+#ifndef queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value (not used)
+ * Return : > 0 if lock is not available, = 0 if lock is free
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	*qcode = qlcode;
+	return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* queue_get_lock_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+#ifndef queue_encode_qcode
+/**
+ * queue_encode_qcode - Encode the CPU number & node index into a qnode code
+ * @cpu_nr: CPU number
+ * @qn_idx: Queue node index
+ * Return : A qnode code that can be saved into the qspinlock structure
+ *
+ * The lock bit is set in the encoded 32-bit value as the need to encode
+ * a qnode means that the lock should have been taken.
+ */
+static u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx)
+{
+	return ((cpu_nr + 1) << (_QCODE_VAL_OFFSET + 2)) |
+		(qn_idx << _QCODE_VAL_OFFSET) | _QSPINLOCK_LOCKED;
+}
+#endif /* queue_encode_qcode */
+
+/*
+ ************************************************************************
+ * Other inline functions needed by the queue_spin_lock_slowpath()	*
+ * function.								*
+ ************************************************************************
+ */
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+	u32 cpu_nr = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+	u8  qn_idx = (qcode >> _QCODE_VAL_OFFSET) & 3;
+
+	return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+/**
+ * get_qnode - Get a queue node address
+ * @qn_idx: Pointer to queue node index [out]
+ * Return : queue node address & queue node index in qn_idx, or NULL if
+ *	    no free queue node available.
+ */
+static struct qnode *get_qnode(unsigned int *qn_idx)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+	int i;
+
+	if (unlikely(qset->node_idx >= MAX_QNODES))
+		return NULL;
+	i = qset->node_idx++;
+	*qn_idx = i;
+	return &qset->nodes[i];
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static void put_qnode(void)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+
+	qset->node_idx--;
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+	unsigned int cpu_nr, qn_idx;
+	struct qnode *node, *next;
+	u32 prev_qcode, my_qcode;
+
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = get_qnode(&qn_idx);
+
+	/*
+	 * It should never happen that all the queue nodes are being used.
+	 */
+	BUG_ON(!node);
+
+	/*
+	 * Set up the new cpu code to be exchanged
+	 */
+	my_qcode = queue_encode_qcode(cpu_nr, qn_idx);
+
+	/*
+	 * Initialize the queue node
+	 */
+	node->wait = true;
+	node->next = NULL;
+
+	/*
+	 * The lock may be available at this point, try again if no task was
+	 * waiting in the queue.
+	 */
+	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
+		put_qnode();
+		return;
+	}
+
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
+	/*
+	 * It is possible that we may accidentally steal the lock. If this is
+	 * the case, we need to either release it if not the head of the queue
+	 * or get the lock and be done with it.
+	 */
+	if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
+		if (prev_qcode == 0) {
+			/*
+			 * Got the lock since it is at the head of the queue
+			 * Now try to atomically clear the queue code.
+			 */
+			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
+					  _QSPINLOCK_LOCKED) == my_qcode)
+				goto release_node;
+			/*
+			 * The cmpxchg fails only if one or more tasks
+			 * are added to the queue. In this case, we need to
+			 * notify the next one to be the head of the queue.
+			 */
+			goto notify_next;
+		}
+		/*
+		 * Accidentally steal the lock, release the lock and
+		 * let the queue head get it.
+		 */
+		queue_spin_unlock(lock);
+	} else
+		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
+	my_qcode &= ~_QSPINLOCK_LOCKED;
+
+	if (prev_qcode) {
+		/*
+		 * Not at the queue head, get the address of the previous node
+		 * and set up the "next" field of that node.
+		 */
+		struct qnode *prev = xlate_qcode(prev_qcode);
+
+		ACCESS_ONCE(prev->next) = node;
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (smp_load_acquire(&node->wait))
+			arch_mutex_cpu_relax();
+	}
+
+	/*
+	 * At the head of the wait queue now
+	 */
+	while (true) {
+		u32 qcode;
+		int retval;
+
+		retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
+		if (retval > 0)
+			;	/* Lock not available yet */
+		else if (retval < 0)
+			/* Lock taken, can release the node & return */
+			goto release_node;
+		else if (qcode != my_qcode) {
+			/*
+			 * Just get the lock with other spinners waiting
+			 * in the queue.
+			 */
+			if (queue_spin_setlock(lock))
+				goto notify_next;
+		} else {
+			/*
+			 * Get the lock & clear the queue code simultaneously
+			 */
+			if (queue_spin_trylock_and_clr_qcode(lock, qcode))
+				/* No need to notify the next one */
+				goto release_node;
+		}
+		arch_mutex_cpu_relax();
+	}
+
+notify_next:
+	/*
+	 * Wait, if needed, until the next one in queue sets up the next field
+	 */
+	while (!(next = ACCESS_ONCE(node->next)))
+		arch_mutex_cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+	smp_store_release(&next->wait, false);
+
+release_node:
+	put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation
  2014-02-27  4:32 Waiman Long
@ 2014-02-27  4:32 ` Waiman Long
  2014-02-27  4:32 ` Waiman Long
  1 sibling, 0 replies; 125+ messages in thread
From: Waiman Long @ 2014-02-27  4:32 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
	Peter Zijlstra
  Cc: Jeremy Fitzhardinge, Raghavendra K T, kvm, virtualization,
	Andi Kleen, Michel Lespinasse, Alok Kataria, linux-arch,
	Gleb Natapov, x86, xen-devel, Paul E. McKenney, Scott J Norton,
	Rusty Russell, Steven Rostedt, Chris Wright, Oleg Nesterov,
	Boris Ostrovsky, Aswin Chandramouleeswaran, Chegu Vinod,
	Waiman Long, linux-kernel, David Vrabel, Andrew Morton, Linu

This patch introduces a new queue spinlock implementation that can
serve as an alternative to the default ticket spinlock. Compared with
the ticket spinlock, this queue spinlock should be almost as fair as
the ticket spinlock. It has about the same single-thread speed and
can be much faster in high-contention situations. Only in light to
moderate contention, where the average queue depth is around 1-3, may
this queue spinlock be a bit slower due to the higher slowpath
overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock. Ignoring interrupt handling, a CPU can only be contending
in one spinlock at any one time. Of course, an interrupt handler can
try to acquire one spinlock while the interrupted user process is in
the process of getting another spinlock. By allocating a set of per-cpu
queue nodes and using them to form a waiting queue, we can encode the
queue node address into a much smaller 16-bit size. Together with
the lock byte, this queue spinlock implementation will only
need 4 bytes to hold all the information that it needs.

The current queue node address encoding of the 4-byte word is as
follows:
Bits 0-7  : the locked byte
Bits 8-9  : queue node index in the per-cpu array (4 entries)
Bits 10-31: cpu number + 1 (max cpus = 4M -1)
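
As a rough illustration of that layout (the helper names below are
made up for this sketch and are not part of the patch; the real
encoding and decoding are done by queue_encode_qcode() and
xlate_qcode() further down), packing and unpacking a queue code could
look like:

	/* Bits 0-7: locked byte; bits 8-9: node index; bits 10-31: cpu + 1 */
	static inline u32 qcode_pack(u32 cpu_nr, u32 qn_idx)
	{
		return ((cpu_nr + 1) << 10) | (qn_idx << 8);
	}

	static inline void qcode_unpack(u32 qcode, u32 *cpu_nr, u32 *qn_idx)
	{
		*cpu_nr = (qcode >> 10) - 1;
		*qn_idx = (qcode >> 8) & 3;
	}

For example, CPU 5 using per-cpu queue node index 2 would be encoded
as ((5 + 1) << 10) | (2 << 8) = 0x1a00, leaving the locked byte in
bits 0-7 untouched.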

In the extremely unlikely case that all the queue node entries are
used up, the current code will fall back to busy spinning without
waiting in a queue, with a warning message.

For single-thread performance (no contention), a 256K lock/unlock
loop was run on a 2.4GHz Westmere x86-64 CPU.  The following table
shows the average time (in ns) for a single lock/unlock sequence
(including the looping and timing overhead):

  Lock Type			Time (ns)
  ---------			---------
  Ticket spinlock		  14.1
  Queue spinlock		   8.8

So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention. The performance advantage is mainly
due to the fact that the ticket spinlock does a read-modify-write (add)
instruction in unlock, whereas the queue spinlock only does a simple
write in unlock, which can be much faster in a pipelined CPU.
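
To make that difference concrete, here is a simplified sketch of the
two unlock paths (the types and helper names are illustrative only,
not the actual kernel or patch code):

	struct ticket_lock_sketch {
		atomic_t head;			/* "now serving" ticket */
		atomic_t tail;			/* next ticket to hand out */
	};

	/* Ticket unlock: advance the head ticket, a locked read-modify-write */
	static inline void ticket_unlock_sketch(struct ticket_lock_sketch *lock)
	{
		atomic_inc(&lock->head);
	}

	/*
	 * Queue spinlock unlock on an architecture with atomic byte stores:
	 * clearing the lock byte is a single release store, no
	 * read-modify-write needed.
	 */
	static inline void queue_unlock_sketch(u8 *lock_byte)
	{
		smp_store_release(lock_byte, 0);
	}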

The AIM7 benchmark was run on an 8-socket 80-core DL980 with Westmere
x86-64 CPUs with XFS filesystem on a ramdisk and HT off to evaluate
the performance impact of this patch on a 3.13 kernel.

  +------------+----------+-----------------+---------+
  | Kernel     | 3.13 JPM |    3.13 with    | %Change |
  |            |          | qspinlock patch |	      |
  +------------+----------+-----------------+---------+
  |		      10-100 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   357459 |      363109     |  +1.58% |
  |dbase       |   496847 |      498801	    |  +0.39% |
  |disk        |  2925312 |     2771387     |  -5.26% |
  |five_sec    |   166612 |      169215     |  +1.56% |
  |fserver     |   382129 |      383279     |  +0.30% |
  |high_systime|    16356 |       16380     |  +0.15% |
  |short       |  4521978 |     4257363     |  -5.85% |
  +------------+----------+-----------------+---------+
  |		     200-1000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   449070 |      447711     |  -0.30% |
  |dbase       |   845029 |      853362	    |  +0.99% |
  |disk        |  2725249 |     4892907     | +79.54% |
  |five_sec    |   169410 |      170638     |  +0.72% |
  |fserver     |   489662 |      491828     |  +0.44% |
  |high_systime|   142823 |      143790     |  +0.68% |
  |short       |  7435288 |     9016171     | +21.26% |
  +------------+----------+-----------------+---------+
  |		     1100-2000 users		      |
  +------------+----------+-----------------+---------+
  |custom      |   432470 |      432570     |  +0.02% |
  |dbase       |   889289 |      890026	    |  +0.08% |
  |disk        |  2565138 |     5008732     | +95.26% |
  |five_sec    |   169141 |      170034     |  +0.53% |
  |fserver     |   498569 |      500701     |  +0.43% |
  |high_systime|   229913 |      245866     |  +6.94% |
  |short       |  8496794 |     8281918     |  -2.53% |
  +------------+----------+-----------------+---------+

The workload with the most gain was the disk workload. Without the
patch, the perf profile at 1500 users looked like:

 26.19%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--47.28%-- evict
              |--46.87%-- inode_sb_list_add
              |--1.24%-- xlog_cil_insert_items
              |--0.68%-- __remove_inode_hash
              |--0.67%-- inode_wait_for_writeback
               --3.26%-- [...]
 22.96%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  5.56%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  4.87%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.04%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.30%    reaim  [kernel.kallsyms]  [k] memcpy
  1.08%    reaim  [unknown]          [.] 0x0000003c52009447

There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.

With the patch, the perf profile at 1500 users became:

 26.82%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  4.66%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  3.97%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.40%    reaim  [kernel.kallsyms]  [k] queue_spin_lock_slowpath
              |--88.31%-- _raw_spin_lock
              |          |--36.02%-- inode_sb_list_add
              |          |--35.09%-- evict
              |          |--16.89%-- xlog_cil_insert_items
              |          |--6.30%-- try_to_wake_up
              |          |--2.20%-- _xfs_buf_find
              |          |--0.75%-- __remove_inode_hash
              |          |--0.72%-- __mutex_lock_slowpath
              |          |--0.53%-- load_balance
              |--6.02%-- _raw_spin_lock_irqsave
              |          |--74.75%-- down_trylock
              |          |--9.69%-- rcu_check_quiescent_state
              |          |--7.47%-- down
              |          |--3.57%-- up
              |          |--1.67%-- rwsem_wake
              |          |--1.00%-- remove_wait_queue
              |          |--0.56%-- pagevec_lru_move_fn
              |--5.39%-- _raw_spin_lock_irq
              |          |--82.05%-- rwsem_down_read_failed
              |          |--10.48%-- rwsem_down_write_failed
              |          |--4.24%-- __down
              |          |--2.74%-- __schedule
               --0.28%-- [...]
  2.20%    reaim  [kernel.kallsyms]  [k] memcpy
  1.84%    reaim  [unknown]          [.] 0x000000000041517b
  1.77%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--21.08%-- xlog_cil_insert_items
              |--10.14%-- xfs_icsb_modify_counters
              |--7.20%-- xfs_iget_cache_hit
              |--6.56%-- inode_sb_list_add
              |--5.49%-- _xfs_buf_find
              |--5.25%-- evict
              |--5.03%-- __remove_inode_hash
              |--4.64%-- __mutex_lock_slowpath
              |--3.78%-- selinux_inode_free_security
              |--2.95%-- xfs_inode_is_filestream
              |--2.35%-- try_to_wake_up
              |--2.07%-- xfs_inode_set_reclaim_tag
              |--1.52%-- list_lru_add
              |--1.16%-- xfs_inode_clear_eofblocks_tag
		  :
  1.30%    reaim  [kernel.kallsyms]  [k] effective_load
  1.27%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.10%    reaim  [kernel.kallsyms]  [k] security_compute_sid

On the ext4 filesystem, the disk workload improved from 416281 JPM
to 899101 JPM (+116%) with the patch. In this case, the contended
spinlock is the mb_cache_spinlock.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/asm-generic/qspinlock.h       |  122 ++++++++++
 include/asm-generic/qspinlock_types.h |   55 +++++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  393 +++++++++++++++++++++++++++++++++
 5 files changed, 578 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..08da60f
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & _QSPINLOCK_LOCKED;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !(atomic_read(&lock.qlcode) & _QSPINLOCK_LOCKED);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->qlcode) & ~_QSPINLOCK_LOCK_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!atomic_read(&lock->qlcode) &&
+	   (atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	int qsval;
+
+	/*
+	 * To reduce memory access to only once for the cold cache case,
+	 * a direct cmpxchg() is performed in the fastpath to optimize the
+	 * uncontended case. The contended performance, however, may suffer
+	 * a bit because of that.
+	 */
+	qsval = atomic_cmpxchg(&lock->qlcode, 0, _QSPINLOCK_LOCKED);
+	if (likely(qsval == 0))
+		return;
+	queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	/*
+	 * Use an atomic subtraction to clear the lock bit.
+	 */
+	smp_mb__before_atomic_dec();
+	atomic_sub(_QSPINLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializer
+ */
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queue_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l)	queue_spin_value_unlocked(l)
+#define arch_spin_lock(l)		queue_spin_lock(l)
+#define arch_spin_trylock(l)		queue_spin_trylock(l)
+#define arch_spin_unlock(l)		queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)	queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
new file mode 100644
index 0000000..df981d0
--- /dev/null
+++ b/include/asm-generic/qspinlock_types.h
@@ -0,0 +1,55 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file inclusion via paravirt_types.h. A workaround is
+ * to include paravirt_types.h here in this case.
+ */
+#ifdef CONFIG_PARAVIRT
+# include <asm/paravirt_types.h>
+#else
+# include <linux/types.h>
+# include <linux/atomic.h>
+#endif
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * For NR_CPUS >= 16K, the bit assignment is:
+ *   Bit  0   : Set if locked
+ *   Bits 1-7 : Not used
+ *   Bits 8-31: Queue code
+ *
+ * For NR_CPUS < 16K, the bit assignment is:
+ *   Bit   0   : Set if locked
+ *   Bits  1-7 : Not used
+ *   Bits  8-15: Reserved for architecture specific optimization
+ *   Bits 16-31: Queue code
+ */
+typedef struct qspinlock {
+	atomic_t	qlcode;	/* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET		8
+#define _QSPINLOCK_LOCKED	1U
+#define	_QSPINLOCK_LOCK_MASK	0xff
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..f185584 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+	bool
+
+config QUEUE_SPINLOCK
+	def_bool y if ARCH_USE_QUEUE_SPINLOCK
+	depends on SMP && !PARAVIRT_SPINLOCKS
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..e3b3293 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
 obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
 obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
new file mode 100644
index 0000000..ed5efa7
--- /dev/null
+++ b/kernel/locking/qspinlock.c
@@ -0,0 +1,393 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1   (4M - 1 CPUs)
+ *
+ * The 16-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-15: CPU number + 1   (16K - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * As the value 0 cannot be used as a valid CPU number, we need to add
+ * 1 to it before putting it into the queue code.
+ */
+#define MAX_QNODES		4
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET	_QCODE_OFFSET
+#endif
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. This structure is retained for future extension
+ * where new fields may be added.
+ */
+struct qnode {
+	u32		 wait;		/* Waiting flag		*/
+	struct qnode	*next;		/* Next queue node addr */
+};
+
+struct qnode_set {
+	struct qnode	nodes[MAX_QNODES];
+	int		node_idx;	/* Current node to use */
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { {{0}}, 0 };
+
+/*
+ ************************************************************************
+ * The following optimized code is for architectures that support:	*
+ *  1) Atomic byte and short data write					*
+ *  2) Byte and short data exchange and compare-exchange instructions	*
+ *									*
+ * For those architectures, their asm/qspinlock.h header file should	*
+ * define the following in order to use the optimized code.		*
+ *  1) The _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS macro			*
+ *  2) A smp_u8_store_release() macro for byte size store operation	*
+ *  3) A "union arch_qspinlock" structure that includes the individual	*
+ *     fields of the qspinlock structure, including:			*
+ *      o slock - the qspinlock structure				*
+ *      o lock  - the lock byte						*
+ *									*
+ ************************************************************************
+ */
+#ifdef _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
+/**
+ * queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	if (!ACCESS_ONCE(qlock->lock) &&
+	   (cmpxchg(&qlock->lock, 0, _QSPINLOCK_LOCKED) == 0))
+		return 1;
+	return 0;
+}
+#else /*  _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS  */
+/*
+ * Generic functions for architectures that do not support atomic
+ * byte or short data types.
+ */
+/**
+ * queue_spin_setlock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ */
+static __always_inline int queue_spin_setlock(struct qspinlock *lock)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+			return 1;
+	return 0;
+}
+#endif /* _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS */
+
+/*
+ ************************************************************************
+ * Inline functions used by the queue_spin_lock_slowpath() function	*
+ * that may get superseded by a more optimized version.			*
+ ************************************************************************
+ */
+
+#ifndef queue_get_lock_qcode
+/**
+ * queue_get_lock_qcode - get the lock & qcode values
+ * @lock  : Pointer to queue spinlock structure
+ * @qcode : Pointer to the returned qcode value
+ * @mycode: My qcode value (not used)
+ * Return : > 0 if lock is not available, = 0 if lock is free
+ */
+static inline int
+queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode, u32 mycode)
+{
+	int qlcode = atomic_read(&lock->qlcode);
+
+	*qcode = qlcode;
+	return qlcode & _QSPINLOCK_LOCKED;
+}
+#endif /* queue_get_lock_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+	return atomic_cmpxchg(&lock->qlcode, qcode, _QSPINLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+#ifndef queue_encode_qcode
+/**
+ * queue_encode_qcode - Encode the CPU number & node index into a qnode code
+ * @cpu_nr: CPU number
+ * @qn_idx: Queue node index
+ * Return : A qnode code that can be saved into the qspinlock structure
+ *
+ * The lock bit is set in the encoded 32-bit value as the need to encode
+ * a qnode means that the lock should have been taken.
+ */
+static u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx)
+{
+	return ((cpu_nr + 1) << (_QCODE_VAL_OFFSET + 2)) |
+		(qn_idx << _QCODE_VAL_OFFSET) | _QSPINLOCK_LOCKED;
+}
+#endif /* queue_encode_qcode */
+
+/*
+ ************************************************************************
+ * Other inline functions needed by the queue_spin_lock_slowpath()	*
+ * function.								*
+ ************************************************************************
+ */
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+	u32 cpu_nr = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+	u8  qn_idx = (qcode >> _QCODE_VAL_OFFSET) & 3;
+
+	return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+/**
+ * get_qnode - Get a queue node address
+ * @qn_idx: Pointer to queue node index [out]
+ * Return : queue node address & queue node index in qn_idx, or NULL if
+ *	    no free queue node available.
+ */
+static struct qnode *get_qnode(unsigned int *qn_idx)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+	int i;
+
+	if (unlikely(qset->node_idx >= MAX_QNODES))
+		return NULL;
+	i = qset->node_idx++;
+	*qn_idx = i;
+	return &qset->nodes[i];
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static void put_qnode(void)
+{
+	struct qnode_set *qset = this_cpu_ptr(&qnset);
+
+	qset->node_idx--;
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+	unsigned int cpu_nr, qn_idx;
+	struct qnode *node, *next;
+	u32 prev_qcode, my_qcode;
+
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = get_qnode(&qn_idx);
+
+	/*
+	 * It should never happen that all the queue nodes are being used.
+	 */
+	BUG_ON(!node);
+
+	/*
+	 * Set up the new cpu code to be exchanged
+	 */
+	my_qcode = queue_encode_qcode(cpu_nr, qn_idx);
+
+	/*
+	 * Initialize the queue node
+	 */
+	node->wait = true;
+	node->next = NULL;
+
+	/*
+	 * The lock may be available at this point, try again if no task was
+	 * waiting in the queue.
+	 */
+	if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) {
+		put_qnode();
+		return;
+	}
+
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	prev_qcode = atomic_xchg(&lock->qlcode, my_qcode);
+	/*
+	 * It is possible that we may accidentally steal the lock. If this is
+	 * the case, we need to either release it if not the head of the queue
+	 * or get the lock and be done with it.
+	 */
+	if (unlikely(!(prev_qcode & _QSPINLOCK_LOCKED))) {
+		if (prev_qcode == 0) {
+			/*
+			 * Got the lock since it is at the head of the queue
+			 * Now try to atomically clear the queue code.
+			 */
+			if (atomic_cmpxchg(&lock->qlcode, my_qcode,
+					  _QSPINLOCK_LOCKED) == my_qcode)
+				goto release_node;
+			/*
+			 * The cmpxchg fails only if one or more tasks
+			 * are added to the queue. In this case, we need to
+			 * notify the next one to be the head of the queue.
+			 */
+			goto notify_next;
+		}
+		/*
+		 * Accidentally steal the lock, release the lock and
+		 * let the queue head get it.
+		 */
+		queue_spin_unlock(lock);
+	} else
+		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
+	my_qcode &= ~_QSPINLOCK_LOCKED;
+
+	if (prev_qcode) {
+		/*
+		 * Not at the queue head, get the address of the previous node
+		 * and set up the "next" field of that node.
+		 */
+		struct qnode *prev = xlate_qcode(prev_qcode);
+
+		ACCESS_ONCE(prev->next) = node;
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (smp_load_acquire(&node->wait))
+			arch_mutex_cpu_relax();
+	}
+
+	/*
+	 * At the head of the wait queue now
+	 */
+	while (true) {
+		u32 qcode;
+		int retval;
+
+		retval = queue_get_lock_qcode(lock, &qcode, my_qcode);
+		if (retval > 0)
+			;	/* Lock not available yet */
+		else if (retval < 0)
+			/* Lock taken, can release the node & return */
+			goto release_node;
+		else if (qcode != my_qcode) {
+			/*
+			 * Just get the lock with other spinners waiting
+			 * in the queue.
+			 */
+			if (queue_spin_setlock(lock))
+				goto notify_next;
+		} else {
+			/*
+			 * Get the lock & clear the queue code simultaneously
+			 */
+			if (queue_spin_trylock_and_clr_qcode(lock, qcode))
+				/* No need to notify the next one */
+				goto release_node;
+		}
+		arch_mutex_cpu_relax();
+	}
+
+notify_next:
+	/*
+	 * Wait, if needed, until the next one in queue sets up the next field
+	 */
+	while (!(next = ACCESS_ONCE(node->next)))
+		arch_mutex_cpu_relax();
+	/*
+	 * The next one in queue is now at the head
+	 */
+	smp_store_release(&next->wait, false);
+
+release_node:
+	put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

end of thread, other threads:[~2014-03-05 20:59 UTC | newest]

Thread overview: 125+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-26 15:14 [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with PV support Waiman Long
2014-02-26 15:14 ` [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
2014-02-26 16:22   ` Peter Zijlstra
2014-02-26 16:22   ` Peter Zijlstra
2014-02-27 20:25     ` Waiman Long
2014-02-27 20:25     ` Waiman Long
2014-02-26 16:24   ` Peter Zijlstra
2014-02-26 16:24   ` Peter Zijlstra
2014-02-27 20:25     ` Waiman Long
2014-02-27 20:25     ` Waiman Long
2014-02-26 15:14 ` Waiman Long
2014-02-26 15:14 ` [PATCH v5 2/8] qspinlock, x86: Enable x86-64 to use queue spinlock Waiman Long
2014-02-26 15:14 ` Waiman Long
2014-02-26 15:14 ` [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks Waiman Long
2014-02-26 16:20   ` Peter Zijlstra
2014-02-26 16:20   ` Peter Zijlstra
2014-02-27 20:42     ` Waiman Long
2014-02-28  9:29       ` Peter Zijlstra
2014-02-28  9:29       ` Peter Zijlstra
2014-02-28 16:25         ` Linus Torvalds
2014-02-28 17:37           ` Peter Zijlstra
2014-02-28 17:37           ` Peter Zijlstra
2014-02-28 16:25         ` Linus Torvalds
2014-02-28 16:38         ` Waiman Long
2014-02-28 16:38         ` Waiman Long
2014-02-28 17:56           ` Peter Zijlstra
2014-02-28 17:56           ` Peter Zijlstra
2014-03-03 17:43           ` Peter Zijlstra
2014-03-03 17:43           ` Peter Zijlstra
2014-03-04 15:27             ` Waiman Long
2014-03-04 15:27             ` Waiman Long
2014-03-04 16:58             ` Peter Zijlstra
2014-03-04 18:09               ` Peter Zijlstra
2014-03-04 18:09               ` Peter Zijlstra
2014-03-04 16:58             ` Peter Zijlstra
2014-03-04 17:48             ` Waiman Long
2014-03-04 17:48             ` Waiman Long
2014-03-04 22:40               ` Peter Zijlstra
2014-03-05 20:59                 ` Peter Zijlstra
2014-03-05 20:59                 ` Peter Zijlstra
2014-03-04 22:40               ` Peter Zijlstra
2014-02-27 20:42     ` Waiman Long
2014-02-26 15:14 ` Waiman Long
2014-02-26 15:14 ` [PATCH RFC v5 4/8] pvqspinlock, x86: Allow unfair spinlock in a real PV environment Waiman Long
2014-02-26 15:14 ` Waiman Long
2014-02-26 17:07   ` Konrad Rzeszutek Wilk
2014-02-28 17:06     ` Waiman Long
2014-02-28 17:06     ` Waiman Long
2014-03-03 10:55       ` Paolo Bonzini
2014-03-04 15:15         ` Waiman Long
2014-03-04 15:15         ` Waiman Long
2014-03-04 15:23           ` Paolo Bonzini
2014-03-04 15:23           ` Paolo Bonzini
2014-03-04 15:39           ` David Vrabel
2014-03-04 15:39           ` David Vrabel
2014-03-04 17:50           ` Raghavendra K T
2014-03-04 17:50           ` Raghavendra K T
2014-03-03 10:55       ` Paolo Bonzini
2014-02-26 17:07   ` Konrad Rzeszutek Wilk
2014-02-27 12:28   ` David Vrabel
2014-02-27 19:40     ` Waiman Long
2014-02-27 19:40     ` Waiman Long
2014-02-27 12:28   ` David Vrabel
2014-02-26 15:14 ` [PATCH RFC v5 5/8] pvqspinlock, x86: Enable unfair queue spinlock in a KVM guest Waiman Long
2014-02-26 15:14 ` Waiman Long
2014-02-26 17:08   ` Konrad Rzeszutek Wilk
2014-02-26 17:08   ` Konrad Rzeszutek Wilk
2014-02-28 17:08     ` Waiman Long
2014-02-28 17:08     ` Waiman Long
2014-02-27  9:41   ` Paolo Bonzini
2014-02-27 19:05     ` Waiman Long
2014-02-27 19:05     ` Waiman Long
2014-02-27  9:41   ` Paolo Bonzini
2014-02-27 10:40   ` Raghavendra K T
2014-02-27 10:40   ` Raghavendra K T
2014-02-27 19:12     ` Waiman Long
2014-02-27 19:12     ` Waiman Long
2014-02-26 15:14 ` [PATCH RFC v5 6/8] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled Waiman Long
2014-02-26 15:14 ` Waiman Long
2014-02-26 15:14 ` [PATCH RFC v5 7/8] pvqspinlock, x86: Add qspinlock para-virtualization support Waiman Long
2014-02-26 17:54   ` Konrad Rzeszutek Wilk
2014-02-26 17:54   ` Konrad Rzeszutek Wilk
2014-02-27 12:11   ` David Vrabel
2014-02-27 13:11     ` Paolo Bonzini
2014-02-27 14:18       ` David Vrabel
2014-02-27 14:18       ` David Vrabel
2014-02-27 14:45         ` Paolo Bonzini
2014-02-27 15:22           ` Raghavendra K T
2014-02-27 15:50             ` Paolo Bonzini
2014-02-27 15:50             ` Paolo Bonzini
2014-03-03 11:06               ` [Xen-devel] " David Vrabel
2014-03-03 11:06               ` David Vrabel
2014-02-27 20:50             ` Waiman Long
2014-02-27 20:50             ` Waiman Long
2014-02-27 15:22           ` Raghavendra K T
2014-02-27 19:42           ` Waiman Long
2014-02-27 19:42           ` Waiman Long
2014-02-27 14:45         ` Paolo Bonzini
2014-02-27 13:11     ` Paolo Bonzini
2014-02-27 12:11   ` David Vrabel
2014-02-26 15:14 ` Waiman Long
2014-02-26 15:14 ` [PATCH RFC v5 8/8] pvqspinlock, x86: Enable KVM to use qspinlock's PV support Waiman Long
2014-02-26 15:14 ` Waiman Long
2014-02-27  9:31   ` Paolo Bonzini
2014-02-27  9:31   ` Paolo Bonzini
2014-02-27 18:36     ` Waiman Long
2014-02-27 18:36     ` Waiman Long
2014-02-26 17:00 ` [PATCH v5 0/8] qspinlock: a 4-byte queue spinlock with " Konrad Rzeszutek Wilk
2014-02-28 16:56   ` Waiman Long
2014-02-28 16:56   ` Waiman Long
2014-02-26 17:00 ` Konrad Rzeszutek Wilk
2014-02-26 22:26 ` Paul E. McKenney
2014-02-26 22:26 ` Paul E. McKenney
2014-02-27  4:32 Waiman Long
2014-02-27  4:32 ` [PATCH v5 1/8] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
2014-02-27  4:32 ` Waiman Long
2014-03-02 13:12   ` Oleg Nesterov
2014-03-04 14:46     ` Waiman Long
2014-03-04 14:46     ` Waiman Long
2014-03-04 14:46       ` Waiman Long
2014-03-02 13:12   ` Oleg Nesterov
2014-03-02 13:31   ` Oleg Nesterov
2014-03-04 14:58     ` Waiman Long
2014-03-04 14:58       ` Waiman Long
2014-03-04 14:58     ` Waiman Long
2014-03-02 13:31   ` Oleg Nesterov
