linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
@ 2017-01-03 18:00 Waiman Long
  2017-01-03 18:00 ` [RFC PATCH 1/7] locking/spinlock: Remove the unused spin_lock_bh_nested API Waiman Long
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-03 18:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-kernel, Waiman Long

This patchset introduces a new variant of queued spinlocks - the
realtime queued spinlock. The purpose of this new variant is to
support spinlocks in a realtime environment where high-priority
RT tasks should be allowed to complete their work ASAP. This means
as little waiting time for spinlocks as possible.

Non-RT tasks will wait for spinlocks in the MCS waiting queue as
usual. RT tasks and interrupts will spin directly on the spinlocks
and use the priority value in the pending byte to arbitrate who gets
the lock first.
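
As a rough reference, the 32-bit qspinlock word is reused as follows
(a simplified sketch; bit positions follow the existing qspinlock
layout for NR_CPUS < 16K):

    /*
     *  bits  0- 7: locked byte
     *  bits  8-15: pending byte - normally a single pending bit, here
     *              it holds the rt_priority of the highest priority
     *              RT waiter spinning on the lock
     *  bits 16-31: tail (CPU + index) of the MCS wait queue used by
     *              the non-RT waiters
     */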

Patch 1 removes the unused spin_lock_bh_nested() API.

Patch 2 introduces the basic realtime queued spinlocks where the
pending byte is used for storing the priority of the highest priority
RT task that is waiting on the spinlock. All the RT tasks will spin
directly on the spinlock instead of waiting in the queue.

Patch 3 moves all interrupt context lock waiters to RT spinning.

Patch 4 overrides the spin_lock_nested() call with special code to
enable RT lock spinning for nested spinlocks.

Patch 5 handles priority boosting by periodically checking the task's
priority, unqueuing from the waiting queue and doing RT spinning if
applicable.

Patch 6 allows voluntary CPU preemption to happen when a CPU is
waiting for a spinlock.

Patch 7 enables event counts to be collected by the qspinlock stat
package so that we can monitor what has happened within the kernel.

With a locking microbenchmark running on a 2-socket 36-core E5-2699
v3 system, the elapsed times to complete 2M locking loops per non-RT
thread were as follows:

   # of threads   qspinlock   rt-qspinlock  % change
   ------------   ---------   ------------  --------
        2           0.29s        1.97s       +580%
        3           1.46s        2.05s        +40%
        4           1.81s        2.38s        +31%
        5           2.36s        2.87s        +22%
        6           2.73s        3.58s        +31%
        7           3.17s        3.74s        +18%
        8           3.67s        4.70s        +28%
        9           3.89s        5.28s        +36%
       10           4.35s        6.58s        +51%

The RT qspinlock is slower than the non-RT qspinlock, which is expected.

This patchset doesn't yet include any patch to modify the call sites
of spin_lock_nested() to pass in the outer lock of the nested spinlock
pair. That will be included in a later version of this patchset once
it is determined that RT qspinlocks are worth pursuing.

Only minimal testing to build and boot the patched kernel was
done. More extensive testing will be done with later versions of
this patchset.

Waiman Long (7):
  locking/spinlock: Remove the unused spin_lock_bh_nested API
  locking/rtqspinlock: Introduce realtime queued spinlocks
  locking/rtqspinlock: Use static RT priority when in interrupt context
  locking/rtqspinlock: Override spin_lock_nested with special RT variants
  locking/rtqspinlock: Handle priority boosting
  locking/rtqspinlock: Voluntarily yield CPU when need_sched()
  locking/rtqspinlock: Enable collection of event counts

 arch/x86/Kconfig                 |  18 +-
 include/linux/spinlock.h         |  43 +++-
 include/linux/spinlock_api_smp.h |   9 +-
 include/linux/spinlock_api_up.h  |   1 -
 kernel/Kconfig.locks             |   9 +
 kernel/locking/qspinlock.c       |  51 +++-
 kernel/locking/qspinlock_rt.h    | 543 +++++++++++++++++++++++++++++++++++++++
 kernel/locking/qspinlock_stat.h  |  81 +++++-
 kernel/locking/spinlock.c        |   8 -
 9 files changed, 721 insertions(+), 42 deletions(-)
 create mode 100644 kernel/locking/qspinlock_rt.h

-- 
1.8.3.1


* [RFC PATCH 1/7] locking/spinlock: Remove the unused spin_lock_bh_nested API
  2017-01-03 18:00 [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Waiman Long
@ 2017-01-03 18:00 ` Waiman Long
  2017-01-03 18:00 ` [RFC PATCH 2/7] locking/rtqspinlock: Introduce realtime queued spinlocks Waiman Long
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-03 18:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-kernel, Waiman Long

The spin_lock_bh_nested() API is defined but not used anywhere in
the kernel, so spin_lock_bh_nested() and its related APIs are now
removed.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/spinlock.h         | 8 --------
 include/linux/spinlock_api_smp.h | 2 --
 include/linux/spinlock_api_up.h  | 1 -
 kernel/locking/spinlock.c        | 8 --------
 4 files changed, 19 deletions(-)

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 47dd0ce..59248dc 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -180,8 +180,6 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define raw_spin_lock_nested(lock, subclass) \
 	_raw_spin_lock_nested(lock, subclass)
-# define raw_spin_lock_bh_nested(lock, subclass) \
-	_raw_spin_lock_bh_nested(lock, subclass)
 
 # define raw_spin_lock_nest_lock(lock, nest_lock)			\
 	 do {								\
@@ -197,7 +195,6 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 # define raw_spin_lock_nested(lock, subclass)		\
 	_raw_spin_lock(((void)(subclass), (lock)))
 # define raw_spin_lock_nest_lock(lock, nest_lock)	_raw_spin_lock(lock)
-# define raw_spin_lock_bh_nested(lock, subclass)	_raw_spin_lock_bh(lock)
 #endif
 
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
@@ -317,11 +314,6 @@ static __always_inline int spin_trylock(spinlock_t *lock)
 	raw_spin_lock_nested(spinlock_check(lock), subclass);	\
 } while (0)
 
-#define spin_lock_bh_nested(lock, subclass)			\
-do {								\
-	raw_spin_lock_bh_nested(spinlock_check(lock), subclass);\
-} while (0)
-
 #define spin_lock_nest_lock(lock, nest_lock)				\
 do {									\
 	raw_spin_lock_nest_lock(spinlock_check(lock), nest_lock);	\
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 5344268..42dfab8 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -22,8 +22,6 @@
 void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)		__acquires(lock);
 void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
 								__acquires(lock);
-void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass)
-								__acquires(lock);
 void __lockfunc
 _raw_spin_lock_nest_lock(raw_spinlock_t *lock, struct lockdep_map *map)
 								__acquires(lock);
diff --git a/include/linux/spinlock_api_up.h b/include/linux/spinlock_api_up.h
index d3afef9..d0d1888 100644
--- a/include/linux/spinlock_api_up.h
+++ b/include/linux/spinlock_api_up.h
@@ -57,7 +57,6 @@
 
 #define _raw_spin_lock(lock)			__LOCK(lock)
 #define _raw_spin_lock_nested(lock, subclass)	__LOCK(lock)
-#define _raw_spin_lock_bh_nested(lock, subclass) __LOCK(lock)
 #define _raw_read_lock(lock)			__LOCK(lock)
 #define _raw_write_lock(lock)			__LOCK(lock)
 #define _raw_spin_lock_bh(lock)			__LOCK_BH(lock)
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index db3ccb1..4b082b5 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -363,14 +363,6 @@ void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
 }
 EXPORT_SYMBOL(_raw_spin_lock_nested);
 
-void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass)
-{
-	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
-	spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
-	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
-}
-EXPORT_SYMBOL(_raw_spin_lock_bh_nested);
-
 unsigned long __lockfunc _raw_spin_lock_irqsave_nested(raw_spinlock_t *lock,
 						   int subclass)
 {
-- 
1.8.3.1


* [RFC PATCH 2/7] locking/rtqspinlock: Introduce realtime queued spinlocks
  2017-01-03 18:00 [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Waiman Long
  2017-01-03 18:00 ` [RFC PATCH 1/7] locking/spinlock: Remove the unused spin_lock_bh_nested API Waiman Long
@ 2017-01-03 18:00 ` Waiman Long
  2017-01-03 18:00 ` [RFC PATCH 3/7] locking/rtqspinlock: Use static RT priority when in interrupt context Waiman Long
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-03 18:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-kernel, Waiman Long

Realtime queued spinlock is a variant of the queued spinlock that is
suitable for use in a realtime environment where the highest
priority task should always get its work done as soon as possible. That
means a minimal wait for spinlocks. To make that happen, RT tasks
will not wait in the MCS wait queue of the queued spinlocks. Instead,
they will spin directly on the lock and put the RT priority value of
the highest priority RT task into the pending byte of the lock.

With the assumption that the number of RT tasks actively running on
a system is relatively small, the performance overhead of such an RT
variant should be small too.

Realtime queued spinlock is currently mutually exclusive with
paravirtual spinlocks.
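
Below is a simplified sketch of the RT spinning loop that the new
rt_spin_trylock() in this patch implements (the priority translation,
cmpxchg retry handling and exact memory ordering are omitted):

    for (;;) {
        u16 lockpend = READ_ONCE(l->locked_pending);
        u8  pdprio   = lockpend >> _Q_PENDING_OFFSET;

        if (prio < pdprio) {
            cpu_relax();        /* a higher priority RT waiter is present */
            continue;
        }
        if (!(lockpend & _Q_LOCKED_MASK) &&
            cmpxchg_acquire(&l->locked_pending, lockpend,
                            (pdprio == prio) ? _Q_LOCKED_VAL
                            : (lockpend | _Q_LOCKED_VAL)) == lockpend)
            break;              /* lock acquired, pending prio cleared if ours */
        if (pdprio < prio)      /* publish our priority in the pending byte */
            cmpxchg_relaxed(&l->pending, pdprio, prio);
        cpu_relax();
    }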

Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/x86/Kconfig              |   2 +-
 kernel/Kconfig.locks          |   9 +++
 kernel/locking/qspinlock.c    |  34 +++++++++-
 kernel/locking/qspinlock_rt.h | 145 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 186 insertions(+), 4 deletions(-)
 create mode 100644 kernel/locking/qspinlock_rt.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e487493..7a97b31 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -723,7 +723,7 @@ config PARAVIRT_DEBUG
 
 config PARAVIRT_SPINLOCKS
 	bool "Paravirtualization layer for spinlocks"
-	depends on PARAVIRT && SMP
+	depends on PARAVIRT && SMP && !REALTIME_QUEUED_SPINLOCKS
 	---help---
 	  Paravirtualized spinlocks allow a pvops backend to replace the
 	  spinlock implementation with something virtualization-friendly
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 84d882f..93f5bb5 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -248,3 +248,12 @@ config ARCH_USE_QUEUED_RWLOCKS
 config QUEUED_RWLOCKS
 	def_bool y if ARCH_USE_QUEUED_RWLOCKS
 	depends on SMP
+
+config REALTIME_QUEUED_SPINLOCKS
+	bool "Realtime queued spinlocks"
+	default n
+	depends on QUEUED_SPINLOCKS && PREEMPT && NR_CPUS < 16384
+	help
+	  This enables support for realtime environment where tasks with
+	  realtime priority will get the spinlocks ASAP instead of waiting
+	  in a queue.
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b2caec7..cb5c2e5 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -268,6 +268,19 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 #endif
 
 /*
+ * Realtime queued spinlock is mutually exclusive with native and PV spinlocks.
+ */
+#ifdef CONFIG_REALTIME_QUEUED_SPINLOCKS
+#include "qspinlock_rt.h"
+#else
+static __always_inline u32 rt_wait_head_or_retry(struct qspinlock *lock)
+						{ return 0; }
+#define rt_pending(v)		0
+#define rt_enabled()		false
+#define rt_spin_trylock(l)	false
+#endif
+
+/*
  * Various notes on spin_is_locked() and spin_unlock_wait(), which are
  * 'interesting' functions:
  *
@@ -415,7 +428,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
-	if (pv_enabled())
+	if (pv_enabled() || rt_enabled())
 		goto queue;
 
 	if (virt_spin_lock(lock))
@@ -490,6 +503,9 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * queuing.
 	 */
 queue:
+	if (rt_spin_trylock(lock))
+		return;
+
 	node = this_cpu_ptr(&mcs_nodes[0]);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
@@ -573,6 +589,13 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
+	/*
+	 * The RT rt_wait_head_or_lock function, if active, will acquire the
+	 * lock and return a non-zero value.
+	 */
+	if ((val = rt_wait_head_or_retry(lock)))
+		goto locked;
+
 	val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK));
 
 locked:
@@ -587,7 +610,9 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * to grab the lock.
 	 */
 	for (;;) {
-		/* In the PV case we might already have _Q_LOCKED_VAL set */
+		/*
+		 * In the PV & RT cases we might already have _Q_LOCKED_VAL set.
+		 */
 		if ((val & _Q_TAIL_MASK) != tail) {
 			set_locked(lock);
 			break;
@@ -596,8 +621,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		 * The smp_cond_load_acquire() call above has provided the
 		 * necessary acquire semantics required for locking. At most
 		 * two iterations of this loop may be ran.
+		 *
+		 * In the RT case, the pending byte needs to be preserved.
 		 */
-		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
+		old = atomic_cmpxchg_relaxed(&lock->val, val,
+					     rt_pending(val) | _Q_LOCKED_VAL);
 		if (old == val)
 			goto release;	/* No contention */
 
diff --git a/kernel/locking/qspinlock_rt.h b/kernel/locking/qspinlock_rt.h
new file mode 100644
index 0000000..69513d6
--- /dev/null
+++ b/kernel/locking/qspinlock_rt.h
@@ -0,0 +1,145 @@
+/*
+ * Realtime queued spinlocks
+ * -------------------------
+ *
+ * By Waiman Long <longman@redhat.com>
+ *
+ * This is a variant of queued spinlocks that is designed to meet the
+ * requirement of a realtime environment. Tasks with realtime priority will
+ * spin on the lock instead of waiting in the queue like the other non-RT
+ * tasks. Those RT tasks make use of the pending byte to store the rt_priority
+ * of the highest priority task that is currently spinning. That task will
+ * then acquire the lock and reset the pending priority if set previously when
+ * it becomes free effectively jumping the queue ahead of the other lower
+ * priority RT tasks as well as non-RT tasks. The other spinning RT tasks
+ * should then bid to set this pending byte to their rt_priority level again.
+ *
+ * Assuming that the number of RT tasks in a system is limited, the
+ * performance overhead of RT tasks spinning on the lock should be small.
+ *
+ * As RT qspinlock needs the whole pending byte, it cannot be used on kernel
+ * configured to support 16K or more CPUs (CONFIG_NR_CPUS).
+ */
+#include <linux/sched.h>
+
+/*
+ * =========================[ Helper Functions ]=========================
+ */
+
+/*
+ * Translate the priority of a task to an equivalent RT priority
+ */
+static u8 rt_task_priority(struct task_struct *task)
+{
+	int prio = READ_ONCE(task->prio);
+
+	return (prio >= MAX_RT_PRIO) ? 0 : (u8)(MAX_RT_PRIO - prio);
+}
+
+/*
+ * ====================[ Functions Used by qspinlock.c ]====================
+ */
+
+static inline bool rt_enabled(void)
+{
+	return true;
+}
+
+/*
+ * Return the pending byte portion of the integer value of the lock.
+ */
+static inline int rt_pending(int val)
+{
+	return val & _Q_PENDING_MASK;
+}
+
+/*
+ * Return: true if lock acquired
+ *	   false if queuing in the MCS wait queue is needed.
+ */
+static bool rt_spin_trylock(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+	u8 prio = rt_task_priority(current);
+
+	BUILD_BUG_ON(_Q_PENDING_BITS != 8);
+
+	if (!prio)
+		return false;
+
+	/*
+	 * Spin on the lock and try to set its priority into the pending byte.
+	 */
+	for (;;) {
+		u16 lockpend = READ_ONCE(l->locked_pending);
+		u8  pdprio = (u8)(lockpend >> _Q_PENDING_OFFSET);
+
+		if (prio < pdprio) {
+			/*
+			 * Higher priority task present, one more cpu_relax()
+			 * before the next attempt.
+			 */
+			cpu_relax();
+			goto next;
+		}
+
+		if (!(lockpend & _Q_LOCKED_MASK)) {
+			u16 old = lockpend;
+			u16 new = (pdprio == prio)
+				? _Q_LOCKED_VAL : (lockpend | _Q_LOCKED_VAL);
+
+			/*
+			 * Lock is free and priority <= prio, try to acquire
+			 * the lock and clear the priority if it matches my
+			 * prio.
+			 */
+			lockpend = cmpxchg_acquire(&l->locked_pending,
+						   old, new);
+			if (old == lockpend)
+				break;
+
+			pdprio = (u8)(lockpend >> _Q_PENDING_OFFSET);
+		}
+
+		if (pdprio < prio)
+			cmpxchg_relaxed(&l->pending, pdprio, prio);
+next:
+		cpu_relax();
+	}
+	return true;
+}
+
+/*
+ * We need to make the non-RT tasks wait longer if RT tasks are spinning for
+ * the lock. This is done to reduce the chance that a non-RT task may
+ * accidentally grab the lock away from the RT tasks in the short interval
+ * where the pending priority may be reset after an RT task acquires the lock.
+ *
+ * Return: Current value of the lock.
+ */
+static u32 rt_wait_head_or_retry(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	for (;;) {
+		u16 lockpend = READ_ONCE(l->locked_pending);
+
+		if (!lockpend) {
+			lockpend = cmpxchg_acquire(&l->locked_pending, 0,
+						   _Q_LOCKED_VAL);
+			if (!lockpend)
+				break;
+		}
+
+		/*
+		 * 4 cpu_relax's if RT tasks present.
+		 */
+		if (lockpend & _Q_PENDING_MASK) {
+			cpu_relax();
+			cpu_relax();
+			cpu_relax();
+		}
+		cpu_relax();
+	}
+	return atomic_read(&lock->val);
+}
-- 
1.8.3.1


* [RFC PATCH 3/7] locking/rtqspinlock: Use static RT priority when in interrupt context
  2017-01-03 18:00 [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Waiman Long
  2017-01-03 18:00 ` [RFC PATCH 1/7] locking/spinlock: Remove the unused spin_lock_bh_nested API Waiman Long
  2017-01-03 18:00 ` [RFC PATCH 2/7] locking/rtqspinlock: Introduce realtime queued spinlocks Waiman Long
@ 2017-01-03 18:00 ` Waiman Long
  2017-01-03 18:00 ` [RFC PATCH 4/7] locking/rtqspinlock: Override spin_lock_nested with special RT variants Waiman Long
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-03 18:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-kernel, Waiman Long

When in interrupt context, the priority of the interrupted task is
meaningless, so a static RT priority is assigned in this case to make
sure that the interrupt handler can get the lock ASAP and thus reduce
the latency imposed on the interrupted task.
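
A minimal sketch of the resulting priority selection in
rt_spin_trylock() (the constants come straight from the hunk below;
rt_task_priority() returns 0 for a non-RT task):

    if (in_interrupt())
        prio = in_nmi() ? MAX_RT_PRIO + 1   /* NMI */
             : in_irq() ? MAX_RT_PRIO       /* hard IRQ */
             : 1;                           /* soft IRQ */
    else
        prio = rt_task_priority(current);   /* 0 => queue as usual */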

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_rt.h | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock_rt.h b/kernel/locking/qspinlock_rt.h
index 69513d6..b6289fb 100644
--- a/kernel/locking/qspinlock_rt.h
+++ b/kernel/locking/qspinlock_rt.h
@@ -19,6 +19,13 @@
  *
  * As RT qspinlock needs the whole pending byte, it cannot be used on kernel
  * configured to support 16K or more CPUs (CONFIG_NR_CPUS).
+ *
+ * In interrupt context, the priority of the interrupted task is not
+ * meaningful. So a fixed static RT priority is used and they won't go into
+ * the MCS wait queue.
+ *  1) Soft IRQ = 1
+ *  2) Hard IRQ = MAX_RT_PRIO
+ *  3) NMI	= MAX_RT_PRIO+1
  */
 #include <linux/sched.h>
 
@@ -60,11 +67,15 @@ static inline int rt_pending(int val)
 static bool rt_spin_trylock(struct qspinlock *lock)
 {
 	struct __qspinlock *l = (void *)lock;
-	u8 prio = rt_task_priority(current);
+	struct task_struct *task = in_interrupt() ? NULL : current;
+	u8 prio;
 
 	BUILD_BUG_ON(_Q_PENDING_BITS != 8);
 
-	if (!prio)
+	if (!task)
+		prio = in_nmi() ? MAX_RT_PRIO + 1
+		     : in_irq() ? MAX_RT_PRIO : 1;
+	else if (!(prio = rt_task_priority(task)))
 		return false;
 
 	/*
-- 
1.8.3.1


* [RFC PATCH 4/7] locking/rtqspinlock: Override spin_lock_nested with special RT variants
  2017-01-03 18:00 [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Waiman Long
                   ` (2 preceding siblings ...)
  2017-01-03 18:00 ` [RFC PATCH 3/7] locking/rtqspinlock: Use static RT priority when in interrupt context Waiman Long
@ 2017-01-03 18:00 ` Waiman Long
  2017-01-03 18:00 ` [RFC PATCH 5/7] locking/rtqspinlock: Handle priority boosting Waiman Long
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-03 18:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-kernel, Waiman Long

In general, the spinlock critical section is typically small enough
that once a CPU acquires a spinlock, it should be able to release the
lock in a reasonably short time. Waiting for the lock, however, can
take a while depending on how many CPUs are contending for the lock.

The exception here is when the lock holder needs to acquire another
spinlock. In this case, it may need to hold the spinlock for a
relatively long time. This can be problematic for RT tasks as it can
cause a lock inversion problem where a lower priority task holds
back a higher priority task.

The maximum level of spinlock nesting in the Linux kernel is 2, and
all those nested spinlock calls are annotated by spin_lock_nested()
or its variants for lock dependency checking. As a result, we can
override those APIs with a special RT spinlock function,
_rt_raw_spin_lock_nested(), that can solve the lock inversion
problem. This is done by passing the outer lock of the nested
spinlock pair into the RT spinlock function, which can then query the
priority of the highest-priority task waiting on the outer lock and
boost its own waiting priority on the inner lock accordingly. If the
outer lock isn't provided, the function defaults to an RT priority of 1.

Two new APIs are provided to allow the passing of the required outer
lock parameter:

 1) spin_lock_nested2(lock, subclass, nest_lock)
 2) spin_lock_irqsave_nested2(lock, flags, subclass, nest_lock)

There are about 70 call sites in the kernel that use spin_lock_nested()
or its variants. They can be modified later on to use the new APIs.
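
For illustration only, a converted call site could look like the
hypothetical snippet below ("outer" and "inner" are made-up names;
SINGLE_DEPTH_NESTING is the usual lockdep subclass):

    spin_lock(&outer->lock);
    /* pass the outer lock so the inner wait can inherit its priority */
    spin_lock_nested2(&inner->lock, SINGLE_DEPTH_NESTING, &outer->lock);
    ...
    spin_unlock(&inner->lock);
    spin_unlock(&outer->lock);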

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/spinlock.h         |  37 +++++++++-
 include/linux/spinlock_api_smp.h |   7 ++
 kernel/locking/qspinlock_rt.h    | 143 +++++++++++++++++++++++++++++++--------
 3 files changed, 156 insertions(+), 31 deletions(-)

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 59248dc..7cd71ab 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -177,7 +177,15 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 
 #define raw_spin_lock(lock)	_raw_spin_lock(lock)
 
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#ifdef CONFIG_REALTIME_QUEUED_SPINLOCKS
+# define raw_spin_lock_nested(lock, subclass)		\
+	_rt_raw_spin_lock_nested(lock, subclass, NULL)
+# define raw_spin_lock_nested2(lock, subclass, nest_lock)\
+	_rt_raw_spin_lock_nested(lock, subclass, nest_lock)
+# define raw_spin_lock_nest_lock(lock, nest_lock)	\
+	_rt_raw_spin_lock_nested(lock, 0, &(nest_lock)->rlock)
+
+#elif defined(CONFIG_DEBUG_LOCK_ALLOC)
 # define raw_spin_lock_nested(lock, subclass) \
 	_raw_spin_lock_nested(lock, subclass)
 
@@ -205,7 +213,16 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 		flags = _raw_spin_lock_irqsave(lock);	\
 	} while (0)
 
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#ifdef CONFIG_REALTIME_QUEUED_SPINLOCKS
+#define raw_spin_lock_irqsave_nested(lock, flags, subclass)		 \
+	raw_spin_lock_irqsave_nested2(lock, flags, subclass, NULL)
+#define raw_spin_lock_irqsave_nested2(lock, flags, subclass, nest_lock)	 \
+	do {								 \
+		typecheck(unsigned long, flags);			 \
+		flags = _rt_raw_spin_lock_irqsave_nested(lock, subclass, \
+							 nest_lock);	 \
+	} while (0)
+#elif defined(CONFIG_DEBUG_LOCK_ALLOC)
 #define raw_spin_lock_irqsave_nested(lock, flags, subclass)		\
 	do {								\
 		typecheck(unsigned long, flags);			\
@@ -232,6 +249,15 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 
 #endif
 
+#ifndef raw_spin_lock_nested2
+#define raw_spin_lock_nested2(lock, subclass, nest_lock)		\
+	raw_spin_lock_nested(lock, subclass)
+#endif
+#ifndef raw_spin_lock_irqsave_nested2
+#define raw_spin_lock_irqsave_nested2(lock, flags, subclass, nest_lock)	\
+	raw_spin_lock_irqsave_nested(lock, flags, subclass)
+#endif
+
 #define raw_spin_lock_irq(lock)		_raw_spin_lock_irq(lock)
 #define raw_spin_lock_bh(lock)		_raw_spin_lock_bh(lock)
 #define raw_spin_unlock(lock)		_raw_spin_unlock(lock)
@@ -314,6 +340,10 @@ static __always_inline int spin_trylock(spinlock_t *lock)
 	raw_spin_lock_nested(spinlock_check(lock), subclass);	\
 } while (0)
 
+#define spin_lock_nested2(lock, subclass, nest_lock)		\
+	raw_spin_lock_nested2(spinlock_check(lock), subclass,	\
+			      spinlock_check(nest_lock))
+
 #define spin_lock_nest_lock(lock, nest_lock)				\
 do {									\
 	raw_spin_lock_nest_lock(spinlock_check(lock), nest_lock);	\
@@ -334,6 +364,9 @@ static __always_inline void spin_lock_irq(spinlock_t *lock)
 	raw_spin_lock_irqsave_nested(spinlock_check(lock), flags, subclass); \
 } while (0)
 
+#define spin_lock_irqsave_nested2(lock, flags, subclass, nest_lock)	\
+	raw_spin_lock_irqsave_nested2(spinlock_check(lock), flags, subclass)
+
 static __always_inline void spin_unlock(spinlock_t *lock)
 {
 	raw_spin_unlock(&lock->rlock);
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 42dfab8..57b9585 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -43,6 +43,13 @@ unsigned long __lockfunc _raw_spin_lock_irqsave(raw_spinlock_t *lock)
 _raw_spin_unlock_irqrestore(raw_spinlock_t *lock, unsigned long flags)
 								__releases(lock);
 
+#ifdef CONFIG_REALTIME_QUEUED_SPINLOCKS
+void __lockfunc _rt_raw_spin_lock_nested(raw_spinlock_t *lock, int subclass,
+		raw_spinlock_t *outerlock)		 __acquires(lock);
+unsigned long __lockfunc _rt_raw_spin_lock_irqsave_nested(raw_spinlock_t *lock,
+		int subclass, raw_spinlock_t *outerlock) __acquires(lock);
+#endif
+
 #ifdef CONFIG_INLINE_SPIN_LOCK
 #define _raw_spin_lock(lock) __raw_spin_lock(lock)
 #endif
diff --git a/kernel/locking/qspinlock_rt.h b/kernel/locking/qspinlock_rt.h
index b6289fb..c2f2722 100644
--- a/kernel/locking/qspinlock_rt.h
+++ b/kernel/locking/qspinlock_rt.h
@@ -26,9 +26,24 @@
  *  1) Soft IRQ = 1
  *  2) Hard IRQ = MAX_RT_PRIO
  *  3) NMI	= MAX_RT_PRIO+1
+ *
+ * The only additional resource that a spinlock holder may need to wait for
+ * before completing a lock critical section is another spinlock. The maximum
+ * level of spinlock nesting that is currently supported is 2. All those
+ * nested spinlock operations are annotated by spin_lock_nested() or its
+ * variants. There are currently about 70 instances of those nested spinlock
+ * calls in the kernel. These call sites can be modified to pass in the
+ * outer lock like what is done in the spin_lock_nest_lock() variant. In
+ * doing so, we can query the highest priority task that is waiting on the
+ * outer lock and adjust our waiting priority accordingly. To speed up nested
+ * spinlock calls, they will have a minimum RT priority of 1 to begin with.
  */
 #include <linux/sched.h>
 
+#ifndef MAX
+#define MAX(a, b)	(((a) >= (b)) ? (a) : (b))
+#endif
+
 /*
  * =========================[ Helper Functions ]=========================
  */
@@ -36,46 +51,36 @@
 /*
  * Translate the priority of a task to an equivalent RT priority
  */
-static u8 rt_task_priority(struct task_struct *task)
+static u8 rt_task_priority(struct task_struct *task, u8 min_prio)
 {
-	int prio = READ_ONCE(task->prio);
+	int prio;
 
-	return (prio >= MAX_RT_PRIO) ? 0 : (u8)(MAX_RT_PRIO - prio);
-}
-
-/*
- * ====================[ Functions Used by qspinlock.c ]====================
- */
+	if (!task)
+		return min_prio;
 
-static inline bool rt_enabled(void)
-{
-	return true;
+	prio = READ_ONCE(task->prio);
+	return (u8)MAX((prio >= MAX_RT_PRIO) ? 0 : MAX_RT_PRIO - prio,
+			min_prio);
 }
 
 /*
- * Return the pending byte portion of the integer value of the lock.
+ * Return: true if lock acquired via RT spinning.
+ *	   false if need to go into MCS wait queue.
  */
-static inline int rt_pending(int val)
-{
-	return val & _Q_PENDING_MASK;
-}
-
-/*
- * Return: true if lock acquired
- *	   false if queuing in the MCS wait queue is needed.
- */
-static bool rt_spin_trylock(struct qspinlock *lock)
+static bool __rt_spin_trylock(struct qspinlock *lock,
+			      struct qspinlock *outerlock, u8 min_prio)
 {
 	struct __qspinlock *l = (void *)lock;
+	struct __qspinlock *ol = (void *)outerlock;
 	struct task_struct *task = in_interrupt() ? NULL : current;
-	u8 prio;
+	u8 prio, mypdprio = 0;
 
 	BUILD_BUG_ON(_Q_PENDING_BITS != 8);
 
 	if (!task)
-		prio = in_nmi() ? MAX_RT_PRIO + 1
-		     : in_irq() ? MAX_RT_PRIO : 1;
-	else if (!(prio = rt_task_priority(task)))
+		min_prio = in_nmi() ? MAX_RT_PRIO + 1
+			 : in_irq() ? MAX_RT_PRIO : 1;
+	if (!(prio = rt_task_priority(task, min_prio)))
 		return false;
 
 	/*
@@ -96,7 +101,7 @@ static bool rt_spin_trylock(struct qspinlock *lock)
 
 		if (!(lockpend & _Q_LOCKED_MASK)) {
 			u16 old = lockpend;
-			u16 new = (pdprio == prio)
+			u16 new = (pdprio == mypdprio)
 				? _Q_LOCKED_VAL : (lockpend | _Q_LOCKED_VAL);
 
 			/*
@@ -112,15 +117,56 @@ static bool rt_spin_trylock(struct qspinlock *lock)
 			pdprio = (u8)(lockpend >> _Q_PENDING_OFFSET);
 		}
 
-		if (pdprio < prio)
-			cmpxchg_relaxed(&l->pending, pdprio, prio);
+		if (pdprio < prio) {
+			/*
+			 * As the RT priority can increase dynamically, we
+			 * need to keep track of what priority value has
+			 * been set in the pending byte of the lock.
+			 */
+			if (cmpxchg_relaxed(&l->pending, pdprio, prio)
+					== pdprio)
+				mypdprio = prio;
+		}
 next:
 		cpu_relax();
+
+		/*
+		 * Recompute pending priority
+		 */
+		prio = MAX(ol ? ol->pending : 0,
+			   rt_task_priority(task, min_prio));
+
 	}
 	return true;
 }
 
 /*
+ * ====================[ Functions Used by qspinlock.c ]====================
+ */
+
+static inline bool rt_enabled(void)
+{
+	return true;
+}
+
+/*
+ * Return the pending byte portion of the integer value of the lock.
+ */
+static inline int rt_pending(int val)
+{
+	return val & _Q_PENDING_MASK;
+}
+
+/*
+ * Return: true if lock acquired
+ *	   false if queuing in the MCS wait queue is needed.
+ */
+static inline bool rt_spin_trylock(struct qspinlock *lock)
+{
+	return __rt_spin_trylock(lock, NULL, 0);
+}
+
+/*
  * We need to make the non-RT tasks wait longer if RT tasks are spinning for
  * the lock. This is done to reduce the chance that a non-RT task may
  * accidentally grab the lock away from the RT tasks in the short interval
@@ -154,3 +200,42 @@ static u32 rt_wait_head_or_retry(struct qspinlock *lock)
 	}
 	return atomic_read(&lock->val);
 }
+
+/*
+ * =================[ Exported Nested Spinlock Functions ]=================
+ */
+
+/*
+ * For nested spinlocks, we give it a minimum RT priority of 1. If the
+ * outerlock is specified, it will boost its priority if the priority of
+ * the highest waiting task in the outer lock is larger than itself.
+ */
+void __lockfunc _rt_raw_spin_lock_nested(raw_spinlock_t *lock, int subclass,
+		raw_spinlock_t *outerlock) __acquires(lock)
+{
+	preempt_disable();
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	if (subclass) {
+		spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
+	} else {
+		type_check(struct lockdep_map *, &outerlock->dep_map);
+		spin_acquire_nest(&lock->dep_map, 0, &outerlock->dep_map,
+				  _RET_IP_);
+	}
+#endif
+	__acquire(lock);
+	__rt_spin_trylock(&lock->raw_lock,
+			  outerlock ? &outerlock->raw_lock : NULL, 1);
+}
+EXPORT_SYMBOL(_rt_raw_spin_lock_nested);
+
+unsigned long __lockfunc _rt_raw_spin_lock_irqsave_nested(raw_spinlock_t *lock,
+		int subclass, raw_spinlock_t *outerlock) __acquires(lock)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	_rt_raw_spin_lock_nested(lock, subclass, outerlock);
+	return flags;
+}
+EXPORT_SYMBOL(_rt_raw_spin_lock_irqsave_nested);
-- 
1.8.3.1


* [RFC PATCH 5/7] locking/rtqspinlock: Handle priority boosting
  2017-01-03 18:00 [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Waiman Long
                   ` (3 preceding siblings ...)
  2017-01-03 18:00 ` [RFC PATCH 4/7] locking/rtqspinlock: Override spin_lock_nested with special RT variants Waiman Long
@ 2017-01-03 18:00 ` Waiman Long
  2017-01-03 18:00 ` [RFC PATCH 6/7] locking/rtqspinlock: Voluntarily yield CPU when need_sched() Waiman Long
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-03 18:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-kernel, Waiman Long

As the priority of a task may get boosted due to an acquired rtmutex,
we need to periodically check the task priority to see if it has been
boosted. For an originally non-RT task, that means unqueuing from
the MCS wait queue before doing RT spinning. So the unqueuing code
from osq_lock is borrowed to handle the unqueuing part of it.
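
Conceptually, the wait loop of a queued (non-RT) waiter added by this
patch (see rt_wait_node_or_unqueue() below) boils down to:

    while (!READ_ONCE(node->locked)) {
        if (rt_task_priority(current, 0))   /* boosted to RT priority? */
            goto unqueue;   /* leave the MCS queue and retry RT spinning */
        cpu_relax();
    }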

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c    |  25 ++++-
 kernel/locking/qspinlock_rt.h | 235 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 253 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cb5c2e5..b25ad58 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -270,10 +270,18 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 /*
  * Realtime queued spinlock is mutually exclusive with native and PV spinlocks.
  */
+#define RT_RETRY	((u32)-1)
+
 #ifdef CONFIG_REALTIME_QUEUED_SPINLOCKS
 #include "qspinlock_rt.h"
 #else
-static __always_inline u32 rt_wait_head_or_retry(struct qspinlock *lock)
+static __always_inline void rt_init_node(struct mcs_spinlock *node, u32 tail) {}
+static __always_inline bool rt_wait_node_or_unqueue(struct qspinlock *lock,
+						    struct mcs_spinlock *node,
+						    struct mcs_spinlock *prev)
+						{ return false; }
+static __always_inline u32  rt_spin_lock_or_retry(struct qspinlock *lock,
+						  struct mcs_spinlock *node)
 						{ return 0; }
 #define rt_pending(v)		0
 #define rt_enabled()		false
@@ -514,6 +522,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	node->locked = 0;
 	node->next = NULL;
 	pv_init_node(node);
+	rt_init_node(node, tail);
 
 	/*
 	 * We touched a (possibly) cold cacheline in the per-cpu queue node;
@@ -552,6 +561,10 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
+
+		if (rt_wait_node_or_unqueue(lock, node, prev))
+			goto queue;	/* Retry RT locking */
+
 		arch_mcs_spin_lock_contended(&node->locked);
 
 		/*
@@ -591,10 +604,14 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 
 	/*
 	 * The RT rt_wait_head_or_lock function, if active, will acquire the
-	 * lock and return a non-zero value.
+	 * lock, release the MCS lock and return a non-zero value. We need to
+	 * retry with RT spinning when RT_RETRY is returned.
 	 */
-	if ((val = rt_wait_head_or_retry(lock)))
-		goto locked;
+	val = rt_spin_lock_or_retry(lock, node);
+	if (val == RT_RETRY)
+		goto queue;
+	else if (val)
+		return;
 
 	val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK));
 
diff --git a/kernel/locking/qspinlock_rt.h b/kernel/locking/qspinlock_rt.h
index c2f2722..0c4d051 100644
--- a/kernel/locking/qspinlock_rt.h
+++ b/kernel/locking/qspinlock_rt.h
@@ -37,6 +37,12 @@
  * doing so, we can query the highest priority task that is waiting on the
  * outer lock and adjust our waiting priority accordingly. To speed up nested
  * spinlock calls, they will have a minimum RT priority of 1 to begin with.
+ *
+ * To handle priority boosting due to an acquired rt-mutex, the task->prio
+ * field is queried in each iteration of the loop. For originally non-RT tasks,
+ * it will have to break out of the MCS wait queue just like what is done
+ * in the OSQ lock. Then it has to retry RT spinning if it has been boosted
+ * to RT priority.
  */
 #include <linux/sched.h>
 
@@ -45,9 +51,56 @@
 #endif
 
 /*
+ * For proper unqueuing from the MCS wait queue, we need to store the encoded
+ * tail code as well as the previous node pointer into the extra MCS node. As
+ * CPUs in interrupt context won't use the per-CPU MCS nodes anymore, only
+ * one is needed for process context CPUs. As a result, we can use the
+ * additional nodes for data storage. Here, we allow 2 nodes per cpu in case
+ * we want to put softIRQ CPUs into the queue as well.
+ */
+struct rt_node {
+	struct mcs_spinlock	mcs;
+	struct mcs_spinlock	__reserved;
+	struct mcs_spinlock	*prev;
+	u32			tail;
+};
+
+/*
  * =========================[ Helper Functions ]=========================
  */
 
+static u32 cmpxchg_tail_acquire(struct qspinlock *lock, u32 old, u32 new)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	return cmpxchg_acquire(&l->tail, old >> _Q_TAIL_OFFSET,
+			       new >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+static u32 cmpxchg_tail_release(struct qspinlock *lock, u32 old, u32 new)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	return cmpxchg_release(&l->tail, old >> _Q_TAIL_OFFSET,
+			       new >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+static inline void rt_write_prev(struct mcs_spinlock *node,
+				 struct mcs_spinlock *prev)
+{
+	WRITE_ONCE(((struct rt_node *)node)->prev, prev);
+}
+
+static inline u32 rt_read_tail(struct mcs_spinlock *node)
+{
+	return READ_ONCE(((struct rt_node *)node)->tail);
+}
+
+static inline struct mcs_spinlock *rt_read_prev(struct mcs_spinlock *node)
+{
+	return READ_ONCE(((struct rt_node *)node)->prev);
+}
+
 /*
  * Translate the priority of a task to an equivalent RT priority
  */
@@ -141,6 +194,56 @@ static bool __rt_spin_trylock(struct qspinlock *lock,
 }
 
 /*
+ * MCS wait queue unqueuing code, borrow mostly from osq_lock.c.
+ */
+static struct mcs_spinlock *
+mcsq_wait_next(struct qspinlock *lock, struct mcs_spinlock *node,
+	       struct mcs_spinlock *prev)
+{
+	 struct mcs_spinlock *next = NULL;
+	 u32 tail = rt_read_tail(node);
+	 u32 old;
+
+	/*
+	 * If there is a prev node in queue, the 'old' value will be
+	 * the prev node's tail value. Otherwise, it's set to 0 since if
+	 * we're the only one in queue, the queue will then become empty.
+	 */
+	old = prev ? rt_read_tail(prev) : 0;
+
+	for (;;) {
+		if ((atomic_read(&lock->val) & _Q_TAIL_MASK) == tail &&
+		    cmpxchg_tail_acquire(lock, tail, old) == tail) {
+			/*
+			 * We are at the queue tail, we moved the @lock back.
+			 * @prev will now observe @lock and will complete its
+			 * unlock()/unqueue().
+			 */
+			break;
+		}
+
+		/*
+		 * We must xchg() the @node->next value, because if we were to
+		 * leave it in, a concurrent unlock()/unqueue() from
+		 * @node->next might complete Step-A and think its @prev is
+		 * still valid.
+		 *
+		 * If the concurrent unlock()/unqueue() wins the race, we'll
+		 * wait for either @lock to point to us, through its Step-B, or
+		 * wait for a new @node->next from its Step-C.
+		 */
+		if (node->next) {
+			next = xchg(&node->next, NULL);
+			if (next)
+				break;
+		}
+
+		cpu_relax();
+	}
+	return next;
+}
+
+/*
  * ====================[ Functions Used by qspinlock.c ]====================
  */
 
@@ -158,6 +261,17 @@ static inline int rt_pending(int val)
 }
 
 /*
+ * Initialize the RT fields of a MCS node.
+ */
+static inline void rt_init_node(struct mcs_spinlock *node, u32 tail)
+{
+	struct rt_node *n = (struct rt_node *)node;
+
+	n->prev = NULL;
+	n->tail = tail;
+}
+
+/*
  * Return: true if lock acquired
  *	   false if queuing in the MCS wait queue is needed.
  */
@@ -167,16 +281,98 @@ static inline bool rt_spin_trylock(struct qspinlock *lock)
 }
 
 /*
+ * Return: true if it has been unqueued and need to retry locking.
+ *	   false if it becomes the wait queue head & proceed to next step.
+ */
+static bool rt_wait_node_or_unqueue(struct qspinlock *lock,
+				    struct mcs_spinlock *node,
+				    struct mcs_spinlock *prev)
+{
+	struct mcs_spinlock *next;
+
+	rt_write_prev(node, prev);	/* Save previous node pointer */
+
+	while (!READ_ONCE(node->locked)) {
+		if (rt_task_priority(current, 0))
+			goto unqueue;
+		cpu_relax();
+	}
+	return false;
+
+unqueue:
+	/*
+	 * Step - A  -- stabilize @prev
+	 *
+	 * Undo our @prev->next assignment; this will make @prev's
+	 * unlock()/unqueue() wait for a next pointer since @lock points
+	 * to us (or later).
+	 */
+	for (;;) {
+		if (prev->next == node &&
+		    cmpxchg(&prev->next, node, NULL) == node)
+			break;
+
+		/*
+		 * We can only fail the cmpxchg() racing against an unlock(),
+		 * in which case we should observe @node->locked becoming
+		 * true.
+		 */
+		if (smp_load_acquire(&node->locked))
+			return false;
+
+		cpu_relax();
+
+		/*
+		 * Or we race against a concurrent unqueue()'s step-B, in which
+		 * case its step-C will write us a new @node->prev pointer.
+		 */
+		prev = rt_read_prev(node);
+	}
+
+	/*
+	 * Step - B -- stabilize @next
+	 *
+	 * Similar to unlock(), wait for @node->next or move @lock from @node
+	 * back to @prev.
+	 */
+	next = mcsq_wait_next(lock, node, prev);
+
+	/*
+	 * Step - C -- unlink
+	 *
+	 * @prev is stable because its still waiting for a new @prev->next
+	 * pointer, @next is stable because our @node->next pointer is NULL and
+	 * it will wait in Step-A.
+	 */
+	if (next) {
+		rt_write_prev(next, prev);
+		WRITE_ONCE(prev->next, next);
+	}
+
+	/*
+	 * Release the node.
+	 */
+	__this_cpu_dec(mcs_nodes[0].count);
+
+	return true;	/* Need to retry RT spinning */
+}
+
+/*
  * We need to make the non-RT tasks wait longer if RT tasks are spinning for
  * the lock. This is done to reduce the chance that a non-RT task may
  * accidentally grab the lock away from the RT tasks in the short interval
  * where the pending priority may be reset after an RT task acquires the lock.
  *
- * Return: Current value of the lock.
+ * Return: RT_RETRY if it needs to retry locking.
+ *	   1 if lock acquired.
  */
-static u32 rt_wait_head_or_retry(struct qspinlock *lock)
+static u32 rt_spin_lock_or_retry(struct qspinlock *lock,
+				 struct mcs_spinlock *node)
 {
 	struct __qspinlock *l = (void *)lock;
+	struct mcs_spinlock *next;
+	bool retry = false;
+	u32 tail;
 
 	for (;;) {
 		u16 lockpend = READ_ONCE(l->locked_pending);
@@ -187,6 +383,14 @@ static u32 rt_wait_head_or_retry(struct qspinlock *lock)
 			if (!lockpend)
 				break;
 		}
+		/*
+		 * We need to break out of the non-RT wait queue and do
+		 * RT spinning if we become an RT task.
+		 */
+		if (rt_task_priority(current, 0)) {
+			retry = true;
+			goto unlock;
+		}
 
 		/*
 		 * 4 cpu_relax's if RT tasks present.
@@ -198,7 +402,32 @@ static u32 rt_wait_head_or_retry(struct qspinlock *lock)
 		}
 		cpu_relax();
 	}
-	return atomic_read(&lock->val);
+
+unlock:
+	/*
+	 * Remove itself from the MCS wait queue (unlock).
+	 */
+	tail = rt_read_tail(node);
+	if (cmpxchg_tail_release(lock, tail, 0) == tail)
+		goto release;
+
+	/*
+	 * Second case.
+	 */
+	next = xchg(&node->next, NULL);
+	if (!next)
+		next = mcsq_wait_next(lock, node, NULL);
+
+	if (next)
+		WRITE_ONCE(next->locked, 1);
+
+release:
+	/*
+	 * Release the node.
+	 */
+	__this_cpu_dec(mcs_nodes[0].count);
+
+	return retry ? RT_RETRY : 1;
 }
 
 /*
-- 
1.8.3.1


* [RFC PATCH 6/7] locking/rtqspinlock: Voluntarily yield CPU when need_sched()
  2017-01-03 18:00 [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Waiman Long
                   ` (4 preceding siblings ...)
  2017-01-03 18:00 ` [RFC PATCH 5/7] locking/rtqspinlock: Handle priority boosting Waiman Long
@ 2017-01-03 18:00 ` Waiman Long
  2017-01-04 10:07   ` Boqun Feng
  2017-01-05 10:16   ` Daniel Bristot de Oliveira
  2017-01-03 18:00 ` [RFC PATCH 7/7] locking/rtqspinlock: Enable collection of event counts Waiman Long
  2017-01-04 12:49 ` [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Peter Zijlstra
  7 siblings, 2 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-03 18:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-kernel, Waiman Long

Ideally we want the CPU to be preemptible even when inside or waiting
for a lock. We cannot make it preemptible when inside a lock critical
section, but we can try to make the task voluntarily yield the CPU
when waiting for a lock.

This patch checks the need_resched() flag and yields the CPU when the
preemption count is 1, i.e. when the spin_lock() call wasn't made in a
region that already disallows preemption. Otherwise, it will just
perform RT spinning with a minimum priority of 1.
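
The yield condition is sketched below (taken from the hunk that
follows): the CPU is yielded only when rescheduling is needed and the
sole preemption-count contribution is the preempt_disable() done by
the spin_lock() call itself.

    #define rt_should_resched()    (preempt_count() == \
                                   (PREEMPT_OFFSET | PREEMPT_NEED_RESCHED))

    if (rt_should_resched())
        schedule_preempt_disabled();    /* voluntarily yield the CPU */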

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_rt.h | 68 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/qspinlock_rt.h b/kernel/locking/qspinlock_rt.h
index 0c4d051..18ec1f8 100644
--- a/kernel/locking/qspinlock_rt.h
+++ b/kernel/locking/qspinlock_rt.h
@@ -43,6 +43,16 @@
  * it will have to break out of the MCS wait queue just like what is done
  * in the OSQ lock. Then it has to retry RT spinning if it has been boosted
  * to RT priority.
+ *
+ * Another RT requirement is that the CPU needs to be preemptible even when
+ * waiting for a spinlock. If the task has already acquired the lock, we
+ * will let it run to completion to release the lock and reenable preemption.
+ * For non-nested spinlock, a spinlock waiter will periodically check
+ * need_resched flag to see if it should break out of the waiting loop and
+ * yield the CPU as long as the preemption count indicates just one
+ * preempt_disabled(). For nested spinlock with outer lock acquired, it will
+ * boost its priority to the highest RT priority level to try to acquire the
+ * inner lock, finish up its work, release the locks and reenable preemption.
  */
 #include <linux/sched.h>
 
@@ -51,6 +61,15 @@
 #endif
 
 /*
+ * Rescheduling is only needed when it is in the task context, the
+ * PREEMPT_NEED_RESCHED flag is set and the preemption count is one.
+ * If only the TIF_NEED_RESCHED flag is set, it will be moved to RT
+ * spinning with a minimum priority of 1.
+ */
+#define rt_should_resched()	(preempt_count() == \
+				(PREEMPT_OFFSET | PREEMPT_NEED_RESCHED))
+
+/*
  * For proper unqueuing from the MCS wait queue, we need to store the encoded
  * tail code as well as the previous node pointer into the extra MCS node. As
  * CPUs in interrupt context won't use the per-CPU MCS nodes anymore, only
@@ -133,9 +152,12 @@ static bool __rt_spin_trylock(struct qspinlock *lock,
 	if (!task)
 		min_prio = in_nmi() ? MAX_RT_PRIO + 1
 			 : in_irq() ? MAX_RT_PRIO : 1;
+	else if (need_resched() && !min_prio)
+		min_prio = 1;
 	if (!(prio = rt_task_priority(task, min_prio)))
 		return false;
 
+
 	/*
 	 * Spin on the lock and try to set its priority into the pending byte.
 	 */
@@ -189,6 +211,33 @@ static bool __rt_spin_trylock(struct qspinlock *lock,
 		prio = MAX(ol ? ol->pending : 0,
 			   rt_task_priority(task, min_prio));
 
+		/*
+		 * If another task needs this CPU, we will yield it if in
+		 * the process context and it is not a nested spinlock call.
+		 * Otherwise, we will raise our RT priority to try to get
+		 * the lock ASAP.
+		 */
+		if (!task || !rt_should_resched())
+			continue;
+
+		if (outerlock) {
+			if (min_prio < MAX_RT_PRIO)
+				min_prio = MAX_RT_PRIO;
+			continue;
+		}
+
+		/*
+		 * In the unlikely event that we need to relinquish the CPU,
+		 * we need to make sure that we are not the highest priority
+		 * task waiting for the lock.
+		 */
+		if (mypdprio) {
+			lockpend = READ_ONCE(l->locked_pending);
+			pdprio = (u8)(lockpend >> _Q_PENDING_OFFSET);
+			if (pdprio == mypdprio)
+				cmpxchg_relaxed(&l->pending, pdprio, 0);
+		}
+		schedule_preempt_disabled();
 	}
 	return true;
 }
@@ -293,7 +342,7 @@ static bool rt_wait_node_or_unqueue(struct qspinlock *lock,
 	rt_write_prev(node, prev);	/* Save previous node pointer */
 
 	while (!READ_ONCE(node->locked)) {
-		if (rt_task_priority(current, 0))
+		if (rt_task_priority(current, 0) || need_resched())
 			goto unqueue;
 		cpu_relax();
 	}
@@ -354,6 +403,12 @@ static bool rt_wait_node_or_unqueue(struct qspinlock *lock,
 	 */
 	__this_cpu_dec(mcs_nodes[0].count);
 
+	/*
+	 * Yield the CPU if needed by another task with the right condition.
+	 */
+	if (rt_should_resched())
+		schedule_preempt_disabled();
+
 	return true;	/* Need to retry RT spinning */
 }
 
@@ -385,9 +440,10 @@ static u32 rt_spin_lock_or_retry(struct qspinlock *lock,
 		}
 		/*
 		 * We need to break out of the non-RT wait queue and do
-		 * RT spinning if we become an RT task.
+		 * RT spinning if we become an RT task or another task needs
+		 * the CPU.
 		 */
-		if (rt_task_priority(current, 0)) {
+		if (rt_task_priority(current, 0) || need_resched()) {
 			retry = true;
 			goto unlock;
 		}
@@ -427,6 +483,12 @@ static u32 rt_spin_lock_or_retry(struct qspinlock *lock,
 	 */
 	__this_cpu_dec(mcs_nodes[0].count);
 
+	/*
+	 * Yield the CPU if needed by another task with the right condition.
+	 */
+	if (retry && rt_should_resched())
+		schedule_preempt_disabled();
+
 	return retry ? RT_RETRY : 1;
 }
 
-- 
1.8.3.1


* [RFC PATCH 7/7] locking/rtqspinlock: Enable collection of event counts
  2017-01-03 18:00 [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Waiman Long
                   ` (5 preceding siblings ...)
  2017-01-03 18:00 ` [RFC PATCH 6/7] locking/rtqspinlock: Voluntarily yield CPU when need_sched() Waiman Long
@ 2017-01-03 18:00 ` Waiman Long
  2017-01-04 12:49 ` [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Peter Zijlstra
  7 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-03 18:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
  Cc: linux-kernel, Waiman Long

This patch enables the collection of event counts in the slowpath of
the realtime queued spinlocks. The following events are tracked when
CONFIG_QUEUED_LOCK_STAT is defined (a usage sketch follows the list):

 - # of interrupt context RT spinnings
 - # of task context RT spinnings
 - # of nested spinlock RT spinnings
 - # of unqueue operations due to RT priority
 - # of unqueue operations due to need_resched()
 - # of CPU yieldings
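
As a usage sketch, both of the calls below appear in the hunks that
follow; qstat_inc() bumps a single counter while the new
qstat_inc_either() picks one of two counters based on a condition:

    /* task context vs. interrupt context RT spinning */
    qstat_inc_either(qstat_rt_spin_task, qstat_rt_spin_irq, task);

    /* voluntary CPU yielding */
    qstat_inc(qstat_rt_resched, true);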

Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/x86/Kconfig                | 16 ++++----
 kernel/locking/qspinlock_rt.h   | 15 +++++++-
 kernel/locking/qspinlock_stat.h | 81 ++++++++++++++++++++++++++++++++++++-----
 3 files changed, 92 insertions(+), 20 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7a97b31..e0dc3c8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -694,6 +694,14 @@ config SCHED_OMIT_FRAME_POINTER
 
 	  If in doubt, say "Y".
 
+config QUEUED_LOCK_STAT
+	bool "Queued spinlock statistics"
+	depends on (PARAVIRT_SPINLOCKS || REALTIME_QUEUED_SPINLOCKS) && DEBUG_FS
+	---help---
+	  Enable the collection of statistical data on the slowpath
+	  behavior of paravirtualized or realtime queued spinlocks
+	  and report them on debugfs.
+
 menuconfig HYPERVISOR_GUEST
 	bool "Linux guest support"
 	---help---
@@ -734,14 +742,6 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer Y.
 
-config QUEUED_LOCK_STAT
-	bool "Paravirt queued spinlock statistics"
-	depends on PARAVIRT_SPINLOCKS && DEBUG_FS
-	---help---
-	  Enable the collection of statistical data on the slowpath
-	  behavior of paravirtualized queued spinlocks and report
-	  them on debugfs.
-
 source "arch/x86/xen/Kconfig"
 
 config KVM_GUEST
diff --git a/kernel/locking/qspinlock_rt.h b/kernel/locking/qspinlock_rt.h
index 18ec1f8..f6f8498 100644
--- a/kernel/locking/qspinlock_rt.h
+++ b/kernel/locking/qspinlock_rt.h
@@ -55,6 +55,7 @@
  * inner lock, finish up its work, release the locks and reenable preemption.
  */
 #include <linux/sched.h>
+#include "qspinlock_stat.h"
 
 #ifndef MAX
 #define MAX(a, b)	(((a) >= (b)) ? (a) : (b))
@@ -157,6 +158,7 @@ static bool __rt_spin_trylock(struct qspinlock *lock,
 	if (!(prio = rt_task_priority(task, min_prio)))
 		return false;
 
+	qstat_inc_either(qstat_rt_spin_task, qstat_rt_spin_irq, task);
 
 	/*
 	 * Spin on the lock and try to set its priority into the pending byte.
@@ -237,6 +239,7 @@ static bool __rt_spin_trylock(struct qspinlock *lock,
 			if (pdprio == mypdprio)
 				cmpxchg_relaxed(&l->pending, pdprio, 0);
 		}
+		qstat_inc(qstat_rt_resched, true);
 		schedule_preempt_disabled();
 	}
 	return true;
@@ -349,6 +352,9 @@ static bool rt_wait_node_or_unqueue(struct qspinlock *lock,
 	return false;
 
 unqueue:
+	qstat_inc_either(qstat_rt_unqueue_sched, qstat_rt_unqueue_prio,
+			 need_resched());
+
 	/*
 	 * Step - A  -- stabilize @prev
 	 *
@@ -406,8 +412,10 @@ static bool rt_wait_node_or_unqueue(struct qspinlock *lock,
 	/*
 	 * Yield the CPU if needed by another task with the right condition.
 	 */
-	if (rt_should_resched())
+	if (rt_should_resched()) {
+		qstat_inc(qstat_rt_resched, true);
 		schedule_preempt_disabled();
+	}
 
 	return true;	/* Need to retry RT spinning */
 }
@@ -486,8 +494,10 @@ static u32 rt_spin_lock_or_retry(struct qspinlock *lock,
 	/*
 	 * Yield the CPU if needed by another task with the right condition.
 	 */
-	if (retry && rt_should_resched())
+	if (retry && rt_should_resched()) {
+		qstat_inc(qstat_rt_resched, true);
 		schedule_preempt_disabled();
+	}
 
 	return retry ? RT_RETRY : 1;
 }
@@ -514,6 +524,7 @@ void __lockfunc _rt_raw_spin_lock_nested(raw_spinlock_t *lock, int subclass,
 				  _RET_IP_);
 	}
 #endif
+	qstat_inc(qstat_rt_spin_nest, true);
 	__acquire(lock);
 	__rt_spin_trylock(&lock->raw_lock,
 			  outerlock ? &outerlock->raw_lock : NULL, 1);
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index e852be4..d212c71 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -14,9 +14,10 @@
 
 /*
  * When queued spinlock statistical counters are enabled, the following
- * debugfs files will be created for reporting the counter values:
+ * debugfs files will be created under the <debugfs>/qlockstat directory
+ * for reporting the counter values:
  *
- * <debugfs>/qlockstat/
+ * PV qspinlock specific files:
  *   pv_hash_hops	- average # of hops per hashing operation
  *   pv_kick_unlock	- # of vCPU kicks issued at unlock time
  *   pv_kick_wake	- # of vCPU kicks used for computing pv_latency_wake
@@ -30,6 +31,14 @@
  *   pv_wait_head	- # of vCPU wait's at the queue head
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node
  *
+ * RT qspinlock specific files:
+ *   rt_resched		- # of voluntary CPU yieldings
+ *   rt_spin_irq	- # of interrupt context RT spinnings
+ *   rt_spin_nest	- # of nested spinlock RT spinnings
+ *   rt_spin_task	- # of task context RT spinnings
+ *   rt_unqueue_prio	- # of unqueue operations due to RT priority
+ *   rt_unqueue_sched	- # of unqueue operations due to need_resched()
+ *
  * Writing to the "reset_counters" file will reset all the above counter
  * values.
  *
@@ -40,12 +49,11 @@
  *
  * There may be slight difference between pv_kick_wake and pv_kick_unlock.
  */
+#ifndef __KERNEL_LOCKING_QSPINLOCK_STAT_H
+#define __KERNEL_LOCKING_QSPINLOCK_STAT_H
+
 enum qlock_stats {
-	qstat_pv_hash_hops,
-	qstat_pv_kick_unlock,
-	qstat_pv_kick_wake,
-	qstat_pv_latency_kick,
-	qstat_pv_latency_wake,
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 	qstat_pv_lock_slowpath,
 	qstat_pv_lock_stealing,
 	qstat_pv_spurious_wakeup,
@@ -53,6 +61,26 @@ enum qlock_stats {
 	qstat_pv_wait_early,
 	qstat_pv_wait_head,
 	qstat_pv_wait_node,
+#endif
+	/*
+	 * These enums are needed to avoid compilation error even though
+	 * they are not used when CONFIG_PARAVIRT_SPINLOCKS isn't defined.
+	 */
+	qstat_pv_hash_hops,
+	qstat_pv_latency_kick,
+	qstat_pv_latency_wake,
+	qstat_pv_kick_unlock,
+	qstat_pv_kick_wake,
+
+#ifdef CONFIG_REALTIME_QUEUED_SPINLOCKS
+	qstat_rt_resched,
+	qstat_rt_spin_irq,
+	qstat_rt_spin_nest,
+	qstat_rt_spin_task,
+	qstat_rt_unqueue_prio,
+	qstat_rt_unqueue_sched,
+#endif
+
 	qstat_num,	/* Total number of statistical counters */
 	qstat_reset_cnts = qstat_num,
 };
@@ -66,6 +94,7 @@ enum qlock_stats {
 #include <linux/fs.h>
 
 static const char * const qstat_names[qstat_num + 1] = {
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 	[qstat_pv_hash_hops]	   = "pv_hash_hops",
 	[qstat_pv_kick_unlock]     = "pv_kick_unlock",
 	[qstat_pv_kick_wake]       = "pv_kick_wake",
@@ -78,6 +107,17 @@ enum qlock_stats {
 	[qstat_pv_wait_early]      = "pv_wait_early",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
+#endif
+
+#ifdef CONFIG_REALTIME_QUEUED_SPINLOCKS
+	[qstat_rt_resched]         = "rt_resched",
+	[qstat_rt_spin_irq]        = "rt_spin_irq",
+	[qstat_rt_spin_nest]       = "rt_spin_nest",
+	[qstat_rt_spin_task]       = "rt_spin_task",
+	[qstat_rt_unqueue_prio]    = "rt_unqueue_prio",
+	[qstat_rt_unqueue_sched]   = "rt_unqueue_sched",
+#endif
+
 	[qstat_reset_cnts]         = "reset_counters",
 };
 
@@ -85,7 +125,9 @@ enum qlock_stats {
  * Per-cpu counters
  */
 static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 static DEFINE_PER_CPU(u64, pv_kick_time);
+#endif
 
 /*
  * Function to read and return the qlock statistical counter values
@@ -214,8 +256,8 @@ static int __init init_qspinlock_stat(void)
 	 * performance.
 	 */
 	for (i = 0; i < qstat_num; i++)
-		if (!debugfs_create_file(qstat_names[i], 0400, d_qstat,
-					 (void *)(long)i, &fops_qstat))
+		if (qstat_names[i] && !debugfs_create_file(qstat_names[i],
+				0400, d_qstat, (void *)(long)i, &fops_qstat))
 			goto fail_undo;
 
 	if (!debugfs_create_file(qstat_names[qstat_reset_cnts], 0200, d_qstat,
@@ -232,7 +274,7 @@ static int __init init_qspinlock_stat(void)
 fs_initcall(init_qspinlock_stat);
 
 /*
- * Increment the PV qspinlock statistical counters
+ * Increment the qspinlock statistical counter.
  */
 static inline void qstat_inc(enum qlock_stats stat, bool cond)
 {
@@ -241,6 +283,20 @@ static inline void qstat_inc(enum qlock_stats stat, bool cond)
 }
 
 /*
+ * Increment either one of the qspinlock statistical counters depending
+ * on the given condition.
+ */
+static inline void qstat_inc_either(enum qlock_stats true_stat,
+				    enum qlock_stats false_stat, bool cond)
+{
+	if (cond)
+		this_cpu_inc(qstats[true_stat]);
+	else
+		this_cpu_inc(qstats[false_stat]);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/*
  * PV hash hop count
  */
 static inline void qstat_hop(int hopcnt)
@@ -279,9 +335,14 @@ static inline void __pv_wait(u8 *ptr, u8 val)
 #define pv_kick(c)	__pv_kick(c)
 #define pv_wait(p, v)	__pv_wait(p, v)
 
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
 #else /* CONFIG_QUEUED_LOCK_STAT */
 
 static inline void qstat_inc(enum qlock_stats stat, bool cond)	{ }
+static inline void qstat_inc_either(enum qlock_stats true_stat,
+		   enum qlock_stats false_stat, bool cond)	{ }
 static inline void qstat_hop(int hopcnt)			{ }
 
 #endif /* CONFIG_QUEUED_LOCK_STAT */
+#endif /* __KERNEL_LOCKING_QSPINLOCK_STAT_H */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 6/7] locking/rtqspinlock: Voluntarily yield CPU when need_sched()
  2017-01-03 18:00 ` [RFC PATCH 6/7] locking/rtqspinlock: Voluntarily yield CPU when need_sched() Waiman Long
@ 2017-01-04 10:07   ` Boqun Feng
  2017-01-04 21:57     ` Waiman Long
  2017-01-05 10:16   ` Daniel Bristot de Oliveira
  1 sibling, 1 reply; 24+ messages in thread
From: Boqun Feng @ 2017-01-04 10:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2599 bytes --]

On Tue, Jan 03, 2017 at 01:00:29PM -0500, Waiman Long wrote:
> Ideally we want the CPU to be preemptible even when inside or waiting
> for a lock. We cannot make it preemptible when inside a lock critical
> section, but we can try to make the task voluntarily yield the CPU
> when waiting for a lock.
> 
> This patch checks the need_sched() flag and yields the CPU when the
> preemption count is 1. IOW, the spin_lock() call isn't done in a
> region that doesn't allow preemption. Otherwise, it will just perform
> RT spinning with a minimum priority of 1.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/locking/qspinlock_rt.h | 68 +++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 65 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock_rt.h b/kernel/locking/qspinlock_rt.h
> index 0c4d051..18ec1f8 100644
> --- a/kernel/locking/qspinlock_rt.h
> +++ b/kernel/locking/qspinlock_rt.h
> @@ -43,6 +43,16 @@
>   * it will have to break out of the MCS wait queue just like what is done
>   * in the OSQ lock. Then it has to retry RT spinning if it has been boosted
>   * to RT priority.
> + *
> + * Another RT requirement is that the CPU need to be preemptible even when
> + * waiting for a spinlock. If the task has already acquired the lock, we
> + * will let it run to completion to release the lock and reenable preemption.
> + * For non-nested spinlock, a spinlock waiter will periodically check
> + * need_resched flag to see if it should break out of the waiting loop and
> + * yield the CPU as long as the preemption count indicates just one
> + * preempt_disabled(). For nested spinlock with outer lock acquired, it will
> + * boost its priority to the highest RT priority level to try to acquire the
> + * inner lock, finish up its work, release the locks and reenable preemption.
>   */
>  #include <linux/sched.h>
>  
> @@ -51,6 +61,15 @@
>  #endif
>  
>  /*
> + * Rescheduling is only needed when it is in the task context, the
> + * PREEMPT_NEED_RESCHED flag is set and the preemption count is one.
> + * If only the TIF_NEED_RESCHED flag is set, it will be moved to RT
> + * spinning with a minimum priority of 1.
> + */
> +#define rt_should_resched()	(preempt_count() == \
> +				(PREEMPT_OFFSET | PREEMPT_NEED_RESCHED))
> +

Maybe I am missing something... but

On x86, PREEMPT_NEED_RESCHED is used in an inverting style, i.e. 0
indicates "need to reschedule" and preempt_count() masks away this very
bit, which makes rt_should_resched() always false. So...
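
For reference, arch/x86/include/asm/preempt.h masks the bit out of
preempt_count(), and something along the lines of
should_resched(PREEMPT_OFFSET) would express the intended condition on
both the x86 (inverted bit) and the generic layout. A rough sketch only,
not a tested fix:

	/* x86: the need-resched bit is masked out of preempt_count() */
	static __always_inline int preempt_count(void)
	{
		return raw_cpu_read_4(__preempt_count) & ~PREEMPT_NEED_RESCHED;
	}

	/*
	 * Illustration: true when the preempt count is exactly
	 * PREEMPT_OFFSET and a reschedule is pending, on either layout.
	 */
	#define rt_should_resched()	should_resched(PREEMPT_OFFSET)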

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-03 18:00 [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Waiman Long
                   ` (6 preceding siblings ...)
  2017-01-03 18:00 ` [RFC PATCH 7/7] locking/rtqspinlock: Enable collection of event counts Waiman Long
@ 2017-01-04 12:49 ` Peter Zijlstra
  2017-01-04 15:25   ` Waiman Long
  7 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2017-01-04 12:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel,
	Steven Rostedt

On Tue, Jan 03, 2017 at 01:00:23PM -0500, Waiman Long wrote:
> This patchset introduces a new variant of queued spinlocks - the
> realtime queued spinlocks. The purpose of this new variant is to
> support real spinlock in a realtime environment where high priority
> RT tasks should be allowed to complete its work ASAP. This means as
> little waiting time for spinlocks as possible.
> 
> Non-RT tasks will wait for spinlocks in the MCS waiting queue as
> usual. RT tasks and interrupts will spin directly on the spinlocks
> and use the priority value in the pending byte to arbitrate who get
> the lock first.
> 
> Patch 1 removes the unused spin_lock_bh_nested() API.
> 
> Patch 2 introduces the basic realtime queued spinlocks where the
> pending byte is used for storing the priority of the highest priority
> RT task that is waiting on the spinlock. All the RT tasks will spin
> directly on the spinlock instead of waiting in the queue.
> 


OK, so a single numerical field isn't sufficient to describe priority
anymore; since we added DEADLINE support, things have gotten a lot more
complex.

Also, the whole approach worries me; it has the very real possibility of
re-introducing a bunch of starvation cases avoided by the fair lock.


Is there a real problem with -RT that inspired these patches?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-04 12:49 ` [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Peter Zijlstra
@ 2017-01-04 15:25   ` Waiman Long
  2017-01-04 15:55     ` Steven Rostedt
                       ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-04 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel,
	Steven Rostedt

On 01/04/2017 07:49 AM, Peter Zijlstra wrote:
> On Tue, Jan 03, 2017 at 01:00:23PM -0500, Waiman Long wrote:
>> This patchset introduces a new variant of queued spinlocks - the
>> realtime queued spinlocks. The purpose of this new variant is to
>> support real spinlock in a realtime environment where high priority
>> RT tasks should be allowed to complete its work ASAP. This means as
>> little waiting time for spinlocks as possible.
>>
>> Non-RT tasks will wait for spinlocks in the MCS waiting queue as
>> usual. RT tasks and interrupts will spin directly on the spinlocks
>> and use the priority value in the pending byte to arbitrate who get
>> the lock first.
>>
>> Patch 1 removes the unused spin_lock_bh_nested() API.
>>
>> Patch 2 introduces the basic realtime queued spinlocks where the
>> pending byte is used for storing the priority of the highest priority
>> RT task that is waiting on the spinlock. All the RT tasks will spin
>> directly on the spinlock instead of waiting in the queue.
>>
>
> OK, so a single numerical field isn't sufficient to describe priority
> anymore, since we added DEADLINE support things have gotten a lot more
> complex.

From what I read of the code, DL tasks all have the same priority, which
is higher than that of any RT task. So you mean DL tasks have some other
property that categorizes them into different sub-priorities and that is
not reflected in their priority level. Is that right?

> Also, the whole approach worries me, it has the very real possibility of
> re-introducing a bunch of starvation cases avoided by the fair lock.

Starvation can happen when there is a constant stream of RT or DL tasks
grabbing the lock, or when there is an interrupt storm. However, I am
making the assumption that RT systems should have sufficient resources
available that the RT tasks won't saturate the hardware; otherwise we
can't provide any RT guarantee anyway.

> Is there a real problem with -RT that inspired these patches?

I know that in -RT kernel, all the non-raw spinlocks are replaced by
rtmutex which is a sleeping lock. This can have a real performance
impact on systems with more than a few cores. The rtmutex isn't fair either.

Do you think it is better to keep the raw spinlocks fair and only have
the non-raw spinlocks use the RT version?

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-04 15:25   ` Waiman Long
@ 2017-01-04 15:55     ` Steven Rostedt
  2017-01-04 20:02       ` Waiman Long
  2017-01-05  9:26     ` Daniel Bristot de Oliveira
  2017-01-05  9:44     ` Peter Zijlstra
  2 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2017-01-04 15:55 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	linux-kernel

On Wed, 4 Jan 2017 10:25:14 -0500
Waiman Long <longman@redhat.com> wrote:

 
> I know that in -RT kernel, all the non-raw spinlocks are replaced by
> rtmutex which is a sleeping lock. This can have a real performance
> impact on systems with more than a few cores. The rtmutex isn't fair either.

We do fine on 80+ CPUs. Is that enough cores for you ;-)

Note, it's not a true sleeping lock, because of the adaptive nature.
That is, it spins unless the owner of the lock is sleeping, in which
case, it too will sleep (why spin waiting for a task that isn't
running). But if the owner is running, it will spin too.

We also have tricks to hold off normal preemption (like SCHED_OTHER
tasks running out of their time slot) while they hold a lock. This keeps
contention down by not having tasks sleep while owning locks.

> 
> Do you think it is better to keep the raw spinlocks fair and only have
> the non-raw spinlocks use the RT version?

Yes.

Note, I also want to get rt_mutex into the kernel first for all sleeping
locks. That is, get the logic in before we convert spin_locks to
sleeping locks.

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-04 15:55     ` Steven Rostedt
@ 2017-01-04 20:02       ` Waiman Long
  2017-01-05 18:43         ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2017-01-04 20:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	linux-kernel

On 01/04/2017 10:55 AM, Steven Rostedt wrote:
> On Wed, 4 Jan 2017 10:25:14 -0500
> Waiman Long <longman@redhat.com> wrote:
>
>  
>> I know that in -RT kernel, all the non-raw spinlocks are replaced by
>> rtmutex which is a sleeping lock. This can have a real performance
>> impact on systems with more than a few cores. The rtmutex isn't fair either.
> We do fine on 80+ CPUs. Is that enough cores for you ;-)

It depends on what you mean by fine. I haven't done the actual benchmark
yet, but I believe that a -RT system with 80+ CPUs will feel noticeably
slower for non-RT tasks than a system with a non-RT kernel. I am not
saying that the system will slow down significantly, but the slowdown
will certainly be noticeable.

> Note, it's not a true sleeping lock, because of the adaptive nature.
> That is, it spins unless the owner of the lock is sleeping, in which
> case, it too will sleep (why spin waiting for a task that isn't
> running). But if the owner is running, it will spin too.

I think this is a recent change by Davidlohr. However, the spinning is
limited to the top waiter. The rest will still go to sleep.

> We also have tricks to keep normal preemption (like SCHED_OTHER tasks
> running out of their time slot) when they have a lock. This keeps
> contention down on tasks owning locks while sleeping.

Yes, that certainly helps.

>> Do you think it is better to keep the raw spinlocks fair and only have
>> the non-raw spinlocks use the RT version?
> Yes.

OK, I can certainly do that.

>
> Note, I also want to get rt_mutex into the kernel first for all sleeping
> locks. That is, get the logic in before we convert spin_locks to
> sleeping locks.
>
> -- Steve

The rtmutex code is already in the upstream kernel. It is just that
there isn't any code yet that can let you flip a switch (a config
option) and change all sleeping locks to use rtmutex.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 6/7] locking/rtqspinlock: Voluntarily yield CPU when need_sched()
  2017-01-04 10:07   ` Boqun Feng
@ 2017-01-04 21:57     ` Waiman Long
  0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-04 21:57 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	linux-kernel

On 01/04/2017 05:07 AM, Boqun Feng wrote:
> On Tue, Jan 03, 2017 at 01:00:29PM -0500, Waiman Long wrote:
>> Ideally we want the CPU to be preemptible even when inside or waiting
>> for a lock. We cannot make it preemptible when inside a lock critical
>> section, but we can try to make the task voluntarily yield the CPU
>> when waiting for a lock.
>>
>> This patch checks the need_sched() flag and yields the CPU when the
>> preemption count is 1. IOW, the spin_lock() call isn't done in a
>> region that doesn't allow preemption. Otherwise, it will just perform
>> RT spinning with a minimum priority of 1.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/locking/qspinlock_rt.h | 68 +++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 65 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock_rt.h b/kernel/locking/qspinlock_rt.h
>> index 0c4d051..18ec1f8 100644
>> --- a/kernel/locking/qspinlock_rt.h
>> +++ b/kernel/locking/qspinlock_rt.h
>> @@ -43,6 +43,16 @@
>>   * it will have to break out of the MCS wait queue just like what is done
>>   * in the OSQ lock. Then it has to retry RT spinning if it has been boosted
>>   * to RT priority.
>> + *
>> + * Another RT requirement is that the CPU need to be preemptible even when
>> + * waiting for a spinlock. If the task has already acquired the lock, we
>> + * will let it run to completion to release the lock and reenable preemption.
>> + * For non-nested spinlock, a spinlock waiter will periodically check
>> + * need_resched flag to see if it should break out of the waiting loop and
>> + * yield the CPU as long as the preemption count indicates just one
>> + * preempt_disabled(). For nested spinlock with outer lock acquired, it will
>> + * boost its priority to the highest RT priority level to try to acquire the
>> + * inner lock, finish up its work, release the locks and reenable preemption.
>>   */
>>  #include <linux/sched.h>
>>  
>> @@ -51,6 +61,15 @@
>>  #endif
>>  
>>  /*
>> + * Rescheduling is only needed when it is in the task context, the
>> + * PREEMPT_NEED_RESCHED flag is set and the preemption count is one.
>> + * If only the TIF_NEED_RESCHED flag is set, it will be moved to RT
>> + * spinning with a minimum priority of 1.
>> + */
>> +#define rt_should_resched()	(preempt_count() == \
>> +				(PREEMPT_OFFSET | PREEMPT_NEED_RESCHED))
>> +
> Maybe I am missing something... but
>
> On x86, PREEMPT_NEED_RESCHED is used in an inverting style, i.e. 0
> indicates "need to reschedule" and preempt_count() masks away this very
> bit, which makes rt_should_resched() always false. So...
>
> Regards,
> Boqun

You are right. I think I misunderstood what the preemption code is
doing. I think I need to revise the code to fix that. Thanks for
spotting this.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-04 15:25   ` Waiman Long
  2017-01-04 15:55     ` Steven Rostedt
@ 2017-01-05  9:26     ` Daniel Bristot de Oliveira
  2017-01-05  9:44     ` Peter Zijlstra
  2 siblings, 0 replies; 24+ messages in thread
From: Daniel Bristot de Oliveira @ 2017-01-05  9:26 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel,
	Steven Rostedt

On 01/04/2017 04:25 PM, Waiman Long wrote:
>> OK, so a single numerical field isn't sufficient to describe priority
>> anymore, since we added DEADLINE support things have gotten a lot more
>> complex.
> From what I read from the code, DL tasks all have the same priority that
> is higher than any of the RT tasks. So you mean DL tasks have other
> property that kind of categorizing them into different sub-priorities
> that is not being reflected in their priority level. Is that right?

DL tasks are scheduled according to their absolute deadline, which is
stored in the u64 curr->dl.deadline.

It is more complex because the priority of a deadline task is always
changing. The priority of a DL task changes on every new
periodic/sporadic replenishment/activation.
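
To make that concrete, any lock arbitration involving DL tasks would
have to compare deadlines rather than a static priority value. A minimal
sketch using dl_time_before() from include/linux/sched/deadline.h
(dl_waiter_earlier() is just an illustrative name, not existing code):

	#include <linux/sched.h>
	#include <linux/sched/deadline.h>

	/* Sketch: should DL waiter @a win the lock over DL waiter @b? */
	static inline bool dl_waiter_earlier(struct task_struct *a,
					     struct task_struct *b)
	{
		return dl_time_before(a->dl.deadline, b->dl.deadline);
	}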


-- Daniel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-04 15:25   ` Waiman Long
  2017-01-04 15:55     ` Steven Rostedt
  2017-01-05  9:26     ` Daniel Bristot de Oliveira
@ 2017-01-05  9:44     ` Peter Zijlstra
  2017-01-05 15:55       ` Waiman Long
  2 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2017-01-05  9:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel,
	Steven Rostedt

On Wed, Jan 04, 2017 at 10:25:14AM -0500, Waiman Long wrote:
> On 01/04/2017 07:49 AM, Peter Zijlstra wrote:
> > On Tue, Jan 03, 2017 at 01:00:23PM -0500, Waiman Long wrote:
> >> This patchset introduces a new variant of queued spinlocks - the
> >> realtime queued spinlocks. The purpose of this new variant is to
> >> support real spinlock in a realtime environment where high priority
> >> RT tasks should be allowed to complete its work ASAP. This means as
> >> little waiting time for spinlocks as possible.
> >>
> >> Non-RT tasks will wait for spinlocks in the MCS waiting queue as
> >> usual. RT tasks and interrupts will spin directly on the spinlocks
> >> and use the priority value in the pending byte to arbitrate who get
> >> the lock first.
> >>
> >> Patch 1 removes the unused spin_lock_bh_nested() API.
> >>
> >> Patch 2 introduces the basic realtime queued spinlocks where the
> >> pending byte is used for storing the priority of the highest priority
> >> RT task that is waiting on the spinlock. All the RT tasks will spin
> >> directly on the spinlock instead of waiting in the queue.
> >>
> >
> > OK, so a single numerical field isn't sufficient to describe priority
> > anymore, since we added DEADLINE support things have gotten a lot more
> > complex.
> 
> From what I read from the code, DL tasks all have the same priority that
> is higher than any of the RT tasks. So you mean DL tasks have other
> property that kind of categorizing them into different sub-priorities
> that is not being reflected in their priority level. Is that right?

Correct, primarily their deadline. That is, the scheduling function for
the class picks the task with the earliest deadline.

> > Also, the whole approach worries me, it has the very real possibility of
> > re-introducing a bunch of starvation cases avoided by the fair lock.
> 
> Starvation can happen when there is a constant stream of RT or DL tasks
> grabbing the lock, or when there is an interrupt storm. However I am
> making the assumption that RT systems should have sufficient resource
> available that the RT tasks won't saturate the hardware or we can't have
> RT guarantee in this case.

That only works on UP; on SMP you only need a combined utilization of 1
to completely saturate a lock.
> 
> > Is there a real problem with -RT that inspired these patches?
> 
> I know that in -RT kernel, all the non-raw spinlocks are replaced by
> rtmutex which is a sleeping lock. This can have a real performance
> impact on systems with more than a few cores. The rtmutex isn't fair either.
> 
> Do you think it is better to keep the raw spinlocks fair and only have
> the non-raw spinlocks use the RT version?

I don't get what you're saying here. Are you proposing to replace the
rtmutex with this rtspinlock? That will very fundamentally not work. The
important part of the spinlock -> rtmutex conversion is gaining
preemptibility. Using this rtspinlock loses that and breaks the
entirety of what -rt is about.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 6/7] locking/rtqspinlock: Voluntarily yield CPU when need_sched()
  2017-01-03 18:00 ` [RFC PATCH 6/7] locking/rtqspinlock: Voluntarily yield CPU when need_sched() Waiman Long
  2017-01-04 10:07   ` Boqun Feng
@ 2017-01-05 10:16   ` Daniel Bristot de Oliveira
  1 sibling, 0 replies; 24+ messages in thread
From: Daniel Bristot de Oliveira @ 2017-01-05 10:16 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin
  Cc: linux-kernel

On 01/03/2017 07:00 PM, Waiman Long wrote:
>  For nested spinlock with outer lock acquired, it will
> + * boost its priority to the highest RT priority level to try to acquire the
> + * inner lock, finish up its work, release the locks and reenable preemption.

The DL scheduler makes the definition of "highest RT priority" more
complicated. One may try to move a task to the DL class with a very
close deadline, but a DL task can always be rejected by the admission
test...

-- Daniel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-05  9:44     ` Peter Zijlstra
@ 2017-01-05 15:55       ` Waiman Long
  2017-01-05 16:08         ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2017-01-05 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel,
	Steven Rostedt, Daniel Bristot de Oliveira

On 01/05/2017 04:44 AM, Peter Zijlstra wrote:
> On Wed, Jan 04, 2017 at 10:25:14AM -0500, Waiman Long wrote:
>> On 01/04/2017 07:49 AM, Peter Zijlstra wrote:
>>> On Tue, Jan 03, 2017 at 01:00:23PM -0500, Waiman Long wrote:
>>>> This patchset introduces a new variant of queued spinlocks - the
>>>> realtime queued spinlocks. The purpose of this new variant is to
>>>> support real spinlock in a realtime environment where high priority
>>>> RT tasks should be allowed to complete its work ASAP. This means as
>>>> little waiting time for spinlocks as possible.
>>>>
>>>> Non-RT tasks will wait for spinlocks in the MCS waiting queue as
>>>> usual. RT tasks and interrupts will spin directly on the spinlocks
>>>> and use the priority value in the pending byte to arbitrate who get
>>>> the lock first.
>>>>
>>>> Patch 1 removes the unused spin_lock_bh_nested() API.
>>>>
>>>> Patch 2 introduces the basic realtime queued spinlocks where the
>>>> pending byte is used for storing the priority of the highest priority
>>>> RT task that is waiting on the spinlock. All the RT tasks will spin
>>>> directly on the spinlock instead of waiting in the queue.
>>>>
>>> OK, so a single numerical field isn't sufficient to describe priority
>>> anymore, since we added DEADLINE support things have gotten a lot more
>>> complex.
>> From what I read from the code, DL tasks all have the same priority that
>> is higher than any of the RT tasks. So you mean DL tasks have other
>> property that kind of categorizing them into different sub-priorities
>> that is not being reflected in their priority level. Is that right?
> Correct, primarily their deadline. That is, the scheduling function for
> the class picks the task with the earliest deadline.

OK, I need to rethink how to deal with those DL tasks.

>>> Also, the whole approach worries me, it has the very real possibility of
>>> re-introducing a bunch of starvation cases avoided by the fair lock.
>> Starvation can happen when there is a constant stream of RT or DL tasks
>> grabbing the lock, or when there is an interrupt storm. However I am
>> making the assumption that RT systems should have sufficient resource
>> available that the RT tasks won't saturate the hardware or we can't have
>> RT guarantee in this case.
> That only works on UP, on SMP you only need a combined utilization of 1
> to completely saturate a lock.

An RT task in a spinlock loop won't be able to completely monopolize the
lock because of the small window between unlock and lock in which others
can come in and get the lock. You will need at least 2 RT tasks working
in lockstep to completely own the lock and starve the others.

We could implement some kind of policy that increases the dynamic
priority of a task the longer it waits for the lock, to make sure that
there will be no lock starvation.
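
Purely as an illustration of what such a policy could look like (the
RT_WAIT_BOOST_NS value and the rt_waiter_prio() helper below are made-up
names, not part of the patch set), the priority written into the pending
byte could be scaled with the time already spent waiting:

	/* Sketch only: bump the effective lock priority one level per 100us */
	#define RT_WAIT_BOOST_NS	100000ULL

	static inline int rt_waiter_prio(int base_prio, u64 wait_start)
	{
		u64 boost = (sched_clock() - wait_start) / RT_WAIT_BOOST_NS;

		return min_t(u64, base_prio + boost, MAX_RT_PRIO - 1);
	}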

>>> Is there a real problem with -RT that inspired these patches?
>> I know that in -RT kernel, all the non-raw spinlocks are replaced by
>> rtmutex which is a sleeping lock. This can have a real performance
>> impact on systems with more than a few cores. The rtmutex isn't fair either.
>>
>> Do you think it is better to keep the raw spinlocks fair and only have
>> the non-raw spinlocks use the RT version?
> I don't get what you're saying here. Are you proposing to replace the
> rtmutex with this rtspinlock? That will very fundamentally not work. The
> important part of the conversion of spinlock -> rtmutex is acquiring the
> preemptability. Using this rtspinlock looses that and breaks the
> entirety of what -rt is about.

What I am saying is that we don't need to change spinlocks to rtmutexes
in a -RT kernel. Instead, we can use rtqspinlock for this purpose. All
the sleeping locks will still be converted to rtmutex.

Conversion to rtmutex does allow forced CPU preemption when there is a
need for that. What rtqspinlock can provide is voluntary preemption,
where the lock waiters explicitly yield the CPU while waiting for the
lock. I use need_resched() to detect if CPU yielding is necessary.
However, if the CPU was in a preempt-disabled region before the
spin_lock() call, we can't yield the CPU. The only way is to raise its
priority and try to get the lock ASAP. I still have some work to do in
this area, and I need to figure out how to convey the information about
the priority of the task that is waiting for the CPU.
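
Roughly, the decision a waiter makes looks like this (a sketch of the
idea only, assuming <linux/sched.h> and <linux/preempt.h>; it is not the
code in the patches and rt_wait_policy() is a made-up name):

	static inline int rt_wait_policy(int prio)
	{
		if (need_resched()) {
			if (preempt_count() == PREEMPT_OFFSET) {
				/* only the lock's own preempt_disable() is
				 * active, so it is safe to yield the CPU
				 */
				schedule_preempt_disabled();
			} else {
				/* caller disabled preemption: cannot yield,
				 * spin with at least RT priority 1 instead
				 */
				prio = max(prio, 1);
			}
		}
		return prio;
	}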

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-05 15:55       ` Waiman Long
@ 2017-01-05 16:08         ` Peter Zijlstra
  2017-01-05 17:07           ` Waiman Long
  2017-01-05 18:05           ` Daniel Bristot de Oliveira
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2017-01-05 16:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel,
	Steven Rostedt, Daniel Bristot de Oliveira

On Thu, Jan 05, 2017 at 10:55:55AM -0500, Waiman Long wrote:
> What I am saying that we don't need to change spinlock to rtmutex in a
> -RT kernel. Instead, we can use rtqspinlock for this purpose. All the
> sleeping locks will still be converted to rtmutex.

No-no-no..

> Conversion of rtmutex does allow forced CPU preemption when there is a
> need for that. What rtqspinlock can provide is voluntary preemption
> where the lock waiters explicitly yield the CPU while waiting for the
> lock. I use the need_resched() to detect if CPU yielding is necessary.
> However, if the CPU was in a preempt disabled region before the
> spin_lock() call, we can't yield the CPU. The only way is to raise its
> priority and try to get the lock ASAP.

And here you've lost your finger because the saw-blade didn't stop in
time.

RT very fundamentally relies on the spinlock->rtmutex conversion to
allow preempting things when a higher priority task comes along. A
spinlock, of any kind, requires having preemption disabled while holding
the lock. If the critical section is of unbounded latency, you have
unbounded preemption latency and RT is no more.

It's not about PI on contention, although that helps inversion scenarios.
It's about allowing preemption, which fundamentally requires a sleeping
lock to be used.

Many of the spinlock sections of mainline are not well behaved in an RT
sense and therefore must not disable preemption. Similar for the IRQ
disable regions and hence we have the whole threaded interrupt stuff.

Please stop writing code and read up on things..

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-05 16:08         ` Peter Zijlstra
@ 2017-01-05 17:07           ` Waiman Long
  2017-01-05 18:50             ` Steven Rostedt
  2017-01-05 18:05           ` Daniel Bristot de Oliveira
  1 sibling, 1 reply; 24+ messages in thread
From: Waiman Long @ 2017-01-05 17:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel,
	Steven Rostedt, Daniel Bristot de Oliveira

On 01/05/2017 11:08 AM, Peter Zijlstra wrote:
> On Thu, Jan 05, 2017 at 10:55:55AM -0500, Waiman Long wrote:
>> What I am saying that we don't need to change spinlock to rtmutex in a
>> -RT kernel. Instead, we can use rtqspinlock for this purpose. All the
>> sleeping locks will still be converted to rtmutex.
> No-no-no..
>
>> Conversion of rtmutex does allow forced CPU preemption when there is a
>> need for that. What rtqspinlock can provide is voluntary preemption
>> where the lock waiters explicitly yield the CPU while waiting for the
>> lock. I use the need_resched() to detect if CPU yielding is necessary.
>> However, if the CPU was in a preempt disabled region before the
>> spin_lock() call, we can't yield the CPU. The only way is to raise its
>> priority and try to get the lock ASAP.
> And here you've lost your finger because the saw-blade didn't stop in
> time.

Well, I lose my virtual fingers all the time ;-)

This is one way that I learn and become stronger.

> RT very fundamentally relies on the spinlock->rtmutex conversion to
> allow preempting things when a higher priority task comes along. A
> spinlock, of any kind, requires having preemption disabled while holding
> the lock. If the critical section is of unbounded latency, you have
> unbounded preemption latency and RT is no more.
>
> Its not about PI on contention, although that helps inversion scenarios.
> Its about allowing preemption, which fundamentally requires a sleeping
> lock to be used.
>
> Many of the spinlock sections of mainline are not well behaved in an RT
> sense and therefore must not disable preemption. Similar for the IRQ
> disable regions and hence we have the whole threaded interrupt stuff.

I do make the assumption that spinlock critical sections are behaving
well enough. Apparently, that is not a valid assumption. I sent these
RFC patches out to see if it was an idea worth pursuing. If not, I can
drop these patches. Anyway, thanks for the feedback.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-05 16:08         ` Peter Zijlstra
  2017-01-05 17:07           ` Waiman Long
@ 2017-01-05 18:05           ` Daniel Bristot de Oliveira
  1 sibling, 0 replies; 24+ messages in thread
From: Daniel Bristot de Oliveira @ 2017-01-05 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, linux-kernel,
	Steven Rostedt

On 01/05/2017 05:08 PM, Peter Zijlstra wrote:
> RT very fundamentally relies on the spinlock->rtmutex conversion to
> allow preempting things when a higher priority task comes along. A
> spinlock, of any kind, requires having preemption disabled while holding
> the lock. If the critical section is of unbounded latency, you have
> unbounded preemption latency and RT is no more.
> 
> Its not about PI on contention, although that helps inversion scenarios.
> Its about allowing preemption, which fundamentally requires a sleeping
> lock to be used.

2 cents complementing this:

Spinlock algorithms which disable preemption while holding the lock
resemble a theoretical algorithm named the Immediate Priority Ceiling
Protocol. The immediate priority ceiling protocol avoids unbounded
priority inversion, like the Priority Inheritance Protocol used in the
rt mutex. However, although the Immediate Priority Ceiling Protocol may
help to reduce the response time of tasks, it causes a penalty on the
activation/scheduling delay (latency, in the linux -rt dictionary) of
tasks whose priority is higher than that of the task holding the lock,
but lower than the ceiling priority. As it is not possible to know
beforehand the ceiling priority of a lock on Linux, the implementation
needs to use the highest priority, which is only achievable by disabling
preemption, causing scheduling latency even for the highest-priority
-rt task.

Hence, the penalty of the immediate priority ceiling protocol hits right
at the main metric of PREEMPT_RT: the scheduling latency.

This is an old article:

http://www.jsoftware.us/vol7/jsw0703-03.pdf

for uniprocessors... so it does not completely fit this case, but it
shows some results of using the immediate priority ceiling rather than
PI on the rt-mutex. It shows that the immediate priority ceiling causes
scheduling delays/latency.

This paper mentions the Multicore Priority Ceiling Protocol causing
scheduling latency as well:

https://people.mpi-sws.org/~bbb/papers/pdf/rtlws12.pdf

That is why the RT mutex with PIP is better for PREEMPT_RT than any
protocol that disables preemption.

-- Daniel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-04 20:02       ` Waiman Long
@ 2017-01-05 18:43         ` Steven Rostedt
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2017-01-05 18:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	linux-kernel

On Wed, 4 Jan 2017 15:02:23 -0500
Waiman Long <longman@redhat.com> wrote:

> On 01/04/2017 10:55 AM, Steven Rostedt wrote:
> > On Wed, 4 Jan 2017 10:25:14 -0500
> > Waiman Long <longman@redhat.com> wrote:
> >
> >    
> >> I know that in -RT kernel, all the non-raw spinlocks are replaced by
> >> rtmutex which is a sleeping lock. This can have a real performance
> >> impact on systems with more than a few cores. The rtmutex isn't fair either.  
> > We do fine on 80+ CPUs. Is that enough cores for you ;-)  
> 
> It depends on what you mean fine. I haven't done the actual benchmark
> yet, but I believe that a -RT system with 80+ CPUs will feel noticeably
> slower for non-RT tasks than a system with non-RT kernel. I am not
> saying that the system will slow down significantly, but the slow down
> will certainly be noticeable.

It really matters what the workload is. Yes, java apps will be much
slower, because java runs 1000s of threads, and the real bottleneck is
with the reader-writer locks (the mmap rwsem causes issues here).

But, really, I've done kernel builds on these large boxes running -rt,
and those still do well.

> 
> > Note, it's not a true sleeping lock, because of the adaptive nature.
> > That is, it spins unless the owner of the lock is sleeping, in which
> > case, it too will sleep (why spin waiting for a task that isn't
> > running). But if the owner is running, it will spin too.  
> 
> I think this is a recent change by Davidlohr. However, the spinning is
> limited to the top waiter. The rests will still go to sleep.

No, adaptive spinlocks have been there for some time. They're copyright
2008 by Gregory Haskins, Sven Dietrich and Peter Morreale.

As for the top waiter, that could probably be optimized to not do so.
Good project to try out and see how that speeds things up.

> >
> > Note, I also want to get rt_mutex into the kernel first for all sleeping
> > locks. That is, get the logic in before we convert spin_locks to
> > sleeping locks.

> 
> The rtmutex code is already in the upstream kernel. It is just that
> there isn't any code yet that can let you flip a switch (a config
> option) and change all sleeping locks to use rtmutex.

The pi code is upstream, but it's only used by futexes and maybe even
rcu boosting. But normal kernel mutexes don't have pi.

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-05 17:07           ` Waiman Long
@ 2017-01-05 18:50             ` Steven Rostedt
  2017-01-05 19:24               ` Waiman Long
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2017-01-05 18:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	linux-kernel, Daniel Bristot de Oliveira

On Thu, 5 Jan 2017 12:07:21 -0500
Waiman Long <longman@redhat.com> wrote:


> I do make the assumption that spinlock critical sections are behaving
> well enough. Apparently, that is not a valid assumption. I sent these
> RFC patches out to see if it was an idea worth pursuing. If not, I can
> drop these patches. Anyway, thanks for the feedback.

Yes, the assumption is incorrect. There are places that can hold a spin
lock for several hundred microseconds. If you can't preempt them,
you'll never get below several hundred microseconds of latency.

And it would be hard to pick and choose (we already do this to decide
what can be a raw_spin_lock), because you need to audit all use cases
of a spin_lock as well as all the locks taken while holding that
spin_lock.

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks
  2017-01-05 18:50             ` Steven Rostedt
@ 2017-01-05 19:24               ` Waiman Long
  0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2017-01-05 19:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	linux-kernel, Daniel Bristot de Oliveira

On 01/05/2017 01:50 PM, Steven Rostedt wrote:
> On Thu, 5 Jan 2017 12:07:21 -0500
> Waiman Long <longman@redhat.com> wrote:
>
>
>> I do make the assumption that spinlock critical sections are behaving
>> well enough. Apparently, that is not a valid assumption. I sent these
>> RFC patches out to see if it was an idea worth pursuing. If not, I can
>> drop these patches. Anyway, thanks for the feedback.
> Yes, the assumption is incorrect. There are places that can hold a spin
> lock for several hundreds of microseconds. If you can't preempt them,
> you'll never get below several hundreds of microseconds in latency.
>
> And it would be hard to pick and choose (we already do this to decide
> what can be a raw_spin_lock), because you need to audit all use cases
> of a spin_lock as well as all the locks taken while holding that
> spin_lock.
>
> -- Steve

Thanks for the information.

It has come to my attention that a scalability problem may be present in
the -RT kernel because of the longer wait times on the raw_spin_lock
side as the number of CPUs increases. I will look into this some more to
see if my patch set can help under those circumstances.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-01-05 19:27 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-03 18:00 [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Waiman Long
2017-01-03 18:00 ` [RFC PATCH 1/7] locking/spinlock: Remove the unused spin_lock_bh_nested API Waiman Long
2017-01-03 18:00 ` [RFC PATCH 2/7] locking/rtqspinlock: Introduce realtime queued spinlocks Waiman Long
2017-01-03 18:00 ` [RFC PATCH 3/7] locking/rtqspinlock: Use static RT priority when in interrupt context Waiman Long
2017-01-03 18:00 ` [RFC PATCH 4/7] locking/rtqspinlock: Override spin_lock_nested with special RT variants Waiman Long
2017-01-03 18:00 ` [RFC PATCH 5/7] locking/rtqspinlock: Handle priority boosting Waiman Long
2017-01-03 18:00 ` [RFC PATCH 6/7] locking/rtqspinlock: Voluntarily yield CPU when need_sched() Waiman Long
2017-01-04 10:07   ` Boqun Feng
2017-01-04 21:57     ` Waiman Long
2017-01-05 10:16   ` Daniel Bristot de Oliveira
2017-01-03 18:00 ` [RFC PATCH 7/7] locking/rtqspinlock: Enable collection of event counts Waiman Long
2017-01-04 12:49 ` [RFC PATCH 0/7] locking/rtqspinlock: Realtime queued spinlocks Peter Zijlstra
2017-01-04 15:25   ` Waiman Long
2017-01-04 15:55     ` Steven Rostedt
2017-01-04 20:02       ` Waiman Long
2017-01-05 18:43         ` Steven Rostedt
2017-01-05  9:26     ` Daniel Bristot de Oliveira
2017-01-05  9:44     ` Peter Zijlstra
2017-01-05 15:55       ` Waiman Long
2017-01-05 16:08         ` Peter Zijlstra
2017-01-05 17:07           ` Waiman Long
2017-01-05 18:50             ` Steven Rostedt
2017-01-05 19:24               ` Waiman Long
2017-01-05 18:05           ` Daniel Bristot de Oliveira

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).