From: Peter Zijlstra <peterz@infradead.org>
To: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	Manfred Spraul <manfred@colorfullife.com>,
	Waiman Long <Waiman.Long@hpe.com>, Ingo Molnar <mingo@kernel.org>,
	ggherdovich@suse.com, Mel Gorman <mgorman@techsingularity.net>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Will Deacon <will.deacon@arm.com>
Subject: Re: sem_lock() vs qspinlocks
Date: Sat, 21 May 2016 09:37:39 +0200	[thread overview]
Message-ID: <20160521073739.GB15728@worktop.ger.corp.intel.com>
In-Reply-To: <20160521004839.GA28231@linux-uzut.site>

On Fri, May 20, 2016 at 05:48:39PM -0700, Davidlohr Bueso wrote:
> On Fri, 20 May 2016, Linus Torvalds wrote:
> 
> 
> >Oh, I definitely agree on the stable part, and yes, the "split things
> >up" model should come later if people agree that it's a good thing.
> 
> The backporting part is quite nice, yes, but ultimately I think I prefer
> Linus' suggestion of making things explicit, as opposed to relying on the
> spinlock to imply barriers. I also hate to have an smp_mb() (particularly
> for spin_is_locked) given that we are not optimizing for the common case
> (regular mutual exclusion).

I'm confused; we _are_ optimizing for the common case. spin_is_locked()
is very unlikely to be used. And arguably should be used less in favour
of lockdep_assert_held().
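
To illustrate, a minimal sketch of the preferred style; 'struct foo' and
foo_update() are made-up names for the example:

	static void foo_update(struct foo *f)
	{
		/*
		 * Rather than asserting with spin_is_locked():
		 *
		 *	WARN_ON(!spin_is_locked(&f->lock));
		 *
		 * the lockdep annotation documents the locking rule and
		 * compiles away entirely on non-debug builds:
		 */
		lockdep_assert_held(&f->lock);
		f->counter++;
	}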

> As opposed to spin_is_locked(), spin_unlock_wait() is perhaps more tempting
> to use for locking correctness. For example, taking a look at nf_conntrack_all_lock(),
> it too likes to get smart with spin_unlock_wait() -- also for finer-grained
> locking purposes. While not identical to sems, it goes like:
> 
> nf_conntrack_all_lock():	nf_conntrack_lock():
> spin_lock(B);			spin_lock(A);
> 
> 				if (bar) { // false
> bar = 1;			   ...
> 				}
> [loop ctrl-barrier]
> spin_unlock_wait(A);
> foo();				foo();
> 
> If the spin_unlock_wait() doesn't yet see the store that makes A visibly locked,
> we could end up with both threads in foo(), no? (Although I'm unsure about that
> ctrl barrier and which archs could fall into it. The point was to show in-tree
> examples of creative thinking with locking.)

I'm tempted to put that trailing smp_rmb() in spin_unlock_wait() too,
because I suspect the netfilter code is broken without it.
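
To spell out the suspected breakage, a made-up sketch of the pattern from
the diagram above (not the actual netfilter code; serialize() is
hypothetical):

	spinlock_t A, B;
	int bar;

	void all_lock(void)		/* ~ nf_conntrack_all_lock() */
	{
		spin_lock(&B);
		bar = 1;		/* tell lockers to serialize */
		spin_unlock_wait(&A);	/* may not yet see A's lock  */
					/* store, hence not wait     */
		foo();
	}

	void one_lock(void)		/* ~ nf_conntrack_lock() */
	{
		spin_lock(&A);
		if (bar)		/* may still observe 0       */
			serialize();	/* not taken in the race     */
		foo();
	}

If spin_unlock_wait() returns without having observed A locked, while the
A-holder's load of bar also misses the store, both CPUs run foo()
concurrently.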

And it seems intuitive to assume that if we return from unlock_wait() we
can indeed observe the critical section we waited on.

Something a little like so, but then for _all_ implementations.

---
 include/asm-generic/qspinlock.h |  6 ++++++
 include/linux/compiler.h        | 13 ++++++++-----
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 6bd05700d8c9..2f2eddd3e1f9 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -135,6 +135,12 @@ static inline void queued_spin_unlock_wait(struct qspinlock *lock)
 	smp_mb();
 	while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
 		cpu_relax();
+
+	/*
+	 * Match the RELEASE of the spin_unlock() we just observed. Thereby
+	 * ensuring we observe the whole critical section that ended.
+	 */
+	smp_acquire__after_ctrl_dep();
 }
 
 #ifndef virt_spin_lock
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 793c0829e3a3..3c4bc8160947 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -304,21 +304,24 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 	__u.__val;					\
 })
 
+/*
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. ACQUIRE.
+ */
+#define smp_acquire__after_ctrl_dep()	smp_rmb()
+
 /**
  * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
  * @cond: boolean expression to wait for
  *
  * Equivalent to using smp_load_acquire() on the condition variable but employs
  * the control dependency of the wait to reduce the barrier on many platforms.
- *
- * The control dependency provides a LOAD->STORE order, the additional RMB
- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
- * aka. ACQUIRE.
  */
 #define smp_cond_acquire(cond)	do {		\
 	while (!(cond))				\
 		cpu_relax();			\
-	smp_rmb(); /* ctrl + rmb := acquire */	\
+	smp_acquire__after_ctrl_dep();		\
 } while (0)
 
 #endif /* __KERNEL__ */
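
For reference, the reader/writer pattern this helper serves would look
like so (a sketch; flag, data and r are illustrative):

	/* writer */
	WRITE_ONCE(data, 1);
	smp_store_release(&flag, 1);	/* RELEASE: orders the data store */

	/* reader */
	while (!READ_ONCE(flag))	/* load heading the ctrl dependency */
		cpu_relax();
	smp_acquire__after_ctrl_dep();	/* upgrade ctrl dep to ACQUIRE */
	r = READ_ONCE(data);		/* now guaranteed to observe 1 */

The reader half is exactly what smp_cond_acquire(READ_ONCE(flag)) expands
to after this patch.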

Thread overview: 41+ messages
2016-05-20  5:39 sem_lock() vs qspinlocks Davidlohr Bueso
2016-05-20  7:49 ` Peter Zijlstra
2016-05-20 15:00   ` Davidlohr Bueso
2016-05-20 15:05     ` Peter Zijlstra
2016-05-20 15:25       ` Davidlohr Bueso
2016-05-20 15:28       ` Peter Zijlstra
2016-05-20 20:47     ` Waiman Long
2016-05-20 20:52       ` Peter Zijlstra
2016-05-21  0:59         ` Davidlohr Bueso
2016-05-21  4:01           ` Waiman Long
2016-05-21  7:40             ` Peter Zijlstra
2016-05-20  7:53 ` Peter Zijlstra
2016-05-20  8:13 ` Peter Zijlstra
2016-05-20  8:18   ` Peter Zijlstra
2016-05-20  9:07     ` Giovanni Gherdovich
2016-05-20  9:34       ` Peter Zijlstra
2016-05-20  8:30 ` Peter Zijlstra
2016-05-20  9:00   ` Peter Zijlstra
2016-05-20 10:09     ` Ingo Molnar
2016-05-20 10:45       ` Mel Gorman
2016-05-20 11:58 ` Peter Zijlstra
2016-05-20 14:05   ` Boqun Feng
2016-05-20 15:21     ` Peter Zijlstra
2016-05-20 16:04       ` Peter Zijlstra
2016-05-20 17:00         ` Linus Torvalds
2016-05-20 21:06           ` Peter Zijlstra
2016-05-20 21:44             ` Linus Torvalds
2016-05-21  0:48               ` Davidlohr Bueso
2016-05-21  2:30                 ` Linus Torvalds
2016-05-21  7:37                 ` Peter Zijlstra [this message]
2016-05-21 13:49                   ` Manfred Spraul
2016-05-24 10:57                     ` Peter Zijlstra
2016-05-21 17:14                   ` Davidlohr Bueso
2016-05-23 12:25           ` Peter Zijlstra
2016-05-23 17:52             ` Linus Torvalds
2016-05-25  6:37               ` Boqun Feng
2016-05-22  8:43         ` Manfred Spraul
2016-05-22  9:38           ` Peter Zijlstra
2016-05-20 16:20   ` Davidlohr Bueso
2016-05-20 20:44   ` Waiman Long
2016-05-20 20:53     ` Peter Zijlstra
