* [PATCH memory-model 0/5] Updates to the formal memory model
@ 2018-09-26 18:28 Paul E. McKenney
  2018-09-26 18:29 ` [PATCH memory-model 1/5] tools/memory-model: Add litmus-test naming scheme Paul E. McKenney
                   ` (5 more replies)
  0 siblings, 6 replies; 41+ messages in thread
From: Paul E. McKenney @ 2018-09-26 18:28 UTC (permalink / raw)
  To: linux-kernel, linux-arch, mingo
  Cc: stern, parri.andrea, will.deacon, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks

Hello, Ingo!

This series contains updates to the Linux kernel's formal memory model
in tools/memory-model, along with corresponding changes in documentation
and Linux-kernel code.  These patches are ready for inclusion into -tip.

1.	Document litmus-test naming scheme.

2.	Add extra ordering for locks and remove it for ordinary
	release/acquire, courtesy of Alan Stern.

3.	Fix a README typo, courtesy of SeongJae Park.

4.	Expand the list of LKMM limitations.

5.	Replace smp_cond_acquire() with smp_cond_load_acquire(),
	courtesy of Andrea Parri.

							Thanx, Paul

------------------------------------------------------------------------

 Documentation/memory-barriers.txt                                          |    3 
 tools/memory-model/Documentation/explanation.txt                           |  186 +++++++---
 tools/memory-model/Documentation/recipes.txt                               |    2 
 tools/memory-model/README                                                  |   39 ++
 tools/memory-model/linux-kernel.cat                                        |    8 
 tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus |    7 
 tools/memory-model/litmus-tests/README                                     |  104 +++++
 7 files changed, 293 insertions(+), 56 deletions(-)



* [PATCH memory-model 1/5] tools/memory-model: Add litmus-test naming scheme
  2018-09-26 18:28 [PATCH memory-model 0/5] Updates to the formal memory model Paul E. McKenney
@ 2018-09-26 18:29 ` Paul E. McKenney
  2018-10-02 10:10   ` [tip:locking/core] " tip-bot for Paul E. McKenney
  2018-09-26 18:29 ` [PATCH memory-model 2/5] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire Paul E. McKenney
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2018-09-26 18:29 UTC (permalink / raw)
  To: linux-kernel, linux-arch, mingo
  Cc: stern, parri.andrea, will.deacon, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit documents the scheme used to generate the names for the
litmus tests.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply feedback from Andrea Parri and Will Deacon. ]
Acked-by: Will Deacon <will.deacon@arm.com>
---
 tools/memory-model/litmus-tests/README | 104 ++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/litmus-tests/README b/tools/memory-model/litmus-tests/README
index 4581ec2d3c57..5ee08f129094 100644
--- a/tools/memory-model/litmus-tests/README
+++ b/tools/memory-model/litmus-tests/README
@@ -1,4 +1,6 @@
-This directory contains the following litmus tests:
+============
+LITMUS TESTS
+============
 
 CoRR+poonceonce+Once.litmus
 	Test of read-read coherence, that is, whether or not two
@@ -36,7 +38,7 @@ IRIW+poonceonces+OnceOnce.litmus
 ISA2+pooncelock+pooncelock+pombonce.litmus
 	Tests whether the ordering provided by a lock-protected S
 	litmus test is visible to an external process whose accesses are
-	separated by smp_mb().	This addition of an external process to
+	separated by smp_mb().  This addition of an external process to
 	S is otherwise known as ISA2.
 
 ISA2+poonceonces.litmus
@@ -151,3 +153,101 @@ Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus
 A great many more litmus tests are available here:
 
 	https://github.com/paulmckrcu/litmus
+
+==================
+LITMUS TEST NAMING
+==================
+
+Litmus tests are usually named based on their contents, which means that
+looking at the name tells you what the litmus test does.  The naming
+scheme covers litmus tests having a single cycle that passes through
+each process exactly once, so litmus tests not fitting this description
+are named on an ad-hoc basis.
+
+The structure of a litmus-test name is the litmus-test class, a plus
+sign ("+"), and one string for each process, separated by plus signs.
+The end of the name is ".litmus".
+
+The litmus-test classes may be found in the infamous test6.pdf:
+https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
+Each class defines the pattern of accesses and of the variables accessed.
+For example, if the one process writes to a pair of variables, and
+the other process reads from these same variables, the corresponding
+litmus-test class is "MP" (message passing), which may be found on the
+left-hand end of the second row of tests on page one of test6.pdf.
+
+The strings used to identify the actions carried out by each process are
+complex due to a desire to have short(er) names.  Thus, there is a tool to
+generate these strings from a given litmus test's actions.  For example,
+consider the processes from SB+rfionceonce-poonceonces.litmus:
+
+	P0(int *x, int *y)
+	{
+		int r1;
+		int r2;
+
+		WRITE_ONCE(*x, 1);
+		r1 = READ_ONCE(*x);
+		r2 = READ_ONCE(*y);
+	}
+
+	P1(int *x, int *y)
+	{
+		int r3;
+		int r4;
+
+		WRITE_ONCE(*y, 1);
+		r3 = READ_ONCE(*y);
+		r4 = READ_ONCE(*x);
+	}
+
+The next step is to construct a space-separated list of descriptors,
+interleaving descriptions of the relation between a pair of consecutive
+accesses with descriptions of the second access in the pair.
+
+P0()'s WRITE_ONCE() is read by its first READ_ONCE(), which is a
+reads-from link (rf) and internal to the P0() process.  This is
+"rfi", which is an abbreviation for "reads-from internal".  Because
+some of the tools string these abbreviations together with space
+characters separating processes, the first character is capitalized,
+resulting in "Rfi".
+
+P0()'s second access is a READ_ONCE(), as opposed to (for example)
+smp_load_acquire(), so next is "Once".  Thus far, we have "Rfi Once".
+
+P0()'s third access is also a READ_ONCE(), but to y rather than x.
+This is related to P0()'s second access by program order ("po"),
+to a different variable ("d"), and both accesses are reads ("RR").
+The resulting descriptor is "PodRR".  Because P0()'s third access is
+READ_ONCE(), we add another "Once" descriptor.
+
+A from-read ("fre") relation links P0()'s third to P1()'s first
+access, and the resulting descriptor is "Fre".  P1()'s first access is
+WRITE_ONCE(), which as before gives the descriptor "Once".  The string
+thus far is thus "Rfi Once PodRR Once Fre Once".
+
+The remainder of P1() is similar to P0(), which means we add
+"Rfi Once PodRR Once".  Another fre links P1()'s last access to
+P0()'s first access, which is WRITE_ONCE(), so we add "Fre Once".
+The full string is thus:
+
+	Rfi Once PodRR Once Fre Once Rfi Once PodRR Once Fre Once
+
+This string can be given to the "norm7" and "classify7" tools to
+produce the name:
+
+	$ norm7 -bell linux-kernel.bell \
+		Rfi Once PodRR Once Fre Once Rfi Once PodRR Once Fre Once | \
+	  sed -e 's/:.*//g'
+	SB+rfionceonce-poonceonces
+
+Adding the ".litmus" suffix: SB+rfionceonce-poonceonces.litmus
+
+The descriptors that describe connections between consecutive accesses
+within the cycle through a given litmus test can be provided by the herd
+tool (Rfi, Po, Fre, and so on) or by the linux-kernel.bell file (Once,
+Release, Acquire, and so on).
+
+To see the full list of descriptors, execute the following command:
+
+	$ diyone7 -bell linux-kernel.bell -show edges
-- 
2.17.1
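
As an aside, an MP-class test of the kind described in the README text
above might look roughly like this in litmus form (a sketch only: the
name, variables, and "exists" clause are illustrative, and this is not
one of the shipped tests):

	C MP+sketch

	{}

	P0(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);	/* one process writes a pair of variables... */
		WRITE_ONCE(*y, 1);
	}

	P1(int *x, int *y)
	{
		int r0;
		int r1;

		r0 = READ_ONCE(*y);	/* ...and the other reads them back */
		r1 = READ_ONCE(*x);
	}

	exists (1:r0=1 /\ 1:r1=0)

With nothing but ONCE accesses, the model allows the outcome named in
the "exists" clause; the various MP+ tests in this directory explore
what ordering is needed to forbid it.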



* [PATCH memory-model 2/5] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2018-09-26 18:28 [PATCH memory-model 0/5] Updates to the formal memory model Paul E. McKenney
  2018-09-26 18:29 ` [PATCH memory-model 1/5] tools/memory-model: Add litmus-test naming scheme Paul E. McKenney
@ 2018-09-26 18:29 ` Paul E. McKenney
  2018-10-02 10:11   ` [tip:locking/core] " tip-bot for Alan Stern
  2018-09-26 18:29 ` [PATCH memory-model 3/5] tools/memory-model: Fix a README typo Paul E. McKenney
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2018-09-26 18:29 UTC (permalink / raw)
  To: linux-kernel, linux-arch, mingo
  Cc: stern, parri.andrea, will.deacon, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, Paul E. McKenney

From: Alan Stern <stern@rowland.harvard.edu>

More than one kernel developer has expressed the opinion that the LKMM
should enforce ordering of writes by locking.  In other words, given
the following code:

	WRITE_ONCE(x, 1);
	spin_unlock(&s);
	spin_lock(&s);
	WRITE_ONCE(y, 1);

the stores to x and y should be propagated in order to all other CPUs,
even though those other CPUs might not access the lock s.  In terms of
the memory model, this means expanding the cumul-fence relation.
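
The ISA2+pooncelock+pooncelock+pombonce litmus test updated at the end
of this patch exercises exactly this guarantee.  Roughly speaking (the
process below is only a sketch, not the test's verbatim source), an
external observer such as:

	P2(int *x, int *y)
	{
		int r2;
		int r3;

		r2 = READ_ONCE(*y);
		smp_mb();
		r3 = READ_ONCE(*x);
	}

must never observe the store to y (r2 = 1) while missing the earlier
store to x (r3 = 0).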

Locks should also provide read-read (and read-write) ordering in a
similar way.  Given:

	READ_ONCE(x);
	spin_unlock(&s);
	spin_lock(&s);
	READ_ONCE(y);		// or WRITE_ONCE(y, 1);

the load of x should be executed before the load of (or store to) y.
The LKMM already provides this ordering, but it provides it even in
the case where the two accesses are separated by a release/acquire
pair of fences rather than unlock/lock.  This would prevent
architectures from using weakly ordered implementations of release and
acquire, which seems like an unnecessary restriction.  The patch
therefore removes the ordering requirement from the LKMM for that
case.

There are several arguments both for and against this change.  Let us
refer to these enhanced ordering properties by saying that the LKMM
would require locks to be RCtso (a bit of a misnomer, but analogous to
RCpc and RCsc) and it would require ordinary acquire/release only to
be RCpc.  (Note: In the following, the phrase "all supported
architectures" is meant not to include RISC-V.  Although RISC-V is
indeed supported by the kernel, the implementation is still somewhat
in a state of flux and therefore statements about it would be
premature.)

Pros:

	The kernel already provides RCtso ordering for locks on all
	supported architectures, even though this is not stated
	explicitly anywhere.  Therefore the LKMM should formalize it.

	In theory, guaranteeing RCtso ordering would reduce the need
	for additional barrier-like constructs meant to increase the
	ordering strength of locks.

	Will Deacon and Peter Zijlstra are strongly in favor of
	formalizing the RCtso requirement.  Linus Torvalds and Will
	would like to go even further, requiring locks to have RCsc
	behavior (ordering preceding writes against later reads), but
	they recognize that this would incur a noticeable performance
	degradation on the POWER architecture.  Linus also points out
	that people have made the mistake, in the past, of assuming
	that locking has stronger ordering properties than is
	currently guaranteed, and this change would reduce the
	likelihood of such mistakes.

	Not requiring ordinary acquire/release to be any stronger than
	RCpc may prove advantageous for future architectures, allowing
	them to implement smp_load_acquire() and smp_store_release()
	with more efficient machine instructions than would be
	possible if the operations had to be RCtso.  Will and Linus
	approve this rationale, hypothetical though it is at the
	moment (it may end up affecting the RISC-V implementation).
	The same argument may or may not apply to RMW-acquire/release;
	see also the second Con entry below.

	Linus feels that locks should be easy for people to use
	without worrying about memory consistency issues, since they
	are so pervasive in the kernel, whereas acquire/release is
	much more of an "experts only" tool.  Requiring locks to be
	RCtso is a step in this direction.

Cons:

	Andrea Parri and Luc Maranget think that locks should have the
	same ordering properties as ordinary acquire/release (indeed,
	Luc points out that the names "acquire" and "release" derive
	from the usage of locks).  Andrea points out that having
	different ordering properties for different forms of acquires
	and releases is not only unnecessary, it would also be
	confusing and unmaintainable.

	Locks are constructed from lower-level primitives, typically
	RMW-acquire (for locking) and ordinary release (for unlock).
	It is illogical to require stronger ordering properties from
	the high-level operations than from the low-level operations
	they comprise.  Thus, this change would make

		while (cmpxchg_acquire(&s, 0, 1) != 0)
			cpu_relax();

	an incorrect implementation of spin_lock(&s) as far as the
	LKMM is concerned.  In theory this weakness can be ameliorated
	by changing the LKMM even further, requiring
	RMW-acquire/release also to be RCtso (which it already is on
	all supported architectures).

	As far as I know, nobody has singled out any examples of code
	in the kernel that actually relies on locks being RCtso.
	(People mumble about RCU and the scheduler, but nobody has
	pointed to any actual code.  If there are any real cases,
	their number is likely quite small.)  If RCtso ordering is not
	needed, why require it?

	A handful of locking constructs (qspinlocks, qrwlocks, and
	mcs_spinlocks) are built on top of smp_cond_load_acquire()
	instead of an RMW-acquire instruction.  It currently provides
	only the ordinary acquire semantics, not the stronger ordering
	this patch would require of locks.  In theory this could be
	ameliorated by requiring smp_cond_load_acquire() in
	combination with ordinary release also to be RCtso (which is
	currently true on all supported architectures).

	On future weakly ordered architectures, people may be able to
	implement locks in a non-RCtso fashion with significant
	performance improvement.  Meeting the RCtso requirement would
	necessarily add run-time overhead.

Overall, the technical aspects of these arguments seem relatively
minor, and it appears mostly to boil down to a matter of opinion.
Since the opinions of senior kernel maintainers such as Linus,
Peter, and Will carry more weight than those of Luc and Andrea, this
patch changes the model in accordance with the maintainers' wishes.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 .../Documentation/explanation.txt             | 186 ++++++++++++++----
 tools/memory-model/linux-kernel.cat           |   8 +-
 ...ISA2+pooncelock+pooncelock+pombonce.litmus |   7 +-
 3 files changed, 150 insertions(+), 51 deletions(-)

diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
index 0cbd1ef8f86d..35bff92cc773 100644
--- a/tools/memory-model/Documentation/explanation.txt
+++ b/tools/memory-model/Documentation/explanation.txt
@@ -28,7 +28,8 @@ Explanation of the Linux-Kernel Memory Consistency Model
   20. THE HAPPENS-BEFORE RELATION: hb
   21. THE PROPAGATES-BEFORE RELATION: pb
   22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
-  23. ODDS AND ENDS
+  23. LOCKING
+  24. ODDS AND ENDS
 
 
 
@@ -1067,28 +1068,6 @@ allowing out-of-order writes like this to occur.  The model avoided
 violating the write-write coherence rule by requiring the CPU not to
 send the W write to the memory subsystem at all!)
 
-There is one last example of preserved program order in the LKMM: when
-a load-acquire reads from an earlier store-release.  For example:
-
-	smp_store_release(&x, 123);
-	r1 = smp_load_acquire(&x);
-
-If the smp_load_acquire() ends up obtaining the 123 value that was
-stored by the smp_store_release(), the LKMM says that the load must be
-executed after the store; the store cannot be forwarded to the load.
-This requirement does not arise from the operational model, but it
-yields correct predictions on all architectures supported by the Linux
-kernel, although for differing reasons.
-
-On some architectures, including x86 and ARMv8, it is true that the
-store cannot be forwarded to the load.  On others, including PowerPC
-and ARMv7, smp_store_release() generates object code that starts with
-a fence and smp_load_acquire() generates object code that ends with a
-fence.  The upshot is that even though the store may be forwarded to
-the load, it is still true that any instruction preceding the store
-will be executed before the load or any following instructions, and
-the store will be executed before any instruction following the load.
-
 
 AND THEN THERE WAS ALPHA
 ------------------------
@@ -1766,6 +1745,147 @@ before it does, and the critical section in P2 both starts after P1's
 grace period does and ends after it does.
 
 
+LOCKING
+-------
+
+The LKMM includes locking.  In fact, there is special code for locking
+in the formal model, added in order to make tools run faster.
+However, this special code is intended to be more or less equivalent
+to concepts we have already covered.  A spinlock_t variable is treated
+the same as an int, and spin_lock(&s) is treated almost the same as:
+
+	while (cmpxchg_acquire(&s, 0, 1) != 0)
+		cpu_relax();
+
+This waits until s is equal to 0 and then atomically sets it to 1,
+and the read part of the cmpxchg operation acts as an acquire fence.
+An alternate way to express the same thing would be:
+
+	r = xchg_acquire(&s, 1);
+
+along with a requirement that at the end, r = 0.  Similarly,
+spin_trylock(&s) is treated almost the same as:
+
+	return !cmpxchg_acquire(&s, 0, 1);
+
+which atomically sets s to 1 if it is currently equal to 0 and returns
+true if it succeeds (the read part of the cmpxchg operation acts as an
+acquire fence only if the operation is successful).  spin_unlock(&s)
+is treated almost the same as:
+
+	smp_store_release(&s, 0);
+
+The "almost" qualifiers above need some explanation.  In the LKMM, the
+store-release in a spin_unlock() and the load-acquire which forms the
+first half of the atomic rmw update in a spin_lock() or a successful
+spin_trylock() -- we can call these things lock-releases and
+lock-acquires -- have two properties beyond those of ordinary releases
+and acquires.
+
+First, when a lock-acquire reads from a lock-release, the LKMM
+requires that every instruction po-before the lock-release must
+execute before any instruction po-after the lock-acquire.  This would
+naturally hold if the release and acquire operations were on different
+CPUs, but the LKMM says it holds even when they are on the same CPU.
+For example:
+
+	int x, y;
+	spinlock_t s;
+
+	P0()
+	{
+		int r1, r2;
+
+		spin_lock(&s);
+		r1 = READ_ONCE(x);
+		spin_unlock(&s);
+		spin_lock(&s);
+		r2 = READ_ONCE(y);
+		spin_unlock(&s);
+	}
+
+	P1()
+	{
+		WRITE_ONCE(y, 1);
+		smp_wmb();
+		WRITE_ONCE(x, 1);
+	}
+
+Here the second spin_lock() reads from the first spin_unlock(), and
+therefore the load of x must execute before the load of y.  Thus we
+cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
+MP pattern).
+
+This requirement does not apply to ordinary release and acquire
+fences, only to lock-related operations.  For instance, suppose P0()
+in the example had been written as:
+
+	P0()
+	{
+		int r1, r2, r3;
+
+		r1 = READ_ONCE(x);
+		smp_store_release(&s, 1);
+		r3 = smp_load_acquire(&s);
+		r2 = READ_ONCE(y);
+	}
+
+Then the CPU would be allowed to forward the s = 1 value from the
+smp_store_release() to the smp_load_acquire(), executing the
+instructions in the following order:
+
+		r3 = smp_load_acquire(&s);	// Obtains r3 = 1
+		r2 = READ_ONCE(y);
+		r1 = READ_ONCE(x);
+		smp_store_release(&s, 1);	// Value is forwarded
+
+and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
+
+Second, when a lock-acquire reads from a lock-release, and some other
+stores W and W' occur po-before the lock-release and po-after the
+lock-acquire respectively, the LKMM requires that W must propagate to
+each CPU before W' does.  For example, consider:
+
+	int x, y;
+	spinlock_t s;
+
+	P0()
+	{
+		spin_lock(&s);
+		WRITE_ONCE(x, 1);
+		spin_unlock(&s);
+	}
+
+	P1()
+	{
+		int r1;
+
+		spin_lock(&s);
+		r1 = READ_ONCE(x);
+		WRITE_ONCE(y, 1);
+		spin_unlock(&s);
+	}
+
+	P2()
+	{
+		int r2, r3;
+
+		r2 = READ_ONCE(y);
+		smp_rmb();
+		r3 = READ_ONCE(x);
+	}
+
+If r1 = 1 at the end then the spin_lock() in P1 must have read from
+the spin_unlock() in P0.  Hence the store to x must propagate to P2
+before the store to y does, so we cannot have r2 = 1 and r3 = 0.
+
+These two special requirements for lock-release and lock-acquire do
+not arise from the operational model.  Nevertheless, kernel developers
+have come to expect and rely on them because they do hold on all
+architectures supported by the Linux kernel, albeit for various
+differing reasons.
+
+
 ODDS AND ENDS
 -------------
 
@@ -1831,26 +1951,6 @@ they behave as follows:
 	events and the events preceding them against all po-later
 	events.
 
-The LKMM includes locking.  In fact, there is special code for locking
-in the formal model, added in order to make tools run faster.
-However, this special code is intended to be exactly equivalent to
-concepts we have already covered.  A spinlock_t variable is treated
-the same as an int, and spin_lock(&s) is treated the same as:
-
-	while (cmpxchg_acquire(&s, 0, 1) != 0)
-		cpu_relax();
-
-which waits until s is equal to 0 and then atomically sets it to 1,
-and where the read part of the atomic update is also an acquire fence.
-An alternate way to express the same thing would be:
-
-	r = xchg_acquire(&s, 1);
-
-along with a requirement that at the end, r = 0.  spin_unlock(&s) is
-treated the same as:
-
-	smp_store_release(&s, 0);
-
 Interestingly, RCU and locking each introduce the possibility of
 deadlock.  When faced with code sequences such as:
 
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 59b5cbe6b624..882fc33274ac 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -38,7 +38,7 @@ let strong-fence = mb | gp
 (* Release Acquire *)
 let acq-po = [Acquire] ; po ; [M]
 let po-rel = [M] ; po ; [Release]
-let rfi-rel-acq = [Release] ; rfi ; [Acquire]
+let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
 
 (**********************************)
 (* Fundamental coherence ordering *)
@@ -60,13 +60,13 @@ let dep = addr | data
 let rwdep = (dep | ctrl) ; [W]
 let overwrite = co | fr
 let to-w = rwdep | (overwrite & int)
-let to-r = addr | (dep ; rfi) | rfi-rel-acq
+let to-r = addr | (dep ; rfi)
 let fence = strong-fence | wmb | po-rel | rmb | acq-po
-let ppo = to-r | to-w | fence
+let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
 
 (* Propagation: Ordering from release operations and strong fences. *)
 let A-cumul(r) = rfe? ; r
-let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
+let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-rf-lock-po
 let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
 
 (*
diff --git a/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus b/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
index 0f749e419b34..094d58df7789 100644
--- a/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
+++ b/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
@@ -1,11 +1,10 @@
 C ISA2+pooncelock+pooncelock+pombonce
 
 (*
- * Result: Sometimes
+ * Result: Never
  *
- * This test shows that the ordering provided by a lock-protected S
- * litmus test (P0() and P1()) are not visible to external process P2().
- * This is likely to change soon.
+ * This test shows that write-write ordering provided by locks
+ * (in P0() and P1()) is visible to external process P2().
  *)
 
 {}
-- 
2.17.1
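
As a quick check of the Result change above, the updated litmus test
can be run against the model with herd7 (assuming the herdtools7 suite
is installed and the command is issued from tools/memory-model):

	$ herd7 -conf linux-kernel.cfg \
		litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus

With this patch applied, herd7 should report that the test's "exists"
clause is never satisfied, matching the updated "Result: Never" comment.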



* [PATCH memory-model 3/5] tools/memory-model: Fix a README typo
  2018-09-26 18:28 [PATCH memory-model 0/5] Updates to the formal memory model Paul E. McKenney
  2018-09-26 18:29 ` [PATCH memory-model 1/5] tools/memory-model: Add litmus-test naming scheme Paul E. McKenney
  2018-09-26 18:29 ` [PATCH memory-model 2/5] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire Paul E. McKenney
@ 2018-09-26 18:29 ` Paul E. McKenney
  2018-10-02 10:11   ` [tip:locking/core] " tip-bot for SeongJae Park
  2018-09-26 18:29 ` [PATCH memory-model 4/5] tools/memory-model: Add more LKMM limitations Paul E. McKenney
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2018-09-26 18:29 UTC (permalink / raw)
  To: linux-kernel, linux-arch, mingo
  Cc: stern, parri.andrea, will.deacon, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, SeongJae Park,
	Paul E. McKenney

From: SeongJae Park <sj38.park@gmail.com>

This commit fixes a duplicate-"the" typo in README.

Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
---
 tools/memory-model/Documentation/recipes.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/memory-model/Documentation/recipes.txt b/tools/memory-model/Documentation/recipes.txt
index af72700cc20a..7fe8d7aa3029 100644
--- a/tools/memory-model/Documentation/recipes.txt
+++ b/tools/memory-model/Documentation/recipes.txt
@@ -311,7 +311,7 @@ The smp_wmb() macro orders prior stores against later stores, and the
 smp_rmb() macro orders prior loads against later loads.  Therefore, if
 the final value of r0 is 1, the final value of r1 must also be 1.
 
-The the xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
+The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
 the following write-side code fragment:
 
 	log->l_curr_block -= log->l_logBBsize;
-- 
2.17.1



* [PATCH memory-model 4/5] tools/memory-model: Add more LKMM limitations
  2018-09-26 18:28 [PATCH memory-model 0/5] Updates to the formal memory model Paul E. McKenney
                   ` (2 preceding siblings ...)
  2018-09-26 18:29 ` [PATCH memory-model 3/5] tools/memory-model: Fix a README typo Paul E. McKenney
@ 2018-09-26 18:29 ` Paul E. McKenney
  2018-10-02 10:12   ` [tip:locking/core] " tip-bot for Paul E. McKenney
  2018-09-26 18:29 ` [PATCH memory-model 5/5] doc: Replace smp_cond_acquire() with smp_cond_load_acquire() Paul E. McKenney
  2018-10-02  8:28 ` [PATCH memory-model 0/5] Updates to the formal memory model Ingo Molnar
  5 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2018-09-26 18:29 UTC (permalink / raw)
  To: linux-kernel, linux-arch, mingo
  Cc: stern, parri.andrea, will.deacon, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit adds more detail about compiler optimizations and
not-yet-modeled Linux-kernel APIs.
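
As a rough illustration of the dependency-breaking issue noted in the
README hunk below (the variables here are hypothetical, not kernel
code):

	int r1;

	r1 = READ_ONCE(x);
	if (r1 == 42)
		WRITE_ONCE(y, r1);	/* The compiler may substitute the
					 * constant, emitting WRITE_ONCE(y, 42)
					 * and thereby severing the dependency
					 * on the load of x. */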

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
---
 tools/memory-model/README | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/tools/memory-model/README b/tools/memory-model/README
index ee987ce20aae..acf9077cffaa 100644
--- a/tools/memory-model/README
+++ b/tools/memory-model/README
@@ -171,6 +171,12 @@ The Linux-kernel memory model has the following limitations:
 	particular, the "THE PROGRAM ORDER RELATION: po AND po-loc"
 	and "A WARNING" sections).
 
+	Note that this limitation in turn limits LKMM's ability to
+	accurately model address, control, and data dependencies.
+	For example, if the compiler can deduce the value of some variable
+	carrying a dependency, then the compiler can break that dependency
+	by substituting a constant of that value.
+
 2.	Multiple access sizes for a single variable are not supported,
 	and neither are misaligned or partially overlapping accesses.
 
@@ -190,6 +196,36 @@ The Linux-kernel memory model has the following limitations:
 	However, a substantial amount of support is provided for these
 	operations, as shown in the linux-kernel.def file.
 
+	a.	When rcu_assign_pointer() is passed NULL, the Linux
+		kernel provides no ordering, but LKMM models this
+		case as a store release.
+
+	b.	The "unless" RMW operations are not currently modeled:
+		atomic_long_add_unless(), atomic_add_unless(),
+		atomic_inc_unless_negative(), and
+		atomic_dec_unless_positive().  These can be emulated
+		in litmus tests, for example, by using atomic_cmpxchg().
+
+	c.	The call_rcu() function is not modeled.  It can be
+		emulated in litmus tests by adding another process that
+		invokes synchronize_rcu() and the body of the callback
+		function, with (for example) a release-acquire from
+		the site of the emulated call_rcu() to the beginning
+		of the additional process.
+
+	d.	The rcu_barrier() function is not modeled.  It can be
+		emulated in litmus tests emulating call_rcu() via
+		(for example) a release-acquire from the end of each
+		additional call_rcu() process to the site of the
+		emulated rcu_barrier().
+
+	e.	Sleepable RCU (SRCU) is not modeled.  It can be
+		emulated, but perhaps not simply.
+
+	f.	Reader-writer locking is not modeled.  It can be
+		emulated in litmus tests using atomic read-modify-write
+		operations.
+
 The "herd7" tool has some additional limitations of its own, apart from
 the memory model:
 
@@ -204,3 +240,6 @@ the memory model:
 Some of these limitations may be overcome in the future, but others are
 more likely to be addressed by incorporating the Linux-kernel memory model
 into other tools.
+
+Finally, please note that LKMM is subject to change as hardware, use cases,
+and compilers evolve.
-- 
2.17.1
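
To make the "unless" RMW item above concrete, one possible shape of such
an emulation inside a litmus-test process is sketched here (the variable
names and bounds are made up, this is not one of the shipped tests, and
it assumes herd7's C frontend accepts this form):

	P0(atomic_t *v)
	{
		int old;
		int r0;

		/* Rough stand-in for atomic_add_unless(v, 1, 10): */
		old = atomic_read(v);
		if (old != 10)
			r0 = atomic_cmpxchg(v, old, old + 1);
	}

Unlike the real atomic_add_unless(), this single-shot emulation can fail
if another process changes v between the atomic_read() and the
atomic_cmpxchg(), which litmus tests typically tolerate or filter out
via the "exists" clause.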



* [PATCH memory-model 5/5] doc: Replace smp_cond_acquire() with smp_cond_load_acquire()
  2018-09-26 18:28 [PATCH memory-model 0/5] Updates to the formal memory model Paul E. McKenney
                   ` (3 preceding siblings ...)
  2018-09-26 18:29 ` [PATCH memory-model 4/5] tools/memory-model: Add more LKMM limitations Paul E. McKenney
@ 2018-09-26 18:29 ` Paul E. McKenney
  2018-10-02 10:12   ` [tip:locking/core] locking/memory-barriers: " tip-bot for Andrea Parri
  2018-10-02  8:28 ` [PATCH memory-model 0/5] Updates to the formal memory model Ingo Molnar
  5 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2018-09-26 18:29 UTC (permalink / raw)
  To: linux-kernel, linux-arch, mingo
  Cc: stern, parri.andrea, will.deacon, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, Andrea Parri,
	Paul E. McKenney, Daniel Lustig, Jonathan Corbet

From: Andrea Parri <andrea.parri@amarulasolutions.com>

Amend commit 1f03e8d2919270 ("locking/barriers: Replace smp_cond_acquire()
with smp_cond_load_acquire()") by updating the documentation accordingly.
Also remove some obsolete information related to the implementation.
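
For reference, a minimal sketch of typical smp_cond_load_acquire()
usage (the function and variable names here are hypothetical):

	/* Spin until *flag becomes nonzero; the satisfying load is an ACQUIRE. */
	static int wait_for_flag(int *flag)
	{
		return smp_cond_load_acquire(flag, VAL != 0);
	}

The macro re-reads *flag until the condition, written in terms of VAL
(the most recently loaded value), becomes true; the load that satisfies
the condition provides the ACQUIRE semantics referred to in the hunk
below.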

Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jade Alglave <j.alglave@ucl.ac.uk>
Cc: Luc Maranget <luc.maranget@inria.fr>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Akira Yokosawa <akiyks@gmail.com>
Cc: Daniel Lustig <dlustig@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
---
 Documentation/memory-barriers.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 0d8d7ef131e9..c1d913944ad8 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -471,8 +471,7 @@ And a couple of implicit varieties:
      operations after the ACQUIRE operation will appear to happen after the
      ACQUIRE operation with respect to the other components of the system.
      ACQUIRE operations include LOCK operations and both smp_load_acquire()
-     and smp_cond_acquire() operations. The later builds the necessary ACQUIRE
-     semantics from relying on a control dependency and smp_rmb().
+     and smp_cond_load_acquire() operations.
 
      Memory operations that occur before an ACQUIRE operation may appear to
      happen after it completes.
-- 
2.17.1



* Re: [PATCH memory-model 0/5] Updates to the formal memory model
  2018-09-26 18:28 [PATCH memory-model 0/5] Updates to the formal memory model Paul E. McKenney
                   ` (4 preceding siblings ...)
  2018-09-26 18:29 ` [PATCH memory-model 5/5] doc: Replace smp_cond_acquire() with smp_cond_load_acquire() Paul E. McKenney
@ 2018-10-02  8:28 ` Ingo Molnar
  5 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2018-10-02  8:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, linux-arch, stern, parri.andrea, will.deacon,
	peterz, boqun.feng, npiggin, dhowells, j.alglave, luc.maranget,
	akiyks


* Paul E. McKenney <paulmck@linux.ibm.com> wrote:

> Hello, Ingo!
> 
> This series contains updates to the Linux kernel's formal memory model
> in tools/memory-model, along with corresponding changes in documentation
> and Linux-kernel code.  These patches are ready for inclusion into -tip.
> 
> 1.	Document litmus-test naming scheme.
> 
> 2.	Add extra ordering for locks and remove it for ordinary
> 	release/acquire, courtesy of Alan Stern.
> 
> 3.	Fix a README typo, courtesy of SeongJae Park.
> 
> 4.	Expand the list of LKMM limitations.
> 
> 5.	Replace smp_cond_acquire() with smp_cond_load_acquire(),
> 	courtesy of Andrea Parri.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
>  Documentation/memory-barriers.txt                                          |    3 
>  tools/memory-model/Documentation/explanation.txt                           |  186 +++++++---
>  tools/memory-model/Documentation/recipes.txt                               |    2 
>  tools/memory-model/README                                                  |   39 ++
>  tools/memory-model/linux-kernel.cat                                        |    8 
>  tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus |    7 
>  tools/memory-model/litmus-tests/README                                     |  104 +++++
>  7 files changed, 293 insertions(+), 56 deletions(-)

Applied to tip:locking/core, thanks Paul!

	Ingo


* [tip:locking/core] tools/memory-model: Add litmus-test naming scheme
  2018-09-26 18:29 ` [PATCH memory-model 1/5] tools/memory-model: Add litmus-test naming scheme Paul E. McKenney
@ 2018-10-02 10:10   ` tip-bot for Paul E. McKenney
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Paul E. McKenney @ 2018-10-02 10:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: vincent.weaver, eranian, tglx, paulmck, jolsa,
	alexander.shishkin, acme, linux-kernel, mingo, hpa, torvalds,
	peterz, will.deacon

Commit-ID:  c4f790f244070dbab486805276ba4d1f87a057af
Gitweb:     https://git.kernel.org/tip/c4f790f244070dbab486805276ba4d1f87a057af
Author:     Paul E. McKenney <paulmck@linux.vnet.ibm.com>
AuthorDate: Wed, 26 Sep 2018 11:29:16 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 2 Oct 2018 10:28:00 +0200

tools/memory-model: Add litmus-test naming scheme

This commit documents the scheme used to generate the names for the
litmus tests.

[ paulmck: Apply feedback from Andrea Parri and Will Deacon. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: akiyks@gmail.com
Cc: boqun.feng@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: linux-arch@vger.kernel.org
Cc: luc.maranget@inria.fr
Cc: npiggin@gmail.com
Cc: parri.andrea@gmail.com
Cc: stern@rowland.harvard.edu
Link: http://lkml.kernel.org/r/20180926182920.27644-1-paulmck@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 tools/memory-model/litmus-tests/README | 104 ++++++++++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/litmus-tests/README b/tools/memory-model/litmus-tests/README
index 4581ec2d3c57..5ee08f129094 100644
--- a/tools/memory-model/litmus-tests/README
+++ b/tools/memory-model/litmus-tests/README
@@ -1,4 +1,6 @@
-This directory contains the following litmus tests:
+============
+LITMUS TESTS
+============
 
 CoRR+poonceonce+Once.litmus
 	Test of read-read coherence, that is, whether or not two
@@ -36,7 +38,7 @@ IRIW+poonceonces+OnceOnce.litmus
 ISA2+pooncelock+pooncelock+pombonce.litmus
 	Tests whether the ordering provided by a lock-protected S
 	litmus test is visible to an external process whose accesses are
-	separated by smp_mb().	This addition of an external process to
+	separated by smp_mb().  This addition of an external process to
 	S is otherwise known as ISA2.
 
 ISA2+poonceonces.litmus
@@ -151,3 +153,101 @@ Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus
 A great many more litmus tests are available here:
 
 	https://github.com/paulmckrcu/litmus
+
+==================
+LITMUS TEST NAMING
+==================
+
+Litmus tests are usually named based on their contents, which means that
+looking at the name tells you what the litmus test does.  The naming
+scheme covers litmus tests having a single cycle that passes through
+each process exactly once, so litmus tests not fitting this description
+are named on an ad-hoc basis.
+
+The structure of a litmus-test name is the litmus-test class, a plus
+sign ("+"), and one string for each process, separated by plus signs.
+The end of the name is ".litmus".
+
+The litmus-test classes may be found in the infamous test6.pdf:
+https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
+Each class defines the pattern of accesses and of the variables accessed.
+For example, if the one process writes to a pair of variables, and
+the other process reads from these same variables, the corresponding
+litmus-test class is "MP" (message passing), which may be found on the
+left-hand end of the second row of tests on page one of test6.pdf.
+
+The strings used to identify the actions carried out by each process are
+complex due to a desire to have short(er) names.  Thus, there is a tool to
+generate these strings from a given litmus test's actions.  For example,
+consider the processes from SB+rfionceonce-poonceonces.litmus:
+
+	P0(int *x, int *y)
+	{
+		int r1;
+		int r2;
+
+		WRITE_ONCE(*x, 1);
+		r1 = READ_ONCE(*x);
+		r2 = READ_ONCE(*y);
+	}
+
+	P1(int *x, int *y)
+	{
+		int r3;
+		int r4;
+
+		WRITE_ONCE(*y, 1);
+		r3 = READ_ONCE(*y);
+		r4 = READ_ONCE(*x);
+	}
+
+The next step is to construct a space-separated list of descriptors,
+interleaving descriptions of the relation between a pair of consecutive
+accesses with descriptions of the second access in the pair.
+
+P0()'s WRITE_ONCE() is read by its first READ_ONCE(), which is a
+reads-from link (rf) and internal to the P0() process.  This is
+"rfi", which is an abbreviation for "reads-from internal".  Because
+some of the tools string these abbreviations together with space
+characters separating processes, the first character is capitalized,
+resulting in "Rfi".
+
+P0()'s second access is a READ_ONCE(), as opposed to (for example)
+smp_load_acquire(), so next is "Once".  Thus far, we have "Rfi Once".
+
+P0()'s third access is also a READ_ONCE(), but to y rather than x.
+This is related to P0()'s second access by program order ("po"),
+to a different variable ("d"), and both accesses are reads ("RR").
+The resulting descriptor is "PodRR".  Because P0()'s third access is
+READ_ONCE(), we add another "Once" descriptor.
+
+A from-read ("fre") relation links P0()'s third to P1()'s first
+access, and the resulting descriptor is "Fre".  P1()'s first access is
+WRITE_ONCE(), which as before gives the descriptor "Once".  The string
+thus far is thus "Rfi Once PodRR Once Fre Once".
+
+The remainder of P1() is similar to P0(), which means we add
+"Rfi Once PodRR Once".  Another fre links P1()'s last access to
+P0()'s first access, which is WRITE_ONCE(), so we add "Fre Once".
+The full string is thus:
+
+	Rfi Once PodRR Once Fre Once Rfi Once PodRR Once Fre Once
+
+This string can be given to the "norm7" and "classify7" tools to
+produce the name:
+
+	$ norm7 -bell linux-kernel.bell \
+		Rfi Once PodRR Once Fre Once Rfi Once PodRR Once Fre Once | \
+	  sed -e 's/:.*//g'
+	SB+rfionceonce-poonceonces
+
+Adding the ".litmus" suffix: SB+rfionceonce-poonceonces.litmus
+
+The descriptors that describe connections between consecutive accesses
+within the cycle through a given litmus test can be provided by the herd
+tool (Rfi, Po, Fre, and so on) or by the linux-kernel.bell file (Once,
+Release, Acquire, and so on).
+
+To see the full list of descriptors, execute the following command:
+
+	$ diyone7 -bell linux-kernel.bell -show edges


* [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2018-09-26 18:29 ` [PATCH memory-model 2/5] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire Paul E. McKenney
@ 2018-10-02 10:11   ` tip-bot for Alan Stern
  2021-09-08 11:00     ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: tip-bot for Alan Stern @ 2018-10-02 10:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: paulmck, mingo, tglx, vincent.weaver, stern, alexander.shishkin,
	andrea.parri, hpa, eranian, will.deacon, linux-kernel, acme,
	torvalds, peterz, jolsa

Commit-ID:  6e89e831a90172bc3d34ecbba52af5b9c4a447d1
Gitweb:     https://git.kernel.org/tip/6e89e831a90172bc3d34ecbba52af5b9c4a447d1
Author:     Alan Stern <stern@rowland.harvard.edu>
AuthorDate: Wed, 26 Sep 2018 11:29:17 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 2 Oct 2018 10:28:01 +0200

tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

More than one kernel developer has expressed the opinion that the LKMM
should enforce ordering of writes by locking.  In other words, given
the following code:

	WRITE_ONCE(x, 1);
	spin_unlock(&s);
	spin_lock(&s);
	WRITE_ONCE(y, 1);

the stores to x and y should be propagated in order to all other CPUs,
even though those other CPUs might not access the lock s.  In terms of
the memory model, this means expanding the cumul-fence relation.

Locks should also provide read-read (and read-write) ordering in a
similar way.  Given:

	READ_ONCE(x);
	spin_unlock(&s);
	spin_lock(&s);
	READ_ONCE(y);		// or WRITE_ONCE(y, 1);

the load of x should be executed before the load of (or store to) y.
The LKMM already provides this ordering, but it provides it even in
the case where the two accesses are separated by a release/acquire
pair of fences rather than unlock/lock.  This would prevent
architectures from using weakly ordered implementations of release and
acquire, which seems like an unnecessary restriction.  The patch
therefore removes the ordering requirement from the LKMM for that
case.

There are several arguments both for and against this change.  Let us
refer to these enhanced ordering properties by saying that the LKMM
would require locks to be RCtso (a bit of a misnomer, but analogous to
RCpc and RCsc) and it would require ordinary acquire/release only to
be RCpc.  (Note: In the following, the phrase "all supported
architectures" is meant not to include RISC-V.  Although RISC-V is
indeed supported by the kernel, the implementation is still somewhat
in a state of flux and therefore statements about it would be
premature.)

Pros:

	The kernel already provides RCtso ordering for locks on all
	supported architectures, even though this is not stated
	explicitly anywhere.  Therefore the LKMM should formalize it.

	In theory, guaranteeing RCtso ordering would reduce the need
	for additional barrier-like constructs meant to increase the
	ordering strength of locks.

	Will Deacon and Peter Zijlstra are strongly in favor of
	formalizing the RCtso requirement.  Linus Torvalds and Will
	would like to go even further, requiring locks to have RCsc
	behavior (ordering preceding writes against later reads), but
	they recognize that this would incur a noticeable performance
	degradation on the POWER architecture.  Linus also points out
	that people have made the mistake, in the past, of assuming
	that locking has stronger ordering properties than is
	currently guaranteed, and this change would reduce the
	likelihood of such mistakes.

	Not requiring ordinary acquire/release to be any stronger than
	RCpc may prove advantageous for future architectures, allowing
	them to implement smp_load_acquire() and smp_store_release()
	with more efficient machine instructions than would be
	possible if the operations had to be RCtso.  Will and Linus
	approve this rationale, hypothetical though it is at the
	moment (it may end up affecting the RISC-V implementation).
	The same argument may or may not apply to RMW-acquire/release;
	see also the second Con entry below.

	Linus feels that locks should be easy for people to use
	without worrying about memory consistency issues, since they
	are so pervasive in the kernel, whereas acquire/release is
	much more of an "experts only" tool.  Requiring locks to be
	RCtso is a step in this direction.

Cons:

	Andrea Parri and Luc Maranget think that locks should have the
	same ordering properties as ordinary acquire/release (indeed,
	Luc points out that the names "acquire" and "release" derive
	from the usage of locks).  Andrea points out that having
	different ordering properties for different forms of acquires
	and releases is not only unnecessary, it would also be
	confusing and unmaintainable.

	Locks are constructed from lower-level primitives, typically
	RMW-acquire (for locking) and ordinary release (for unlock).
	It is illogical to require stronger ordering properties from
	the high-level operations than from the low-level operations
	they comprise.  Thus, this change would make

		while (cmpxchg_acquire(&s, 0, 1) != 0)
			cpu_relax();

	an incorrect implementation of spin_lock(&s) as far as the
	LKMM is concerned.  In theory this weakness can be ameliorated
	by changing the LKMM even further, requiring
	RMW-acquire/release also to be RCtso (which it already is on
	all supported architectures).

	As far as I know, nobody has singled out any examples of code
	in the kernel that actually relies on locks being RCtso.
	(People mumble about RCU and the scheduler, but nobody has
	pointed to any actual code.  If there are any real cases,
	their number is likely quite small.)  If RCtso ordering is not
	needed, why require it?

	A handful of locking constructs (qspinlocks, qrwlocks, and
	mcs_spinlocks) are built on top of smp_cond_load_acquire()
	instead of an RMW-acquire instruction.  It currently provides
	only the ordinary acquire semantics, not the stronger ordering
	this patch would require of locks.  In theory this could be
	ameliorated by requiring smp_cond_load_acquire() in
	combination with ordinary release also to be RCtso (which is
	currently true on all supported architectures).

	On future weakly ordered architectures, people may be able to
	implement locks in a non-RCtso fashion with significant
	performance improvement.  Meeting the RCtso requirement would
	necessarily add run-time overhead.

Overall, the technical aspects of these arguments seem relatively
minor, and it appears mostly to boil down to a matter of opinion.
Since the opinions of senior kernel maintainers such as Linus,
Peter, and Will carry more weight than those of Luc and Andrea, this
patch changes the model in accordance with the maintainers' wishes.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: akiyks@gmail.com
Cc: boqun.feng@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: linux-arch@vger.kernel.org
Cc: luc.maranget@inria.fr
Cc: npiggin@gmail.com
Cc: parri.andrea@gmail.com
Link: http://lkml.kernel.org/r/20180926182920.27644-2-paulmck@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 tools/memory-model/Documentation/explanation.txt   | 186 ++++++++++++++++-----
 tools/memory-model/linux-kernel.cat                |   8 +-
 .../ISA2+pooncelock+pooncelock+pombonce.litmus     |   7 +-
 3 files changed, 150 insertions(+), 51 deletions(-)

diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
index 0cbd1ef8f86d..35bff92cc773 100644
--- a/tools/memory-model/Documentation/explanation.txt
+++ b/tools/memory-model/Documentation/explanation.txt
@@ -28,7 +28,8 @@ Explanation of the Linux-Kernel Memory Consistency Model
   20. THE HAPPENS-BEFORE RELATION: hb
   21. THE PROPAGATES-BEFORE RELATION: pb
   22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
-  23. ODDS AND ENDS
+  23. LOCKING
+  24. ODDS AND ENDS
 
 
 
@@ -1067,28 +1068,6 @@ allowing out-of-order writes like this to occur.  The model avoided
 violating the write-write coherence rule by requiring the CPU not to
 send the W write to the memory subsystem at all!)
 
-There is one last example of preserved program order in the LKMM: when
-a load-acquire reads from an earlier store-release.  For example:
-
-	smp_store_release(&x, 123);
-	r1 = smp_load_acquire(&x);
-
-If the smp_load_acquire() ends up obtaining the 123 value that was
-stored by the smp_store_release(), the LKMM says that the load must be
-executed after the store; the store cannot be forwarded to the load.
-This requirement does not arise from the operational model, but it
-yields correct predictions on all architectures supported by the Linux
-kernel, although for differing reasons.
-
-On some architectures, including x86 and ARMv8, it is true that the
-store cannot be forwarded to the load.  On others, including PowerPC
-and ARMv7, smp_store_release() generates object code that starts with
-a fence and smp_load_acquire() generates object code that ends with a
-fence.  The upshot is that even though the store may be forwarded to
-the load, it is still true that any instruction preceding the store
-will be executed before the load or any following instructions, and
-the store will be executed before any instruction following the load.
-
 
 AND THEN THERE WAS ALPHA
 ------------------------
@@ -1766,6 +1745,147 @@ before it does, and the critical section in P2 both starts after P1's
 grace period does and ends after it does.
 
 
+LOCKING
+-------
+
+The LKMM includes locking.  In fact, there is special code for locking
+in the formal model, added in order to make tools run faster.
+However, this special code is intended to be more or less equivalent
+to concepts we have already covered.  A spinlock_t variable is treated
+the same as an int, and spin_lock(&s) is treated almost the same as:
+
+	while (cmpxchg_acquire(&s, 0, 1) != 0)
+		cpu_relax();
+
+This waits until s is equal to 0 and then atomically sets it to 1,
+and the read part of the cmpxchg operation acts as an acquire fence.
+An alternate way to express the same thing would be:
+
+	r = xchg_acquire(&s, 1);
+
+along with a requirement that at the end, r = 0.  Similarly,
+spin_trylock(&s) is treated almost the same as:
+
+	return !cmpxchg_acquire(&s, 0, 1);
+
+which atomically sets s to 1 if it is currently equal to 0 and returns
+true if it succeeds (the read part of the cmpxchg operation acts as an
+acquire fence only if the operation is successful).  spin_unlock(&s)
+is treated almost the same as:
+
+	smp_store_release(&s, 0);
+
+The "almost" qualifiers above need some explanation.  In the LKMM, the
+store-release in a spin_unlock() and the load-acquire which forms the
+first half of the atomic rmw update in a spin_lock() or a successful
+spin_trylock() -- we can call these things lock-releases and
+lock-acquires -- have two properties beyond those of ordinary releases
+and acquires.
+
+First, when a lock-acquire reads from a lock-release, the LKMM
+requires that every instruction po-before the lock-release must
+execute before any instruction po-after the lock-acquire.  This would
+naturally hold if the release and acquire operations were on different
+CPUs, but the LKMM says it holds even when they are on the same CPU.
+For example:
+
+	int x, y;
+	spinlock_t s;
+
+	P0()
+	{
+		int r1, r2;
+
+		spin_lock(&s);
+		r1 = READ_ONCE(x);
+		spin_unlock(&s);
+		spin_lock(&s);
+		r2 = READ_ONCE(y);
+		spin_unlock(&s);
+	}
+
+	P1()
+	{
+		WRITE_ONCE(y, 1);
+		smp_wmb();
+		WRITE_ONCE(x, 1);
+	}
+
+Here the second spin_lock() reads from the first spin_unlock(), and
+therefore the load of x must execute before the load of y.  Thus we
+cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
+MP pattern).
+
+This requirement does not apply to ordinary release and acquire
+fences, only to lock-related operations.  For instance, suppose P0()
+in the example had been written as:
+
+	P0()
+	{
+		int r1, r2, r3;
+
+		r1 = READ_ONCE(x);
+		smp_store_release(&s, 1);
+		r3 = smp_load_acquire(&s);
+		r2 = READ_ONCE(y);
+	}
+
+Then the CPU would be allowed to forward the s = 1 value from the
+smp_store_release() to the smp_load_acquire(), executing the
+instructions in the following order:
+
+		r3 = smp_load_acquire(&s);	// Obtains r3 = 1
+		r2 = READ_ONCE(y);
+		r1 = READ_ONCE(x);
+		smp_store_release(&s, 1);	// Value is forwarded
+
+and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
+
+Second, when a lock-acquire reads from a lock-release, and some other
+stores W and W' occur po-before the lock-release and po-after the
+lock-acquire respectively, the LKMM requires that W must propagate to
+each CPU before W' does.  For example, consider:
+
+	int x, y;
+	spinlock_t s;
+
+	P0()
+	{
+		spin_lock(&s);
+		WRITE_ONCE(x, 1);
+		spin_unlock(&s);
+	}
+
+	P1()
+	{
+		int r1;
+
+		spin_lock(&s);
+		r1 = READ_ONCE(x);
+		WRITE_ONCE(y, 1);
+		spin_unlock(&s);
+	}
+
+	P2()
+	{
+		int r2, r3;
+
+		r2 = READ_ONCE(y);
+		smp_rmb();
+		r3 = READ_ONCE(x);
+	}
+
+If r1 = 1 at the end then the spin_lock() in P1 must have read from
+the spin_unlock() in P0.  Hence the store to x must propagate to P2
+before the store to y does, so we cannot have r2 = 1 and r3 = 0.
+
+These two special requirements for lock-release and lock-acquire do
+not arise from the operational model.  Nevertheless, kernel developers
+have come to expect and rely on them because they do hold on all
+architectures supported by the Linux kernel, albeit for various
+differing reasons.
+
+
 ODDS AND ENDS
 -------------
 
@@ -1831,26 +1951,6 @@ they behave as follows:
 	events and the events preceding them against all po-later
 	events.
 
-The LKMM includes locking.  In fact, there is special code for locking
-in the formal model, added in order to make tools run faster.
-However, this special code is intended to be exactly equivalent to
-concepts we have already covered.  A spinlock_t variable is treated
-the same as an int, and spin_lock(&s) is treated the same as:
-
-	while (cmpxchg_acquire(&s, 0, 1) != 0)
-		cpu_relax();
-
-which waits until s is equal to 0 and then atomically sets it to 1,
-and where the read part of the atomic update is also an acquire fence.
-An alternate way to express the same thing would be:
-
-	r = xchg_acquire(&s, 1);
-
-along with a requirement that at the end, r = 0.  spin_unlock(&s) is
-treated the same as:
-
-	smp_store_release(&s, 0);
-
 Interestingly, RCU and locking each introduce the possibility of
 deadlock.  When faced with code sequences such as:
 
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 59b5cbe6b624..882fc33274ac 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -38,7 +38,7 @@ let strong-fence = mb | gp
 (* Release Acquire *)
 let acq-po = [Acquire] ; po ; [M]
 let po-rel = [M] ; po ; [Release]
-let rfi-rel-acq = [Release] ; rfi ; [Acquire]
+let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
 
 (**********************************)
 (* Fundamental coherence ordering *)
@@ -60,13 +60,13 @@ let dep = addr | data
 let rwdep = (dep | ctrl) ; [W]
 let overwrite = co | fr
 let to-w = rwdep | (overwrite & int)
-let to-r = addr | (dep ; rfi) | rfi-rel-acq
+let to-r = addr | (dep ; rfi)
 let fence = strong-fence | wmb | po-rel | rmb | acq-po
-let ppo = to-r | to-w | fence
+let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
 
 (* Propagation: Ordering from release operations and strong fences. *)
 let A-cumul(r) = rfe? ; r
-let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
+let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-rf-lock-po
 let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
 
 (*
diff --git a/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus b/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
index 0f749e419b34..094d58df7789 100644
--- a/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
+++ b/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
@@ -1,11 +1,10 @@
 C ISA2+pooncelock+pooncelock+pombonce
 
 (*
- * Result: Sometimes
+ * Result: Never
  *
- * This test shows that the ordering provided by a lock-protected S
- * litmus test (P0() and P1()) are not visible to external process P2().
- * This is likely to change soon.
+ * This test shows that write-write ordering provided by locks
+ * (in P0() and P1()) is visible to external process P2().
  *)
 
 {}

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] tools/memory-model: Fix a README typo
  2018-09-26 18:29 ` [PATCH memory-model 3/5] tools/memory-model: Fix a README typo Paul E. McKenney
@ 2018-10-02 10:11   ` tip-bot for SeongJae Park
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for SeongJae Park @ 2018-10-02 10:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: vincent.weaver, alexander.shishkin, paulmck, acme, peterz, tglx,
	torvalds, linux-kernel, eranian, mingo, hpa, sj38.park, jolsa,
	stern

Commit-ID:  3d2046a6fa2106584cf1080c2c08a6e8e79cbbb4
Gitweb:     https://git.kernel.org/tip/3d2046a6fa2106584cf1080c2c08a6e8e79cbbb4
Author:     SeongJae Park <sj38.park@gmail.com>
AuthorDate: Wed, 26 Sep 2018 11:29:18 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 2 Oct 2018 10:28:03 +0200

tools/memory-model: Fix a README typo

This commit fixes a duplicate-"the" typo in README.

Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: akiyks@gmail.com
Cc: boqun.feng@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: linux-arch@vger.kernel.org
Cc: luc.maranget@inria.fr
Cc: npiggin@gmail.com
Cc: parri.andrea@gmail.com
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/20180926182920.27644-3-paulmck@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 tools/memory-model/Documentation/recipes.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/memory-model/Documentation/recipes.txt b/tools/memory-model/Documentation/recipes.txt
index af72700cc20a..7fe8d7aa3029 100644
--- a/tools/memory-model/Documentation/recipes.txt
+++ b/tools/memory-model/Documentation/recipes.txt
@@ -311,7 +311,7 @@ The smp_wmb() macro orders prior stores against later stores, and the
 smp_rmb() macro orders prior loads against later loads.  Therefore, if
 the final value of r0 is 1, the final value of r1 must also be 1.
 
-The the xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
+The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
 the following write-side code fragment:
 
 	log->l_curr_block -= log->l_logBBsize;

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] tools/memory-model: Add more LKMM limitations
  2018-09-26 18:29 ` [PATCH memory-model 4/5] tools/memory-model: Add more LKMM limitations Paul E. McKenney
@ 2018-10-02 10:12   ` tip-bot for Paul E. McKenney
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Paul E. McKenney @ 2018-10-02 10:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, torvalds, eranian, acme, mingo, andrea.parri,
	jolsa, tglx, paulmck, vincent.weaver, alexander.shishkin, hpa,
	peterz

Commit-ID:  d8fa25c4efde0e5f31a427202e583d73d3f021c4
Gitweb:     https://git.kernel.org/tip/d8fa25c4efde0e5f31a427202e583d73d3f021c4
Author:     Paul E. McKenney <paulmck@linux.vnet.ibm.com>
AuthorDate: Wed, 26 Sep 2018 11:29:19 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 2 Oct 2018 10:28:04 +0200

tools/memory-model: Add more LKMM limitations

This commit adds more detail about compiler optimizations and
not-yet-modeled Linux-kernel APIs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: akiyks@gmail.com
Cc: boqun.feng@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: linux-arch@vger.kernel.org
Cc: luc.maranget@inria.fr
Cc: npiggin@gmail.com
Cc: parri.andrea@gmail.com
Cc: stern@rowland.harvard.edu
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/20180926182920.27644-4-paulmck@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 tools/memory-model/README | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/tools/memory-model/README b/tools/memory-model/README
index ee987ce20aae..acf9077cffaa 100644
--- a/tools/memory-model/README
+++ b/tools/memory-model/README
@@ -171,6 +171,12 @@ The Linux-kernel memory model has the following limitations:
 	particular, the "THE PROGRAM ORDER RELATION: po AND po-loc"
 	and "A WARNING" sections).
 
+	Note that this limitation in turn limits LKMM's ability to
+	accurately model address, control, and data dependencies.
+	For example, if the compiler can deduce the value of some variable
+	carrying a dependency, then the compiler can break that dependency
+	by substituting a constant of that value.
+
 2.	Multiple access sizes for a single variable are not supported,
 	and neither are misaligned or partially overlapping accesses.
 
@@ -190,6 +196,36 @@ The Linux-kernel memory model has the following limitations:
 	However, a substantial amount of support is provided for these
 	operations, as shown in the linux-kernel.def file.
 
+	a.	When rcu_assign_pointer() is passed NULL, the Linux
+		kernel provides no ordering, but LKMM models this
+		case as a store release.
+
+	b.	The "unless" RMW operations are not currently modeled:
+		atomic_long_add_unless(), atomic_add_unless(),
+		atomic_inc_unless_negative(), and
+		atomic_dec_unless_positive().  These can be emulated
+		in litmus tests, for example, by using atomic_cmpxchg().
+
+	c.	The call_rcu() function is not modeled.  It can be
+		emulated in litmus tests by adding another process that
+		invokes synchronize_rcu() and the body of the callback
+		function, with (for example) a release-acquire from
+		the site of the emulated call_rcu() to the beginning
+		of the additional process.
+
+	d.	The rcu_barrier() function is not modeled.  It can be
+		emulated in litmus tests emulating call_rcu() via
+		(for example) a release-acquire from the end of each
+		additional call_rcu() process to the site of the
+		emulated rcu_barrier().
+
+	e.	Sleepable RCU (SRCU) is not modeled.  It can be
+		emulated, but perhaps not simply.
+
+	f.	Reader-writer locking is not modeled.  It can be
+		emulated in litmus tests using atomic read-modify-write
+		operations.
+
 The "herd7" tool has some additional limitations of its own, apart from
 the memory model:
 
@@ -204,3 +240,6 @@ the memory model:
 Some of these limitations may be overcome in the future, but others are
 more likely to be addressed by incorporating the Linux-kernel memory model
 into other tools.
+
+Finally, please note that LKMM is subject to change as hardware, use cases,
+and compilers evolve.

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [tip:locking/core] locking/memory-barriers: Replace smp_cond_acquire() with smp_cond_load_acquire()
  2018-09-26 18:29 ` [PATCH memory-model 5/5] doc: Replace smp_cond_acquire() with smp_cond_load_acquire() Paul E. McKenney
@ 2018-10-02 10:12   ` tip-bot for Andrea Parri
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Andrea Parri @ 2018-10-02 10:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: eranian, hpa, acme, npiggin, dlustig, linux-kernel, paulmck,
	luc.maranget, boqun.feng, alexander.shishkin, vincent.weaver,
	andrea.parri, akiyks, jolsa, peterz, dhowells, tglx, will.deacon,
	mingo, j.alglave, corbet, stern, torvalds

Commit-ID:  2f359c7ea554ba9cb78a52c82bedff351cdabd2b
Gitweb:     https://git.kernel.org/tip/2f359c7ea554ba9cb78a52c82bedff351cdabd2b
Author:     Andrea Parri <andrea.parri@amarulasolutions.com>
AuthorDate: Wed, 26 Sep 2018 11:29:20 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 2 Oct 2018 10:28:05 +0200

locking/memory-barriers: Replace smp_cond_acquire() with smp_cond_load_acquire()

Amend the changes in commit:

  1f03e8d2919270 ("locking/barriers: Replace smp_cond_acquire() with smp_cond_load_acquire()")

... by updating the documentation accordingly.

Also remove some obsolete information related to the implementation.

Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Cc: Akira Yokosawa <akiyks@gmail.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Daniel Lustig <dlustig@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jade Alglave <j.alglave@ucl.ac.uk>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luc Maranget <luc.maranget@inria.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-arch@vger.kernel.org
Cc: parri.andrea@gmail.com
Link: http://lkml.kernel.org/r/20180926182920.27644-5-paulmck@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/memory-barriers.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 0d8d7ef131e9..c1d913944ad8 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -471,8 +471,7 @@ And a couple of implicit varieties:
      operations after the ACQUIRE operation will appear to happen after the
      ACQUIRE operation with respect to the other components of the system.
      ACQUIRE operations include LOCK operations and both smp_load_acquire()
-     and smp_cond_acquire() operations. The later builds the necessary ACQUIRE
-     semantics from relying on a control dependency and smp_rmb().
+     and smp_cond_load_acquire() operations.
 
      Memory operations that occur before an ACQUIRE operation may appear to
      happen after it completes.

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2018-10-02 10:11   ` [tip:locking/core] " tip-bot for Alan Stern
@ 2021-09-08 11:00     ` Peter Zijlstra
  2021-09-08 11:44       ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-08 11:00 UTC (permalink / raw)
  To: stern, alexander.shishkin, hpa, andrea.parri, mingo, paulmck,
	vincent.weaver, tglx, jolsa, acme, torvalds, linux-kernel,
	eranian, will
  Cc: linux-tip-commits

On Tue, Oct 02, 2018 at 03:11:10AM -0700, tip-bot for Alan Stern wrote:
> Commit-ID:  6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> Gitweb:     https://git.kernel.org/tip/6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> Author:     Alan Stern <stern@rowland.harvard.edu>
> AuthorDate: Wed, 26 Sep 2018 11:29:17 -0700
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Tue, 2 Oct 2018 10:28:01 +0200
> 
> tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
> 
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by locking.  In other words, given
> the following code:
> 
> 	WRITE_ONCE(x, 1);
> 	spin_unlock(&s):
> 	spin_lock(&s);
> 	WRITE_ONCE(y, 1);
> 
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s.  In terms of
> the memory model, this means expanding the cumul-fence relation.

Let me revive this old thread... recently we ran into the case:

	WRITE_ONCE(x, 1);
	spin_unlock(&s);
	spin_lock(&r);
	WRITE_ONCE(y, 1);

which is distinct from the original in that UNLOCK and LOCK are not on
the same variable.

I'm arguing this should still be RCtso by reason of:

  spin_lock() requires an atomic-acquire which:

    TSO-arch)		implies smp_mb()
    ARM64)		is RCsc for any stlr/ldar
    Power)		LWSYNC
    Risc-V)		fence r , rw
    *)			explicitly has smp_mb()


However, Boqun points out that the memory model disagrees, per:

  https://lkml.kernel.org/r/YTI2UjKy+C7LeIf+@boqun-archlinux

Is this an error/oversight of the memory model, or did I miss a subtlety
somewhere?
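
For reference, a minimal litmus test of the shape in question might look
like the following (the name, the observer process and the exists clause
are illustrative, not taken from the original report):

	C unlock-lock-different-locks

	{}

	P0(int *x, int *y, spinlock_t *s, spinlock_t *r)
	{
		spin_lock(s);
		WRITE_ONCE(*x, 1);
		spin_unlock(s);
		spin_lock(r);
		WRITE_ONCE(*y, 1);
		spin_unlock(r);
	}

	P1(int *x, int *y)
	{
		int r1;
		int r2;

		r1 = READ_ONCE(*y);
		smp_rmb();
		r2 = READ_ONCE(*x);
	}

	exists (1:r1=1 /\ 1:r2=0)

With the po-unlock-rf-lock-po relation introduced by the commit quoted
above, the lock-acquire has to read from the lock-release, which cannot
happen when the two locks differ, so nothing in the model forbids this
outcome -- which appears to be the disagreement Boqun pointed out.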

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-08 11:00     ` Peter Zijlstra
@ 2021-09-08 11:44       ` Peter Zijlstra
  2021-09-08 14:42         ` Alan Stern
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-08 11:44 UTC (permalink / raw)
  To: stern, alexander.shishkin, hpa, parri.andrea, mingo, paulmck,
	vincent.weaver, tglx, jolsa, acme, torvalds, linux-kernel,
	eranian, will
  Cc: linux-tip-commits

On Wed, Sep 08, 2021 at 01:00:26PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 02, 2018 at 03:11:10AM -0700, tip-bot for Alan Stern wrote:
> > Commit-ID:  6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> > Gitweb:     https://git.kernel.org/tip/6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> > Author:     Alan Stern <stern@rowland.harvard.edu>
> > AuthorDate: Wed, 26 Sep 2018 11:29:17 -0700
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Tue, 2 Oct 2018 10:28:01 +0200
> > 
> > tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
> > 
> > More than one kernel developer has expressed the opinion that the LKMM
> > should enforce ordering of writes by locking.  In other words, given
> > the following code:
> > 
> > 	WRITE_ONCE(x, 1);
> > 	spin_unlock(&s):
> > 	spin_lock(&s);
> > 	WRITE_ONCE(y, 1);
> > 
> > the stores to x and y should be propagated in order to all other CPUs,
> > even though those other CPUs might not access the lock s.  In terms of
> > the memory model, this means expanding the cumul-fence relation.
> 
> Let me revive this old thread... recently we ran into the case:
> 
> 	WRITE_ONCE(x, 1);
> 	spin_unlock(&s):
> 	spin_lock(&r);
> 	WRITE_ONCE(y, 1);
> 
> which is distinct from the original in that UNLOCK and LOCK are not on
> the same variable.
> 
> I'm arguing this should still be RCtso by reason of:
> 
>   spin_lock() requires an atomic-acquire which:
> 
>     TSO-arch)		implies smp_mb()
>     ARM64)		is RCsc for any stlr/ldar
>     Power)		LWSYNC
>     Risc-V)		fence r , rw
>     *)			explicitly has smp_mb()
> 
> 
> However, Boqun points out that the memory model disagrees, per:
> 
>   https://lkml.kernel.org/r/YTI2UjKy+C7LeIf+@boqun-archlinux
> 
> Is this an error/oversight of the memory model, or did I miss a subtlety
> somewhere?

Hmm.. that argument isn't strong enough for Risc-V if I read that FENCE
thing right. That's just R->RW ordering, which doesn't constrain the
first WRITE_ONCE().

However, that spin_unlock has "fence rw, w" with a subsequent write. So
the whole thing then becomes something like:


	WRITE_ONCE(x, 1);
	FENCE RW, W
	WRITE_ONCE(s.lock, 0);
	AMOSWAP %0, 1, r.lock
	FENCE R, WR
	WRITE_ONCE(y, 1);


Which I think is still sufficient, irrespective of the whole s!=r thing.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-08 11:44       ` Peter Zijlstra
@ 2021-09-08 14:42         ` Alan Stern
  2021-09-08 15:12           ` Peter Zijlstra
  2021-09-08 16:08           ` Linus Torvalds
  0 siblings, 2 replies; 41+ messages in thread
From: Alan Stern @ 2021-09-08 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: alexander.shishkin, hpa, parri.andrea, mingo, paulmck,
	vincent.weaver, tglx, jolsa, acme, torvalds, linux-kernel,
	eranian, will, linux-tip-commits

On Wed, Sep 08, 2021 at 01:44:11PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 08, 2021 at 01:00:26PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 02, 2018 at 03:11:10AM -0700, tip-bot for Alan Stern wrote:
> > > Commit-ID:  6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> > > Gitweb:     https://git.kernel.org/tip/6e89e831a90172bc3d34ecbba52af5b9c4a447d1
> > > Author:     Alan Stern <stern@rowland.harvard.edu>
> > > AuthorDate: Wed, 26 Sep 2018 11:29:17 -0700
> > > Committer:  Ingo Molnar <mingo@kernel.org>
> > > CommitDate: Tue, 2 Oct 2018 10:28:01 +0200
> > > 
> > > tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
> > > 
> > > More than one kernel developer has expressed the opinion that the LKMM
> > > should enforce ordering of writes by locking.  In other words, given
> > > the following code:
> > > 
> > > 	WRITE_ONCE(x, 1);
> > > 	spin_unlock(&s):
> > > 	spin_lock(&s);
> > > 	WRITE_ONCE(y, 1);
> > > 
> > > the stores to x and y should be propagated in order to all other CPUs,
> > > even though those other CPUs might not access the lock s.  In terms of
> > > the memory model, this means expanding the cumul-fence relation.
> > 
> > Let me revive this old thread... recently we ran into the case:
> > 
> > 	WRITE_ONCE(x, 1);
> > 	spin_unlock(&s):
> > 	spin_lock(&r);
> > 	WRITE_ONCE(y, 1);
> > 
> > which is distinct from the original in that UNLOCK and LOCK are not on
> > the same variable.
> > 
> > I'm arguing this should still be RCtso by reason of:
> > 
> >   spin_lock() requires an atomic-acquire which:
> > 
> >     TSO-arch)		implies smp_mb()
> >     ARM64)		is RCsc for any stlr/ldar
> >     Power)		LWSYNC
> >     Risc-V)		fence r , rw
> >     *)			explicitly has smp_mb()
> > 
> > 
> > However, Boqun points out that the memory model disagrees, per:
> > 
> >   https://lkml.kernel.org/r/YTI2UjKy+C7LeIf+@boqun-archlinux
> > 
> > Is this an error/oversight of the memory model, or did I miss a subtlety
> > somewhere?

There's the question of what we think the LKMM should do in principle, and 
the question of how far it should go in mirroring the limitations of the 
various kernel hardware implementations.  These are obviously separate 
questions, but they both should influence the design of the memory model.  
But to what extent?

Given:

	spin_lock(&r);
	WRITE_ONCE(x, 1);
	spin_unlock(&r);
	spin_lock(&s);
	WRITE_ONCE(y, 1);
	spin_unlock(&s);

there is no reason _in theory_ why a CPU shouldn't reorder and interleave 
the operations to get:

	spin_lock(&r);
	spin_lock(&s);
	WRITE_ONCE(x, 1);
	WRITE_ONCE(y, 1);
	spin_unlock(&r);
	spin_unlock(&s);

(Of course, this wouldn't happen if some other CPU was holding the s lock 
while waiting for r to be released.  In that case the spin loop for s above 
wouldn't be able to end until after the unlock operation on r was complete, 
so this reordering couldn't occur.  But if there was no such contention then 
the reordering is possible in theory -- ignoring restrictions imposed by the 
actual implementations of the operations.)

Given such a reordering, nothing will prevent other CPUs from observing the 
write to y before the write to x.

> Hmm.. that argument isn't strong enough for Risc-V if I read that FENCE
> thing right. That's just R->RW ordering, which doesn't constrain the
> first WRITE_ONCE().
> 
> However, that spin_unlock has "fence rw, w" with a subsequent write. So
> the whole thing then becomes something like:
> 
> 
> 	WRITE_ONCE(x, 1);
> 	FENCE RW, W
> 	WRITE_ONCE(s.lock, 0);
> 	AMOSWAP %0, 1, r.lock
> 	FENCE R, WR
> 	WRITE_ONCE(y, 1);
> 
> 
> Which I think is still sufficient, irrespective of the whole s!=r thing.

To me, this argument feels like an artificial, unintended consequence of the 
individual implementations, not something that should be considered a 
systematic architectural requirement.  Perhaps one could say the same thing 
about the case where the two spinlock_t variables are the same, but at least 
in that case there is a good argument for inherent ordering of atomic 
accesses to a single variable.

Alan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-08 14:42         ` Alan Stern
@ 2021-09-08 15:12           ` Peter Zijlstra
  2021-09-08 16:08           ` Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-08 15:12 UTC (permalink / raw)
  To: Alan Stern
  Cc: alexander.shishkin, hpa, parri.andrea, mingo, paulmck,
	vincent.weaver, tglx, jolsa, acme, torvalds, linux-kernel,
	eranian, will, linux-tip-commits

On Wed, Sep 08, 2021 at 10:42:17AM -0400, Alan Stern wrote:
> On Wed, Sep 08, 2021 at 01:44:11PM +0200, Peter Zijlstra wrote:

> > > Is this an error/oversight of the memory model, or did I miss a subtlety
> > > somewhere?
> 
> There's the question of what we think the LKMM should do in principle, and 
> the question of how far it should go in mirroring the limitations of the 
> various kernel hardware implementations.  These are obviously separate 
> questions, but they both should influence the design of the memory model.  
> But to what extent?
> 
> Given:
> 
> 	spin_lock(&r);
> 	WRITE_ONCE(x, 1);
> 	spin_unlock(&r);
> 	spin_lock(&s);
> 	WRITE_ONCE(y, 1);
> 	spin_unlock(&s);
> 
> there is no reason _in theory_ why a CPU shouldn't reorder and interleave 
> the operations to get:
> 
> 	spin_lock(&r);
> 	spin_lock(&s);
> 	WRITE_ONCE(x, 1);
> 	WRITE_ONCE(y, 1);
> 	spin_unlock(&r);
> 	spin_unlock(&s);
> 
> (Of course, this wouldn't happen if some other CPU was holding the s lock 
> while waiting for r to be released.  In that case the spin loop for s above 
> wouldn't be able to end until after the unlock operation on r was complete, 
> so this reordering couldn't occur.  But if there was no such contention then 
> the reordering is possible in theory -- ignoring restrictions imposed by the 
> actual implementations of the operations.)
> 
> Given such a reordering, nothing will prevent other CPUs from observing the 
> write to y before the write to x.

To a very small degree the Risc-V implementation actually does some of
that. It allows the stores from unlock and lock to be observed out of
order. But in general we have very weak rules about where the store of
the lock is visible in any case.

(revisit the spin_is_locked() saga for more details there)

> > Hmm.. that argument isn't strong enough for Risc-V if I read that FENCE
> > thing right. That's just R->RW ordering, which doesn't constrain the
> > first WRITE_ONCE().
> > 
> > However, that spin_unlock has "fence rw, w" with a subsequent write. So
> > the whole thing then becomes something like:
> > 
> > 
> > 	WRITE_ONCE(x, 1);
> > 	FENCE RW, W
> > 	WRITE_ONCE(s.lock, 0);
> > 	AMOSWAP %0, 1, r.lock
> > 	FENCE R, WR
> > 	WRITE_ONCE(y, 1);
> > 
> > 
> > Which I think is still sufficient, irrespective of the whole s!=r thing.
> 
> To me, this argument feels like an artificial, unintended consequence of the 
> individual implementations, not something that should be considered a 
> systematic architectural requirement.  Perhaps one could say the same thing 
> about the case where the two spinlock_t variables are the same, but at least 
> in that case there is a good argument for inherent ordering of atomic 
> accesses to a single variable.

Possibly :-) The way I got here is that my brain seems to have produced
the rule that UNLOCK+LOCK -> TSO order (an improvement, because for a
time it said SC), and it completely forgot about this subtlety. And in
general I feel that less subtlety is more better, but I understand your
counter argument :/

In any case, it looks like we had to put an smp_mb() in there anyway due
to other reasons and the whole argument is moot again.

I'll try and remember for next time :-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-08 14:42         ` Alan Stern
  2021-09-08 15:12           ` Peter Zijlstra
@ 2021-09-08 16:08           ` Linus Torvalds
  2021-09-09  7:25             ` Peter Zijlstra
  1 sibling, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2021-09-08 16:08 UTC (permalink / raw)
  To: Alan Stern
  Cc: Peter Zijlstra, Alexander Shishkin, Peter Anvin, Andrea Parri,
	Ingo Molnar, Paul E. McKenney, Vince Weaver, Thomas Gleixner,
	Jiri Olsa, Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, Will Deacon, linux-tip-commits

On Wed, Sep 8, 2021 at 7:42 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> there is no reason _in theory_ why a CPU shouldn't reorder and interleave
> the operations to get:

I agree about the theory part.

But I think the LKMM should be the strongest ordering that is reasonable.

And it should take common architecture behavior into account.

IOW, if there is some rare architecture where the above can happen,
but no common sane one allows it in practice, we should strive to make
the LKMM the _stronger_ one.

We sure as hell shouldn't say "RISC-V is potentially very weakly
ordered, so we'll allow that weak ordering".

Because overly weak ordering only causes problems for others. And the
performance arguments for it have historically been garbage anyway.
See the pain powerpc goes through because of bad ordering (and even
more so alpha), and see how arm actually strengthened their ordering
to make everybody happier.

So if this is purely a RISC-V thing, then I think it's entirely reasonable to

        spin_unlock(&r);
        spin_lock(&s);

cannot be reordered.

Strict specifications are not a bad thing, and weak memory ordering is
not inherently good.

             Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-08 16:08           ` Linus Torvalds
@ 2021-09-09  7:25             ` Peter Zijlstra
  2021-09-09 13:35               ` Will Deacon
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-09  7:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Stern, Alexander Shishkin, Peter Anvin, Andrea Parri,
	Ingo Molnar, Paul E. McKenney, Vince Weaver, Thomas Gleixner,
	Jiri Olsa, Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, Will Deacon, linux-tip-commits

On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:

> So if this is purely a RISC-V thing,

Just to clarify, I think the current RISC-V thing is stronger than
PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
ordering is still somewhat hazy to me.

Specifically, the sequence:

	/* critical section s */
	WRITE_ONCE(x, 1);
	FENCE RW, W
	WRITE_ONCE(s.lock, 0);		/* store S */
	AMOSWAP %0, 1, r.lock		/* store R */
	FENCE R, RW
	WRITE_ONCE(y, 1);
	/* critical section r */

fully separates section s from section r, as in RW->RW ordering
(possibly not as strong as smp_mb() though), while on PowerPC it would
only impose TSO ordering between sections.

The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
similarly it matches the R from the R->RW fence, yielding an:

	RW->  W
	    RmW
	    R  ->RW

ordering. It's the stores S and R that can be re-ordered, but not the
sections themselves (same on PowerPC and many others).

Clarification from a RISC-V enabled person would be appreciated.

> then I think it's entirely reasonable to
> 
>         spin_unlock(&r);
>         spin_lock(&s);
> 
> cannot be reordered.

I'm obviously completely in favour of that :-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-09  7:25             ` Peter Zijlstra
@ 2021-09-09 13:35               ` Will Deacon
  2021-09-09 17:02                 ` Linus Torvalds
                                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Will Deacon @ 2021-09-09 13:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Alan Stern, Alexander Shishkin, Peter Anvin,
	Andrea Parri, Ingo Molnar, Paul E. McKenney, Vince Weaver,
	Thomas Gleixner, Jiri Olsa, Arnaldo Carvalho de Melo,
	Linux Kernel Mailing List, Stephane Eranian, linux-tip-commits,
	palmer, paul.walmsley, dlustig, mpe

[+Palmer, PaulW, Daniel and Michael]

On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> 
> > So if this is purely a RISC-V thing,
> 
> Just to clarify, I think the current RISC-V thing is stonger than
> PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
> ordering is still somewhat hazy to me.
> 
> Specifically, the sequence:
> 
> 	/* critical section s */
> 	WRITE_ONCE(x, 1);
> 	FENCE RW, W
> 	WRITE_ONCE(s.lock, 0);		/* store S */
> 	AMOSWAP %0, 1, r.lock		/* store R */
> 	FENCE R, RW
> 	WRITE_ONCE(y, 1);
> 	/* critical section r */
> 
> fully separates section s from section r, as in RW->RW ordering
> (possibly not as strong as smp_mb() though), while on PowerPC it would
> only impose TSO ordering between sections.
> 
> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> similarly it marches the R from the R->RW fence, yielding an:
> 
> 	RW->  W
> 	    RmW
> 	    R  ->RW
> 
> ordering. It's the stores S and R that can be re-ordered, but not the
> sections themselves (same on PowerPC and many others).
> 
> Clarification from a RISC-V enabled person would be appreciated.
> 
> > then I think it's entirely reasonable to
> > 
> >         spin_unlock(&r);
> >         spin_lock(&s);
> > 
> > cannot be reordered.
> 
> I'm obviously completely in favour of that :-)

I don't think we should require the accesses to the actual lockwords to
be ordered here, as it becomes pretty onerous for relaxed LL/SC
architectures where you'd end up with an extra barrier either after the
unlock() or before the lock() operation. However, I remain absolutely in
favour of strengthening the ordering of the _critical sections_ guarded by
the locks to be RCsc.

Last time this came up, I think the RISC-V folks were generally happy to
implement whatever was necessary for Linux [1]. The thing that was stopping
us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
Michael saw quite a bit of variety in the impact on benchmarks [2] across
different machines. So the question is whether newer Power machines are less
affected to the degree that we could consider making this change again.

Will

[1] https://lore.kernel.org/lkml/11b27d32-4a8a-3f84-0f25-723095ef1076@nvidia.com/
[2] https://lore.kernel.org/lkml/87tvp3xonl.fsf@concordia.ellerman.id.au/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-09 13:35               ` Will Deacon
@ 2021-09-09 17:02                 ` Linus Torvalds
  2021-09-09 18:59                   ` Alan Stern
  2021-09-09 17:03                 ` Dan Lustig
  2021-09-09 17:46                 ` Paul E. McKenney
  2 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2021-09-09 17:02 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Alan Stern, Alexander Shishkin, Peter Anvin,
	Andrea Parri, Ingo Molnar, Paul E. McKenney, Vince Weaver,
	Thomas Gleixner, Jiri Olsa, Arnaldo Carvalho de Melo,
	Linux Kernel Mailing List, Stephane Eranian, linux-tip-commits,
	Palmer Dabbelt, Paul Walmsley, Daniel Lustig, Michael Ellerman

On Thu, Sep 9, 2021 at 6:35 AM Will Deacon <will@kernel.org> wrote:
>
> I don't think we should require the accesses to the actual lockwords to
> be ordered here, as it becomes pretty onerous for relaxed LL/SC
> architectures where you'd end up with an extra barrier either after the
> unlock() or before the lock() operation. However, I remain absolutely in
> favour of strengthening the ordering of the _critical sections_ guarded by
> the locks to be RCsc.

Ack. The actual locking operations themselves can obviously overlap,
it's what they protect that should be ordered if at all possible.

Because anything else will be too confusing for words, and if we have
to add memory barriers *and* locking we're just screwed.

Because I think it is entirely understandable for people to expect
a sequence of two locked regions to be ordered wrt each other.

While memory ordering is subtle and confusing, we should strive to
make our "..but I used locks" case as straightforward and as
understandable as at all reasonable to people who really really
don't want to even think about memory ordering.

I think we should have a very strong reason for accepting unordered
locked regions (with "strong reason" being defined as "on this
architecture that is hugely important, anything else would slow down
locks enormously").

It sounds like no such architecture exists, much less is important.

              Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-09 13:35               ` Will Deacon
  2021-09-09 17:02                 ` Linus Torvalds
@ 2021-09-09 17:03                 ` Dan Lustig
  2021-09-09 18:00                   ` Paul E. McKenney
  2021-09-10  0:01                   ` Boqun Feng
  2021-09-09 17:46                 ` Paul E. McKenney
  2 siblings, 2 replies; 41+ messages in thread
From: Dan Lustig @ 2021-09-09 17:03 UTC (permalink / raw)
  To: Will Deacon, Peter Zijlstra
  Cc: Linus Torvalds, Alan Stern, Alexander Shishkin, Peter Anvin,
	Andrea Parri, Ingo Molnar, Paul E. McKenney, Vince Weaver,
	Thomas Gleixner, Jiri Olsa, Arnaldo Carvalho de Melo,
	Linux Kernel Mailing List, Stephane Eranian, linux-tip-commits,
	palmer, paul.walmsley, mpe

On 9/9/2021 9:35 AM, Will Deacon wrote:
> [+Palmer, PaulW, Daniel and Michael]
> 
> On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>> On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
>>
>>> So if this is purely a RISC-V thing,
>>
>> Just to clarify, I think the current RISC-V thing is stonger than
>> PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
>> ordering is still somewhat hazy to me.
>>
>> Specifically, the sequence:
>>
>> 	/* critical section s */
>> 	WRITE_ONCE(x, 1);
>> 	FENCE RW, W
>> 	WRITE_ONCE(s.lock, 0);		/* store S */
>> 	AMOSWAP %0, 1, r.lock		/* store R */
>> 	FENCE R, RW
>> 	WRITE_ONCE(y, 1);
>> 	/* critical section r */
>>
>> fully separates section s from section r, as in RW->RW ordering
>> (possibly not as strong as smp_mb() though), while on PowerPC it would
>> only impose TSO ordering between sections.
>>
>> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
>> similarly it marches the R from the R->RW fence, yielding an:
>>
>> 	RW->  W
>> 	    RmW
>> 	    R  ->RW
>>
>> ordering. It's the stores S and R that can be re-ordered, but not the
>> sections themselves (same on PowerPC and many others).
>>
>> Clarification from a RISC-V enabled person would be appreciated.

To first order, RISC-V's memory model is very similar to ARMv8's.  It
is "other-multi-copy-atomic", unlike Power, and respects dependencies.
It also has AMOs and LR/SC with optional RCsc acquire or release
semantics.  There's no need to worry about RISC-V somehow pushing the
boundaries of weak memory ordering in new ways.

The tricky part is that unlike ARMv8, RISC-V doesn't have load-acquire
or store-release opcodes at all.  Only AMOs and LR/SC have acquire or
release options.  That means that while certain operations like swap
can be implemented with native RCsc semantics, others like store-release
have to fall back on fences and plain writes.
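
As a rough illustration of that asymmetry (a sketch only, not the
kernel's actual arch/riscv implementation; a0 is assumed to hold the
address of the lock word), a test-and-set style lock can carry the
acquire annotation on the AMO itself, whereas the unlock has to be
built from a fence and a plain store:

	# spin_lock(): the RMW itself carries the RCsc acquire annotation
		li		t1, 1
	1:	amoswap.w.aq	t0, t1, (a0)	# t0 <- old value, lock word <- 1
		bnez		t0, 1b		# retry while it was already held

	# spin_unlock(): no store-release opcode, so fence + plain store
		fence		rw, w
		sw		zero, 0(a0)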

That's where the complexity came up last time this was discussed, at
least as it relates to RISC-V: how to make sure the combination of RCsc
atomics and plain operations+fences gives the semantics everyone is
asking for here.  And to be clear there, I'm not asking for LKMM to
weaken anything about critical section ordering just for RISC-V's sake.
TSO/RCsc ordering between critical sections is a perfectly reasonable
model in my opinion.  I just want to make sure RISC-V gets it right
given whatever the decision is.

>>> then I think it's entirely reasonable to
>>>
>>>         spin_unlock(&r);
>>>         spin_lock(&s);
>>>
>>> cannot be reordered.
>>
>> I'm obviously completely in favour of that :-)
> 
> I don't think we should require the accesses to the actual lockwords to
> be ordered here, as it becomes pretty onerous for relaxed LL/SC
> architectures where you'd end up with an extra barrier either after the
> unlock() or before the lock() operation. However, I remain absolutely in
> favour of strengthening the ordering of the _critical sections_ guarded by
> the locks to be RCsc.

I agree with Will here.  If the AMOSWAP above is actually implemented with
a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
as Peter described.  If instead it's implemented using LR/SC, then RISC-V
gives only TSO (R->R, R->W, W->W), because the two pieces of the AMO are
split, and that breaks the chain.  Getting full RW->RW between the critical
sections would therefore require an extra fence.  Also, the accesses to the
lockwords themselves would not be ordered without an extra fence.

> Last time this came up, I think the RISC-V folks were generally happy to
> implement whatever was necessary for Linux [1]. The thing that was stopping
> us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> Michael saw quite a bit of variety in the impact on benchmarks [2] across
> different machines. So the question is whether newer Power machines are less
> affected to the degree that we could consider making this change again.

Yes, as I said above, RISC-V will implement what is needed to make this work.

Dan

> Will
> 
> [1] https://lore.kernel.org/lkml/11b27d32-4a8a-3f84-0f25-723095ef1076@nvidia.com/
> [2] https://lore.kernel.org/lkml/87tvp3xonl.fsf@concordia.ellerman.id.au/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-09 13:35               ` Will Deacon
  2021-09-09 17:02                 ` Linus Torvalds
  2021-09-09 17:03                 ` Dan Lustig
@ 2021-09-09 17:46                 ` Paul E. McKenney
  2021-09-10 11:08                   ` Will Deacon
  2 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2021-09-09 17:46 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linus Torvalds, Alan Stern, Alexander Shishkin,
	Peter Anvin, Andrea Parri, Ingo Molnar, Vince Weaver,
	Thomas Gleixner, Jiri Olsa, Arnaldo Carvalho de Melo,
	Linux Kernel Mailing List, Stephane Eranian, linux-tip-commits,
	palmer, paul.walmsley, dlustig, mpe, npiggin

[+ Nick Piggin]

On Thu, Sep 09, 2021 at 02:35:36PM +0100, Will Deacon wrote:
> [+Palmer, PaulW, Daniel and Michael]
> 
> On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> > 
> > > So if this is purely a RISC-V thing,
> > 
> > Just to clarify, I think the current RISC-V thing is stonger than
> > PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
> > ordering is still somewhat hazy to me.
> > 
> > Specifically, the sequence:
> > 
> > 	/* critical section s */
> > 	WRITE_ONCE(x, 1);
> > 	FENCE RW, W
> > 	WRITE_ONCE(s.lock, 0);		/* store S */
> > 	AMOSWAP %0, 1, r.lock		/* store R */
> > 	FENCE R, RW
> > 	WRITE_ONCE(y, 1);
> > 	/* critical section r */
> > 
> > fully separates section s from section r, as in RW->RW ordering
> > (possibly not as strong as smp_mb() though), while on PowerPC it would
> > only impose TSO ordering between sections.
> > 
> > The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> > similarly it marches the R from the R->RW fence, yielding an:
> > 
> > 	RW->  W
> > 	    RmW
> > 	    R  ->RW
> > 
> > ordering. It's the stores S and R that can be re-ordered, but not the
> > sections themselves (same on PowerPC and many others).
> > 
> > Clarification from a RISC-V enabled person would be appreciated.
> > 
> > > then I think it's entirely reasonable to
> > > 
> > >         spin_unlock(&r);
> > >         spin_lock(&s);
> > > 
> > > cannot be reordered.
> > 
> > I'm obviously completely in favour of that :-)
> 
> I don't think we should require the accesses to the actual lockwords to
> be ordered here, as it becomes pretty onerous for relaxed LL/SC
> architectures where you'd end up with an extra barrier either after the
> unlock() or before the lock() operation. However, I remain absolutely in
> favour of strengthening the ordering of the _critical sections_ guarded by
> the locks to be RCsc.

If by this you mean the critical sections when observed only by other
critical sections for a given lock, then everyone is already there.

However...

> Last time this came up, I think the RISC-V folks were generally happy to
> implement whatever was necessary for Linux [1]. The thing that was stopping
> us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> Michael saw quite a bit of variety in the impact on benchmarks [2] across
> different machines. So the question is whether newer Power machines are less
> affected to the degree that we could consider making this change again.

Last I knew, on Power a pair of critical sections for a given lock could
be observed out of order (writes from the earlier critical section vs.
reads from the later critical section), but only by CPUs not holding
that lock.  Also last I knew, tightening this would require upgrading
some of the locking primitives' lwsync instructions to sync instructions.
But I know very little about Power 10.

Adding Nick on CC for his thoughts.

							Thanx, Paul

> Will
> 
> [1] https://lore.kernel.org/lkml/11b27d32-4a8a-3f84-0f25-723095ef1076@nvidia.com/
> [2] https://lore.kernel.org/lkml/87tvp3xonl.fsf@concordia.ellerman.id.au/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-09 17:03                 ` Dan Lustig
@ 2021-09-09 18:00                   ` Paul E. McKenney
  2021-09-10 14:20                     ` Boqun Feng
  2021-09-10  0:01                   ` Boqun Feng
  1 sibling, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2021-09-09 18:00 UTC (permalink / raw)
  To: Dan Lustig
  Cc: Will Deacon, Peter Zijlstra, Linus Torvalds, Alan Stern,
	Alexander Shishkin, Peter Anvin, Andrea Parri, Ingo Molnar,
	Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> On 9/9/2021 9:35 AM, Will Deacon wrote:
> > [+Palmer, PaulW, Daniel and Michael]
> > 
> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> >> On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> >>
> >>> So if this is purely a RISC-V thing,
> >>
> >> Just to clarify, I think the current RISC-V thing is stonger than
> >> PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
> >> ordering is still somewhat hazy to me.
> >>
> >> Specifically, the sequence:
> >>
> >> 	/* critical section s */
> >> 	WRITE_ONCE(x, 1);
> >> 	FENCE RW, W
> >> 	WRITE_ONCE(s.lock, 0);		/* store S */
> >> 	AMOSWAP %0, 1, r.lock		/* store R */
> >> 	FENCE R, RW
> >> 	WRITE_ONCE(y, 1);
> >> 	/* critical section r */
> >>
> >> fully separates section s from section r, as in RW->RW ordering
> >> (possibly not as strong as smp_mb() though), while on PowerPC it would
> >> only impose TSO ordering between sections.
> >>
> >> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> >> similarly it marches the R from the R->RW fence, yielding an:
> >>
> >> 	RW->  W
> >> 	    RmW
> >> 	    R  ->RW
> >>
> >> ordering. It's the stores S and R that can be re-ordered, but not the
> >> sections themselves (same on PowerPC and many others).
> >>
> >> Clarification from a RISC-V enabled person would be appreciated.
> 
> To first order, RISC-V's memory model is very similar to ARMv8's.  It
> is "other-multi-copy-atomic", unlike Power, and respects dependencies.
> It also has AMOs and LR/SC with optional RCsc acquire or release
> semantics.  There's no need to worry about RISC-V somehow pushing the
> boundaries of weak memory ordering in new ways.
> 
> The tricky part is that unlike ARMv8, RISC-V doesn't have load-acquire
> or store-release opcodes at all.  Only AMOs and LR/SC have acquire or
> release options.  That means that while certain operations like swap
> can be implemented with native RCsc semantics, others like store-release
> have to fall back on fences and plain writes.
> 
> That's where the complexity came up last time this was discussed, at
> least as it relates to RISC-V: how to make sure the combination of RCsc
> atomics and plain operations+fences gives the semantics everyone is
> asking for here.  And to be clear there, I'm not asking for LKMM to
> weaken anything about critical section ordering just for RISC-V's sake.
> TSO/RCsc ordering between critical sections is a perfectly reasonable
> model in my opinion.  I just want to make sure RISC-V gets it right
> given whatever the decision is.
> 
> >>> then I think it's entirely reasonable to
> >>>
> >>>         spin_unlock(&r);
> >>>         spin_lock(&s);
> >>>
> >>> cannot be reordered.
> >>
> >> I'm obviously completely in favour of that :-)
> > 
> > I don't think we should require the accesses to the actual lockwords to
> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
> > architectures where you'd end up with an extra barrier either after the
> > unlock() or before the lock() operation. However, I remain absolutely in
> > favour of strengthening the ordering of the _critical sections_ guarded by
> > the locks to be RCsc.
> 
> I agree with Will here.  If the AMOSWAP above is actually implemented with
> a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> as Peter described.  If instead it's implemented using LR/SC, then RISC-V
> gives only TSO (R->R, R->W, W->W), because the two pieces of the AMO are
> split, and that breaks the chain.  Getting full RW->RW between the critical
> sections would therefore require an extra fence.  Also, the accesses to the
> lockwords themselves would not be ordered without an extra fence.
> 
> > Last time this came up, I think the RISC-V folks were generally happy to
> > implement whatever was necessary for Linux [1]. The thing that was stopping
> > us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> > Michael saw quite a bit of variety in the impact on benchmarks [2] across
> > different machines. So the question is whether newer Power machines are less
> > affected to the degree that we could consider making this change again.
> 
> Yes, as I said above, RISC-V will implement what is needed to make this work.

Boqun, I vaguely remember a suggested change from you along these lines,
but now I cannot find it.  Could you please send it as a formal patch
if you have not already done so or point me at it if you have?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-09 17:02                 ` Linus Torvalds
@ 2021-09-09 18:59                   ` Alan Stern
  0 siblings, 0 replies; 41+ messages in thread
From: Alan Stern @ 2021-09-09 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Peter Zijlstra, Alexander Shishkin, Peter Anvin,
	Andrea Parri, Ingo Molnar, Paul E. McKenney, Vince Weaver,
	Thomas Gleixner, Jiri Olsa, Arnaldo Carvalho de Melo,
	Linux Kernel Mailing List, Stephane Eranian, linux-tip-commits,
	Palmer Dabbelt, Paul Walmsley, Daniel Lustig, Michael Ellerman

On Thu, Sep 09, 2021 at 10:02:13AM -0700, Linus Torvalds wrote:
> On Thu, Sep 9, 2021 at 6:35 AM Will Deacon <will@kernel.org> wrote:
> >
> > I don't think we should require the accesses to the actual lockwords to
> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
> > architectures where you'd end up with an extra barrier either after the
> > unlock() or before the lock() operation. However, I remain absolutely in
> > favour of strengthening the ordering of the _critical sections_ guarded by
> > the locks to be RCsc.
> 
> Ack. The actual locking operations themselves can obviously overlap,
> it's what they protect that should be ordered if at all possible.
> 
> Because anything else will be too confusing for words, and if we have
> to add memory barriers *and* locking we're just screwed.
> 
> Because I think it is entirely understandable for people to expect
> that sequence of two locked regions to be ordered wrt each other.
> 
> While memory ordering is subtle and confusing, we should strive to
> make our "..but I used locks" to be as straightforward and as
> understandable to people who really really don't want to even think
> about memory order as at all reasonable.
> 
> I think we should have a very strong reason for accepting unordered
> locked regions (with "strong reason" being defined as "on this
> architecture that is hugely important, anything else would slow down
> locks enormously").
> 
> It sounds like no such architecture exists, much less is important.

All right.

Currently the memory model has RCtso ordering of accesses in separate 
critical sections for the same lock, as observed by other CPUs not 
holding the lock.  Peter is proposing (and Linus agrees) to expand this 
to cover critical sections for different locks, so long as both critical 
sections are on the same CPU.  But he didn't propose strengthening the 
ordering to RCsc, and I presume we don't want to do this because of the 
slowdown it would incur on Power.

Does everyone agree that this correctly summarizes the change to be made 
to the memory model?
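
For concreteness, one way such an extension could be expressed in
linux-kernel.cat -- purely an illustrative sketch, not a tested or
proposed patch -- is to relax the current same-lock relation so that a
lock-acquire that is merely po-after the lock-release on the same CPU
also qualifies:

	(* today: the lock-acquire must read from the lock-release *)
	let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po

	(* sketch: also accept a same-CPU acquire that is po-after *)
	let po-unlock-lock-po = po ; [UL] ; (po | rf) ; [LKR] ; po

with po-unlock-lock-po then taking the place of po-unlock-rf-lock-po in
the ppo and cumul-fence definitions.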

Alan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-09 17:03                 ` Dan Lustig
  2021-09-09 18:00                   ` Paul E. McKenney
@ 2021-09-10  0:01                   ` Boqun Feng
  2021-09-10  5:37                     ` Boqun Feng
  2021-09-10  9:33                     ` Peter Zijlstra
  1 sibling, 2 replies; 41+ messages in thread
From: Boqun Feng @ 2021-09-10  0:01 UTC (permalink / raw)
  To: Dan Lustig
  Cc: Will Deacon, Peter Zijlstra, Linus Torvalds, Alan Stern,
	Alexander Shishkin, Peter Anvin, Andrea Parri, Ingo Molnar,
	Paul E. McKenney, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> On 9/9/2021 9:35 AM, Will Deacon wrote:
> > [+Palmer, PaulW, Daniel and Michael]
> > 
> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> >> On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> >>
> >>> So if this is purely a RISC-V thing,
> >>
> >> Just to clarify, I think the current RISC-V thing is stonger than
> >> PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
> >> ordering is still somewhat hazy to me.
> >>
> >> Specifically, the sequence:
> >>
> >> 	/* critical section s */
> >> 	WRITE_ONCE(x, 1);
> >> 	FENCE RW, W
> >> 	WRITE_ONCE(s.lock, 0);		/* store S */
> >> 	AMOSWAP %0, 1, r.lock		/* store R */
> >> 	FENCE R, RW
> >> 	WRITE_ONCE(y, 1);
> >> 	/* critical section r */
> >>
> >> fully separates section s from section r, as in RW->RW ordering
> >> (possibly not as strong as smp_mb() though), while on PowerPC it would
> >> only impose TSO ordering between sections.
> >>
> >> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> >> similarly it marches the R from the R->RW fence, yielding an:
> >>
> >> 	RW->  W
> >> 	    RmW
> >> 	    R  ->RW
> >>
> >> ordering. It's the stores S and R that can be re-ordered, but not the
> >> sections themselves (same on PowerPC and many others).
> >>
> >> Clarification from a RISC-V enabled person would be appreciated.
> 
> To first order, RISC-V's memory model is very similar to ARMv8's.  It
> is "other-multi-copy-atomic", unlike Power, and respects dependencies.
> It also has AMOs and LR/SC with optional RCsc acquire or release
> semantics.  There's no need to worry about RISC-V somehow pushing the
> boundaries of weak memory ordering in new ways.
> 
> The tricky part is that unlike ARMv8, RISC-V doesn't have load-acquire
> or store-release opcodes at all.  Only AMOs and LR/SC have acquire or
> release options.  That means that while certain operations like swap
> can be implemented with native RCsc semantics, others like store-release
> have to fall back on fences and plain writes.
> 
> That's where the complexity came up last time this was discussed, at
> least as it relates to RISC-V: how to make sure the combination of RCsc
> atomics and plain operations+fences gives the semantics everyone is
> asking for here.  And to be clear there, I'm not asking for LKMM to
> weaken anything about critical section ordering just for RISC-V's sake.
> TSO/RCsc ordering between critical sections is a perfectly reasonable
> model in my opinion.  I just want to make sure RISC-V gets it right
> given whatever the decision is.
> 
> >>> then I think it's entirely reasonable to
> >>>
> >>>         spin_unlock(&r);
> >>>         spin_lock(&s);
> >>>
> >>> cannot be reordered.
> >>
> >> I'm obviously completely in favour of that :-)
> > 
> > I don't think we should require the accesses to the actual lockwords to
> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
> > architectures where you'd end up with an extra barrier either after the
> > unlock() or before the lock() operation. However, I remain absolutely in
> > favour of strengthening the ordering of the _critical sections_ guarded by
> > the locks to be RCsc.
> 
> I agree with Will here.  If the AMOSWAP above is actually implemented with
> a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> as Peter described.  If instead it's implemented using LR/SC, then RISC-V

Just out of curiosity, in the following code, can the store S and load L
be reordered?

	WRITE_ONCE(x, 1); // store S
	FENCE RW, W
 	WRITE_ONCE(s.lock, 0); // unlock(s)
 	AMOSWAP %0, 1, s.lock  // lock(s)
	FENCE R, RW
	r1 = READ_ONCE(y); // load L

I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" orders
them. Note that the reordering is allowed in LKMM, because unlock-lock
only needs to be as strong as RCtso.
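
For illustration, here is a rough LKMM-level litmus sketch of that shape
(the test name, comments and expected result are mine and have not been
run through herd7 here; they only restate the RCtso point above):

	C unlock-lock-store-load

	(*
	 * Expectation: Sometimes.  Store S sits po-before the unlock and
	 * load L po-after the lock, and unlock+lock is only required to
	 * be RCtso, so LKMM should leave this SB-shaped outcome allowed,
	 * even if the RISC-V mapping above turns out to forbid it.
	 *)

	{}

	P0(spinlock_t *s, int *x, int *y)
	{
		int r1;

		spin_lock(s);
		WRITE_ONCE(*x, 1);	/* store S */
		spin_unlock(s);		/* unlock(s) */
		spin_lock(s);		/* lock(s) */
		r1 = READ_ONCE(*y);	/* load L */
		spin_unlock(s);
	}

	P1(int *x, int *y)
	{
		int r2;

		WRITE_ONCE(*y, 1);
		smp_mb();
		r2 = READ_ONCE(*x);
	}

	exists (0:r1=0 /\ 1:r2=0)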

Moreover, how about the outcome of the following case:

	{ 
	r1, r2 are registers (variables) on each CPU; X, Y are memory
	locations, all initialized to 0
	}

	CPU 0
	=====
	AMOSWAP r1, 1, X
	FENCE R, RW
	r2 = READ_ONCE(Y);

	CPU 1
	=====
	WRITE_ONCE(Y, 1);
	FENCE RW, RW
	r2 = READ_ONCE(X);

can we observe the result where r2 on CPU0 is 0 while r2 on CPU1 is 1?

Regards,
Boqun

> gives only TSO (R->R, R->W, W->W), because the two pieces of the AMO are
> split, and that breaks the chain.  Getting full RW->RW between the critical
> sections would therefore require an extra fence.  Also, the accesses to the
> lockwords themselves would not be ordered without an extra fence.
> 
> > Last time this came up, I think the RISC-V folks were generally happy to
> > implement whatever was necessary for Linux [1]. The thing that was stopping
> > us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> > Michael saw quite a bit of variety in the impact on benchmarks [2] across
> > different machines. So the question is whether newer Power machines are less
> > affected to the degree that we could consider making this change again.
> 
> Yes, as I said above, RISC-V will implement what is needed to make this work.
> 
> Dan
> 
> > Will
> > 
> > [1] https://lore.kernel.org/lkml/11b27d32-4a8a-3f84-0f25-723095ef1076@nvidia.com/
> > [2] https://lore.kernel.org/lkml/87tvp3xonl.fsf@concordia.ellerman.id.au/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10  0:01                   ` Boqun Feng
@ 2021-09-10  5:37                     ` Boqun Feng
  2021-09-10  9:33                     ` Peter Zijlstra
  1 sibling, 0 replies; 41+ messages in thread
From: Boqun Feng @ 2021-09-10  5:37 UTC (permalink / raw)
  To: Dan Lustig
  Cc: Will Deacon, Peter Zijlstra, Linus Torvalds, Alan Stern,
	Alexander Shishkin, Peter Anvin, Andrea Parri, Ingo Molnar,
	Paul E. McKenney, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Fri, Sep 10, 2021 at 08:01:14AM +0800, Boqun Feng wrote:
> On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> > On 9/9/2021 9:35 AM, Will Deacon wrote:
> > > [+Palmer, PaulW, Daniel and Michael]
> > > 
> > > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> > >> On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> > >>
> > >>> So if this is purely a RISC-V thing,
> > >>
> > >> Just to clarify, I think the current RISC-V thing is stronger than
> > >> PowerPC, but maybe not as strong as say ARM64, but RISC-V memory
> > >> ordering is still somewhat hazy to me.
> > >>
> > >> Specifically, the sequence:
> > >>
> > >> 	/* critical section s */
> > >> 	WRITE_ONCE(x, 1);
> > >> 	FENCE RW, W
> > >> 	WRITE_ONCE(s.lock, 0);		/* store S */
> > >> 	AMOSWAP %0, 1, r.lock		/* store R */
> > >> 	FENCE R, RW
> > >> 	WRITE_ONCE(y, 1);
> > >> 	/* critical section r */
> > >>
> > >> fully separates section s from section r, as in RW->RW ordering
> > >> (possibly not as strong as smp_mb() though), while on PowerPC it would
> > >> only impose TSO ordering between sections.
> > >>
> > >> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> > >> similarly it matches the R from the R->RW fence, yielding an:
> > >>
> > >> 	RW->  W
> > >> 	    RmW
> > >> 	    R  ->RW
> > >>
> > >> ordering. It's the stores S and R that can be re-ordered, but not the
> > >> sections themselves (same on PowerPC and many others).
> > >>
> > >> Clarification from a RISC-V enabled person would be appreciated.
> > 
> > To first order, RISC-V's memory model is very similar to ARMv8's.  It
> > is "other-multi-copy-atomic", unlike Power, and respects dependencies.
> > It also has AMOs and LR/SC with optional RCsc acquire or release
> > semantics.  There's no need to worry about RISC-V somehow pushing the
> > boundaries of weak memory ordering in new ways.
> > 
> > The tricky part is that unlike ARMv8, RISC-V doesn't have load-acquire
> > or store-release opcodes at all.  Only AMOs and LR/SC have acquire or
> > release options.  That means that while certain operations like swap
> > can be implemented with native RCsc semantics, others like store-release
> > have to fall back on fences and plain writes.
> > 
> > That's where the complexity came up last time this was discussed, at
> > least as it relates to RISC-V: how to make sure the combination of RCsc
> > atomics and plain operations+fences gives the semantics everyone is
> > asking for here.  And to be clear there, I'm not asking for LKMM to
> > weaken anything about critical section ordering just for RISC-V's sake.
> > TSO/RCsc ordering between critical sections is a perfectly reasonable
> > model in my opinion.  I just want to make sure RISC-V gets it right
> > given whatever the decision is.
> > 
> > >>> then I think it's entirely reasonable to
> > >>>
> > >>>         spin_unlock(&r);
> > >>>         spin_lock(&s);
> > >>>
> > >>> cannot be reordered.
> > >>
> > >> I'm obviously completely in favour of that :-)
> > > 
> > > I don't think we should require the accesses to the actual lockwords to
> > > be ordered here, as it becomes pretty onerous for relaxed LL/SC
> > > architectures where you'd end up with an extra barrier either after the
> > > unlock() or before the lock() operation. However, I remain absolutely in
> > > favour of strengthening the ordering of the _critical sections_ guarded by
> > > the locks to be RCsc.
> > 
> > I agree with Will here.  If the AMOSWAP above is actually implemented with
> > a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> > as Peter described.  If instead it's implemented using LR/SC, then RISC-V
> 
> Just out of curiosity, in the following code, can the store S and load L
> be reordered?
> 
> 	WRITE_ONCE(x, 1); // store S
> 	FENCE RW, W
>  	WRITE_ONCE(s.lock, 0); // unlock(s)
>  	AMOSWAP %0, 1, s.lock  // lock(s)
> 	FENCE R, RW
> 	r1 = READ_ONCE(y); // load L
> 
> I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
> them. Note that the reordering is allowed in LKMM, because unlock-lock
> only need to be as strong as RCtso.
> 
> Moreover, how about the outcome of the following case:
> 
> 	{ 
> 	r1, r2 are registers (variables) on each CPU, X, Y are memory
> 	locations, and initialized as 0
> 	}
> 
> 	CPU 0
> 	=====
> 	AMOSWAP r1, 1, X
> 	FENCE R, RW
> 	r2 = READ_ONCE(Y);
> 
> 	CPU 1
> 	=====
> 	WRITE_ONCE(Y, 1);
> 	FENCE RW, RW
> 	r2 = READ_ONCE(X);
> 
> can we observe the result where r2 on CPU0 is 0 while r2 on CPU1 is 1?
> 

As Andrea reminded me, what I meant to ask here is:

can we observe the result where r2 on CPU0 is 0 while r2 on CPU1 is 0?

Regards,
Boqun

> Regards,
> Boqun
> 
> > gives only TSO (R->R, R->W, W->W), because the two pieces of the AMO are
> > split, and that breaks the chain.  Getting full RW->RW between the critical
> > sections would therefore require an extra fence.  Also, the accesses to the
> > lockwords themselves would not be ordered without an extra fence.
> > 
> > > Last time this came up, I think the RISC-V folks were generally happy to
> > > implement whatever was necessary for Linux [1]. The thing that was stopping
> > > us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> > > Michael saw quite a bit of variety in the impact on benchmarks [2] across
> > > different machines. So the question is whether newer Power machines are less
> > > affected to the degree that we could consider making this change again.
> > 
> > Yes, as I said above, RISC-V will implement what is needed to make this work.
> > 
> > Dan
> > 
> > > Will
> > > 
> > > [1] https://lore.kernel.org/lkml/11b27d32-4a8a-3f84-0f25-723095ef1076@nvidia.com/
> > > [2] https://lore.kernel.org/lkml/87tvp3xonl.fsf@concordia.ellerman.id.au/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10  0:01                   ` Boqun Feng
  2021-09-10  5:37                     ` Boqun Feng
@ 2021-09-10  9:33                     ` Peter Zijlstra
  2021-09-10 10:04                       ` Boqun Feng
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-10  9:33 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Dan Lustig, Will Deacon, Linus Torvalds, Alan Stern,
	Alexander Shishkin, Peter Anvin, Andrea Parri, Ingo Molnar,
	Paul E. McKenney, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Fri, Sep 10, 2021 at 08:01:14AM +0800, Boqun Feng wrote:
> On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> > On 9/9/2021 9:35 AM, Will Deacon wrote:
> > > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:

> > >> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> > >> similarly it matches the R from the R->RW fence, yielding an:
> > >>
> > >> 	RW->  W
> > >> 	    RmW
> > >> 	    R  ->RW
> > >>
> > >> ordering. It's the stores S and R that can be re-ordered, but not the
> > >> sections themselves (same on PowerPC and many others).

> > I agree with Will here.  If the AMOSWAP above is actually implemented with
> > a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> > as Peter described.  If instead it's implemented using LR/SC, then RISC-V
> 
> Just out of curiosity, in the following code, can the store S and load L
> be reordered?
> 
> 	WRITE_ONCE(x, 1); // store S
> 	FENCE RW, W
>  	WRITE_ONCE(s.lock, 0); // unlock(s)
>  	AMOSWAP %0, 1, s.lock  // lock(s)
> 	FENCE R, RW
> 	r1 = READ_ONCE(y); // load L
> 
> I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
> them.

I'm confused by your argument; per the above quoted section, those
fences and the AMO combine into a RW,RW ordering which is (as per the
later clarification) multi-copy-atomic, aka smp_mb().

As such, S and L are not allowed to be re-ordered in the given scenario.

> Note that the reordering is allowed in LKMM, because unlock-lock
> only need to be as strong as RCtso.

Risc-V is strictly stronger than required in this instance, given the
current lock implementation. Daniel pointed out that if the atomic op
were LL/SC based instead of AMO it would end up being RCtso.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10  9:33                     ` Peter Zijlstra
@ 2021-09-10 10:04                       ` Boqun Feng
  2021-09-10 13:48                         ` Dan Lustig
  0 siblings, 1 reply; 41+ messages in thread
From: Boqun Feng @ 2021-09-10 10:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dan Lustig, Will Deacon, Linus Torvalds, Alan Stern,
	Alexander Shishkin, Peter Anvin, Andrea Parri, Ingo Molnar,
	Paul E. McKenney, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Fri, Sep 10, 2021 at 11:33:25AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 10, 2021 at 08:01:14AM +0800, Boqun Feng wrote:
> > On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> > > On 9/9/2021 9:35 AM, Will Deacon wrote:
> > > > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> 
> > > >> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> > > >> similarly it matches the R from the R->RW fence, yielding an:
> > > >>
> > > >> 	RW->  W
> > > >> 	    RmW
> > > >> 	    R  ->RW
> > > >>
> > > >> ordering. It's the stores S and R that can be re-ordered, but not the
> > > >> sections themselves (same on PowerPC and many others).
> 
> > > I agree with Will here.  If the AMOSWAP above is actually implemented with
> > > a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> > > as Peter described.  If instead it's implemented using LR/SC, then RISC-V
> > 
> > Just out of curiosity, in the following code, can the store S and load L
> > be reordered?
> > 
> > 	WRITE_ONCE(x, 1); // store S
> > 	FENCE RW, W
> >  	WRITE_ONCE(s.lock, 0); // unlock(s)
> >  	AMOSWAP %0, 1, s.lock  // lock(s)
> > 	FENCE R, RW
> > 	r1 = READ_ONCE(y); // load L
> > 
> > I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
> > them.
> 
> I'm confused by your argument, per the above quoted section, those
> fences and the AMO combine into a RW,RW ordering which is (as per the
> later clarification) multi-copy-atomic, aka smp_mb().
>

Right, my question is more about the reasoning for why fence rw,w +
AMO + fence r,rw acts as a fence rw,rw. Another related question: can
fence rw,w + store + fence w,rw act as a fence rw,rw by similar
reasoning? IOW, will the two loads in the following be reordered?

	r1 = READ_ONCE(x);
	FENCE RW, W
	WRITE_ONCE(z, 1);
	FENCE W, RW
	r2 = READ_ONCE(y);

again, this is more a question out of curiosity, not that I find
this pattern useful.

Regards,
Boqun

> As such, S and L are not allowed to be re-ordered in the given scenario.
> 
> > Note that the reordering is allowed in LKMM, because unlock-lock
> > only need to be as strong as RCtso.
> 
> Risc-V is strictly stronger than required in this instance. Given the
> current lock implementation. Daniel pointed out that if the atomic op
> were LL/SC based instead of AMO it would end up being RCtso.
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-09 17:46                 ` Paul E. McKenney
@ 2021-09-10 11:08                   ` Will Deacon
  2021-09-17  3:21                     ` Nicholas Piggin
  2021-09-17 14:36                     ` Michael Ellerman
  0 siblings, 2 replies; 41+ messages in thread
From: Will Deacon @ 2021-09-10 11:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Linus Torvalds, Alan Stern, Alexander Shishkin,
	Peter Anvin, Andrea Parri, Ingo Molnar, Vince Weaver,
	Thomas Gleixner, Jiri Olsa, Arnaldo Carvalho de Melo,
	Linux Kernel Mailing List, Stephane Eranian, linux-tip-commits,
	palmer, paul.walmsley, dlustig, mpe, npiggin

Hi Paul,

On Thu, Sep 09, 2021 at 10:46:35AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 09, 2021 at 02:35:36PM +0100, Will Deacon wrote:
> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
> > > > then I think it's entirely reasonable to
> > > > 
> > > >         spin_unlock(&r);
> > > >         spin_lock(&s);
> > > > 
> > > > cannot be reordered.
> > > 
> > > I'm obviously completely in favour of that :-)
> > 
> > I don't think we should require the accesses to the actual lockwords to
> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
> > architectures where you'd end up with an extra barrier either after the
> > unlock() or before the lock() operation. However, I remain absolutely in
> > favour of strengthening the ordering of the _critical sections_ guarded by
> > the locks to be RCsc.
> 
> If by this you mean the critical sections when observed only by other
> critical sections for a given lock, then everyone is already there.

No, I mean the case where somebody without the lock (but using memory
barriers) can observe the critical sections out of order (i.e. W -> R
order is not maintained).

> However...
> 
> > Last time this came up, I think the RISC-V folks were generally happy to
> > implement whatever was necessary for Linux [1]. The thing that was stopping
> > us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
> > Michael saw quite a bit of variety in the impact on benchmarks [2] across
> > different machines. So the question is whether newer Power machines are less
> > affected to the degree that we could consider making this change again.
> 
> Last I knew, on Power a pair of critical sections for a given lock could
> be observed out of order (writes from the earlier critical section vs.
> reads from the later critical section), but only by CPUs not holding
> that lock.  Also last I knew, tightening this would require upgrading
> some of the locking primitives' lwsync instructions to sync instructions.
> But I know very little about Power 10.

Yup, that's the one. This is the primary reason why we have the confusing
"RCtso" model today so this is my periodic "Do we still need this?" poking
for the Power folks :)

If the SYNC is a disaster for Power, then I'll ask again in another ~3 years
time in the hope that newer micro-architectures can swallow the instruction
more easily, but the results last time weren't hugely compelling and so _if_
there's an opportunity to make locking more "obvious" then I'm all for it.

Will

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10 10:04                       ` Boqun Feng
@ 2021-09-10 13:48                         ` Dan Lustig
  2021-09-10 14:15                           ` Boqun Feng
  0 siblings, 1 reply; 41+ messages in thread
From: Dan Lustig @ 2021-09-10 13:48 UTC (permalink / raw)
  To: Boqun Feng, Peter Zijlstra
  Cc: Will Deacon, Linus Torvalds, Alan Stern, Alexander Shishkin,
	Peter Anvin, Andrea Parri, Ingo Molnar, Paul E. McKenney,
	Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On 9/10/2021 6:04 AM, Boqun Feng wrote:
> On Fri, Sep 10, 2021 at 11:33:25AM +0200, Peter Zijlstra wrote:
>> On Fri, Sep 10, 2021 at 08:01:14AM +0800, Boqun Feng wrote:
>>> On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
>>>> On 9/9/2021 9:35 AM, Will Deacon wrote:
>>>>> On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>>
>>>>>> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
>>>>>> similarly it matches the R from the R->RW fence, yielding an:
>>>>>>
>>>>>> 	RW->  W
>>>>>> 	    RmW
>>>>>> 	    R  ->RW
>>>>>>
>>>>>> ordering. It's the stores S and R that can be re-ordered, but not the
>>>>>> sections themselves (same on PowerPC and many others).
>>
>>>> I agree with Will here.  If the AMOSWAP above is actually implemented with
>>>> a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
>>>> as Peter described.  If instead it's implemented using LR/SC, then RISC-V
>>>
>>> Just out of curiosity, in the following code, can the store S and load L
>>> be reordered?
>>>
>>> 	WRITE_ONCE(x, 1); // store S
>>> 	FENCE RW, W
>>>  	WRITE_ONCE(s.lock, 0); // unlock(s)
>>>  	AMOSWAP %0, 1, s.lock  // lock(s)
>>> 	FENCE R, RW
>>> 	r1 = READ_ONCE(y); // load L
>>>
>>> I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
>>> them.
>>
>> I'm confused by your argument, per the above quoted section, those
>> fences and the AMO combine into a RW,RW ordering which is (as per the
>> later clarification) multi-copy-atomic, aka smp_mb().
>>
> 
> Right, my question is more about the reasoning about why fence rw,w +
> AMO + fence r,rw act as a fence rw,rw.

Is this a RISC-V question?  If so, it's as simple as:
1) S and anything earlier are ordered before the AMO by the first fence
2) L and anything later are ordered after the AMO by the second fence
3) 1 + 2 = S and anything earlier are ordered before L or anything later

Since RISC-V is multi-copy atomic, 1+2 just naturally compose
transitively.
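
Mapping those steps back onto the earlier pseudo-code (the annotations
are mine, in the same informal notation used above):

	WRITE_ONCE(x, 1);	// store S
	FENCE RW, W		// (1) S and everything earlier ordered before
	WRITE_ONCE(s.lock, 0);	//     any later write, including the AMO's write
	AMOSWAP %0, 1, s.lock	//     the AMO supplies the W for (1) and the R for (2)
	FENCE R, RW		// (2) the AMO's read ordered before L and everything
	r1 = READ_ONCE(y);	//     later -- so by (3), S is ordered before L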

> Another related question, can
> fence rw,w + store + fence w,rw act as a fence rw,rw by the similar
> reasoning? IOW, will the two loads in the following be reordered?
> 
> 	r1 = READ_ONCE(x);
> 	FENCE RW, W
> 	WRITE_ONCE(z, 1);
> 	FENCE W, RW
> 	r2 = READ_ONCE(y);
> 
> again, this is more like a question out of curiosity, not that I find
> this pattern is useful.

Does FENCE W,RW appear in some actual use case?  But yes, if it does
appear, this sequence would also act as a FENCE RW,RW on RISC-V.

Dan

> Regards,
> Boqun
> 
>> As such, S and L are not allowed to be re-ordered in the given scenario.
>>
>>> Note that the reordering is allowed in LKMM, because unlock-lock
>>> only need to be as strong as RCtso.
>>
>> Risc-V is strictly stronger than required in this instance. Given the
>> current lock implementation. Daniel pointed out that if the atomic op
>> were LL/SC based instead of AMO it would end up being RCtso.
>>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10 13:48                         ` Dan Lustig
@ 2021-09-10 14:15                           ` Boqun Feng
  0 siblings, 0 replies; 41+ messages in thread
From: Boqun Feng @ 2021-09-10 14:15 UTC (permalink / raw)
  To: Dan Lustig
  Cc: Peter Zijlstra, Will Deacon, Linus Torvalds, Alan Stern,
	Alexander Shishkin, Peter Anvin, Andrea Parri, Ingo Molnar,
	Paul E. McKenney, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Fri, Sep 10, 2021 at 09:48:55AM -0400, Dan Lustig wrote:
> On 9/10/2021 6:04 AM, Boqun Feng wrote:
> > On Fri, Sep 10, 2021 at 11:33:25AM +0200, Peter Zijlstra wrote:
> >> On Fri, Sep 10, 2021 at 08:01:14AM +0800, Boqun Feng wrote:
> >>> On Thu, Sep 09, 2021 at 01:03:18PM -0400, Dan Lustig wrote:
> >>>> On 9/9/2021 9:35 AM, Will Deacon wrote:
> >>>>> On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
> >>
> >>>>>> The AMOSWAP is a RmW and as such matches the W from the RW->W fence,
> >>>>>> similarly it matches the R from the R->RW fence, yielding an:
> >>>>>>
> >>>>>> 	RW->  W
> >>>>>> 	    RmW
> >>>>>> 	    R  ->RW
> >>>>>>
> >>>>>> ordering. It's the stores S and R that can be re-ordered, but not the
> >>>>>> sections themselves (same on PowerPC and many others).
> >>
> >>>> I agree with Will here.  If the AMOSWAP above is actually implemented with
> >>>> a RISC-V AMO, then the two critical sections will be separated as if RW,RW,
> >>>> as Peter described.  If instead it's implemented using LR/SC, then RISC-V
> >>>
> >>> Just out of curiosity, in the following code, can the store S and load L
> >>> be reordered?
> >>>
> >>> 	WRITE_ONCE(x, 1); // store S
> >>> 	FENCE RW, W
> >>>  	WRITE_ONCE(s.lock, 0); // unlock(s)
> >>>  	AMOSWAP %0, 1, s.lock  // lock(s)
> >>> 	FENCE R, RW
> >>> 	r1 = READ_ONCE(y); // load L
> >>>
> >>> I think they can, because neither "FENCE RW, W" nor "FENCE R, RW" order
> >>> them.
> >>
> >> I'm confused by your argument, per the above quoted section, those
> >> fences and the AMO combine into a RW,RW ordering which is (as per the
> >> later clarification) multi-copy-atomic, aka smp_mb().
> >>
> > 
> > Right, my question is more about the reasoning about why fence rw,w +
> > AMO + fence r,rw act as a fence rw,rw.
> 
> Is this a RISC-V question?  If so, it's as simple as:

Yep, and thanks for the answer.

> 1) S and anything earlier are ordered before the AMO by the first fence
> 2) L and anything later are ordered after the AMO by the second fence
> 3) 1 + 2 = S and anything earlier are ordered before L or anything later
> 
> Since RISC-V is multi-copy atomic, so 1+2 just naturally compose
> transitively.
> 
> > Another related question, can
> > fence rw,w + store + fence w,rw act as a fence rw,rw by the similar
> > reasoning? IOW, will the two loads in the following be reordered?
> > 
> > 	r1 = READ_ONCE(x);
> > 	FENCE RW, W
> > 	WRITE_ONCE(z, 1);
> > 	FENCE W, RW
> > 	r2 = READ_ONCE(y);
> > 
> > again, this is more like a question out of curiosity, not that I find
> > this pattern is useful.
> 
> Does FENCE W,RW appear in some actual use case?  But yes, if it does

I'm not aware of any, but that's probably because no other arch can order
write->read without a full barrier (or release+acquire if RCsc).  We have
a few patterns in the kernel where we only want to order write->read and
smp_mb()s are used; if FENCE W,R is cheaper than FENCE RW,RW on RISCV,
then *in theory* we could have an smp_wrmb() implemented as FENCE W,R on
RISCV and as smp_mb() on other archs.

/me run

And I'm sure there are cases where we use smp_mb() although only
write->{read,write} needs to be ordered, so there may be a use case for
the same reason.

I'm not proposing doing anything, just saying that we don't use FENCE W,RW
because there is no equivalent concept in other archs, so it's not
modeled by an API. Besides, it may not be cheaper than FENCE RW,RW on
RISCV.

Regards,
Boqun

> appear, this sequence would also act as a FENCE RW,RW on RISC-V.
> 
> Dan
> 
> > Regards,
> > Boqun
> > 
> >> As such, S and L are not allowed to be re-ordered in the given scenario.
> >>
> >>> Note that the reordering is allowed in LKMM, because unlock-lock
> >>> only need to be as strong as RCtso.
> >>
> >> Risc-V is strictly stronger than required in this instance. Given the
> >> current lock implementation. Daniel pointed out that if the atomic op
> >> were LL/SC based instead of AMO it would end up being RCtso.
> >>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-09 18:00                   ` Paul E. McKenney
@ 2021-09-10 14:20                     ` Boqun Feng
  2021-09-10 15:33                       ` Palmer Dabbelt
  2021-09-10 16:36                       ` Alan Stern
  0 siblings, 2 replies; 41+ messages in thread
From: Boqun Feng @ 2021-09-10 14:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Dan Lustig, Will Deacon, Peter Zijlstra, Linus Torvalds,
	Alan Stern, Alexander Shishkin, Peter Anvin, Andrea Parri,
	Ingo Molnar, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Thu, Sep 09, 2021 at 11:00:05AM -0700, Paul E. McKenney wrote:
[...]
> 
> Boqun, I vaguely remember a suggested change from you along these lines,
> but now I cannot find it.  Could you please send it as a formal patch
> if you have not already done so or point me at it if you have?
> 

Here is a draft patch based on the change I did when I discussed this
with Peter, and I really want to hear Alan's thoughts first. Ideally, we
should also have related litmus tests and send them to the linux-arch
list so that we know the ordering is provided by every architecture.

Regards,
Boqun

--------------------------------->8
Subject: [PATCH] tools/memory-model: Provide extra ordering for
 lock-{release,acquire} on the same CPU

A recent discussion[1] shows that we are in favor of strengthening the
ordering of lock-release + lock-acquire on the same CPU: a lock-release
and a po-after lock-acquire should provide the so-called RCtso ordering,
that is a memory access S po-before the lock-release should be ordered
against a memory access R po-after the lock-acquire, unless S is a store
and R is a load.

The strengthening meets programmers' expectation that "sequence of two
locked regions to be ordered wrt each other" (from Linus), and can
reduce the mental burden when using locks. Therefore add it to LKMM.

[1]: https://lore.kernel.org/lkml/20210909185937.GA12379@rowland.harvard.edu/

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 .../Documentation/explanation.txt             | 28 +++++++++++++++++++
 tools/memory-model/linux-kernel.cat           |  6 ++--
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
index 5d72f3112e56..d62de21f32c4 100644
--- a/tools/memory-model/Documentation/explanation.txt
+++ b/tools/memory-model/Documentation/explanation.txt
@@ -1847,6 +1847,34 @@ therefore the load of x must execute before the load of y.  Thus we
 cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
 MP pattern).
 
+This requirement also applies to a lock-release and a lock-acquire
+on different locks, as long as the lock-acquire is po-after the
+lock-release. Note that "po-after" means the lock-acquire and the
+lock-release are on the same CPU. An example similar to the above:
+
+	int x, y;
+	spinlock_t s;
+	spinlock_t t;
+
+	P0()
+	{
+		int r1, r2;
+
+		spin_lock(&s);
+		r1 = READ_ONCE(x);
+		spin_unlock(&s);
+		spin_lock(&t);
+		r2 = READ_ONCE(y);
+		spin_unlock(&t);
+	}
+
+	P1()
+	{
+		WRITE_ONCE(y, 1);
+		smp_wmb();
+		WRITE_ONCE(x, 1);
+	}
+
 This requirement does not apply to ordinary release and acquire
 fences, only to lock-related operations.  For instance, suppose P0()
 in the example had been written as:
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 2a9b4fe4a84e..d70315fddef6 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -27,7 +27,7 @@ include "lock.cat"
 (* Release Acquire *)
 let acq-po = [Acquire] ; po ; [M]
 let po-rel = [M] ; po ; [Release]
-let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
+let po-unlock-lock-po = po ; [UL] ; (po|rf) ; [LKR] ; po
 
 (* Fences *)
 let R4rmb = R \ Noreturn	(* Reads for which rmb works *)
@@ -70,12 +70,12 @@ let rwdep = (dep | ctrl) ; [W]
 let overwrite = co | fr
 let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
 let to-r = addr | (dep ; [Marked] ; rfi)
-let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
+let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
 
 (* Propagation: Ordering from release operations and strong fences. *)
 let A-cumul(r) = (rfe ; [Marked])? ; r
 let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb |
-	po-unlock-rf-lock-po) ; [Marked]
+	po-unlock-lock-po) ; [Marked]
 let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ;
 	[Marked] ; rfe? ; [Marked]
 
-- 
2.32.0
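
As a sketch of the kind of litmus test meant above, here is a direct
rendering of the explanation.txt example from this patch (the test name
is made up and the stated results are only my expectation, not herd7
output):

	C unlock-lock-mp-different-locks

	(*
	 * Expectation: Sometimes with the current linux-kernel.cat,
	 * Never with this change applied, matching the explanation.txt
	 * example above.
	 *)

	{}

	P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
	{
		int r1;
		int r2;

		spin_lock(s);
		r1 = READ_ONCE(*x);
		spin_unlock(s);
		spin_lock(t);
		r2 = READ_ONCE(*y);
		spin_unlock(t);
	}

	P1(int *x, int *y)
	{
		WRITE_ONCE(*y, 1);
		smp_wmb();
		WRITE_ONCE(*x, 1);
	}

	exists (0:r1=1 /\ 0:r2=0)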


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10 14:20                     ` Boqun Feng
@ 2021-09-10 15:33                       ` Palmer Dabbelt
  2021-09-10 16:36                       ` Alan Stern
  1 sibling, 0 replies; 41+ messages in thread
From: Palmer Dabbelt @ 2021-09-10 15:33 UTC (permalink / raw)
  To: boqun.feng
  Cc: paulmck, Daniel Lustig, will, peterz, Linus Torvalds, stern,
	alexander.shishkin, hpa, parri.andrea, mingo, vincent.weaver,
	tglx, jolsa, acme, linux-kernel, eranian, linux-tip-commits,
	Paul Walmsley, mpe

On Fri, 10 Sep 2021 07:20:13 PDT (-0700), boqun.feng@gmail.com wrote:
> On Thu, Sep 09, 2021 at 11:00:05AM -0700, Paul E. McKenney wrote:
> [...]
>>
>> Boqun, I vaguely remember a suggested change from you along these lines,
>> but now I cannot find it.  Could you please send it as a formal patch
>> if you have not already done so or point me at it if you have?
>>
>
> Here is a draft patch based on the change I did when I discussed with
> Peter, and I really want to hear Alan's thought first. Ideally, we
> should also have related litmus tests and send to linux-arch list so
> that we know the ordering is provided by every architecture.
>
> Regards,
> Boqun
>
> --------------------------------->8
> Subject: [PATCH] tools/memory-model: Provide extra ordering for
>  lock-{release,acquire} on the same CPU
>
> A recent discussion[1] shows that we are in favor of strengthening the
> ordering of lock-release + lock-acquire on the same CPU: a lock-release
> and a po-after lock-acquire should provide the so-called RCtso ordering,
> that is a memory access S po-before the lock-release should be ordered
> against a memory access R po-after the lock-acquire, unless S is a store
> and R is a load.
>
> The strengthening meets programmers' expectation that "sequence of two
> locked regions to be ordered wrt each other" (from Linus), and can
> reduce the mental burden when using locks. Therefore add it in LKMM.
>
> [1]: https://lore.kernel.org/lkml/20210909185937.GA12379@rowland.harvard.edu/
>
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>  .../Documentation/explanation.txt             | 28 +++++++++++++++++++
>  tools/memory-model/linux-kernel.cat           |  6 ++--
>  2 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
> index 5d72f3112e56..d62de21f32c4 100644
> --- a/tools/memory-model/Documentation/explanation.txt
> +++ b/tools/memory-model/Documentation/explanation.txt
> @@ -1847,6 +1847,34 @@ therefore the load of x must execute before the load of y.  Thus we
>  cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
>  MP pattern).
>
> +This requirement also applies to a lock-release and a lock-acquire
> +on different locks, as long as the lock-acquire is po-after the
> +lock-release. Note that "po-after" means the lock-acquire and the
> +lock-release are on the same CPU. An example similar to the above:
> +
> +	int x, y;
> +	spinlock_t s;
> +	spinlock_t t;
> +
> +	P0()
> +	{
> +		int r1, r2;
> +
> +		spin_lock(&s);
> +		r1 = READ_ONCE(x);
> +		spin_unlock(&s);
> +		spin_lock(&t);
> +		r2 = READ_ONCE(y);
> +		spin_unlock(&t);
> +	}
> +
> +	P1()
> +	{
> +		WRITE_ONCE(y, 1);
> +		smp_wmb();
> +		WRITE_ONCE(x, 1);
> +	}
> +
>  This requirement does not apply to ordinary release and acquire
>  fences, only to lock-related operations.  For instance, suppose P0()
>  in the example had been written as:
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 2a9b4fe4a84e..d70315fddef6 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -27,7 +27,7 @@ include "lock.cat"
>  (* Release Acquire *)
>  let acq-po = [Acquire] ; po ; [M]
>  let po-rel = [M] ; po ; [Release]
> -let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
> +let po-unlock-lock-po = po ; [UL] ; (po|rf) ; [LKR] ; po
>
>  (* Fences *)
>  let R4rmb = R \ Noreturn	(* Reads for which rmb works *)
> @@ -70,12 +70,12 @@ let rwdep = (dep | ctrl) ; [W]
>  let overwrite = co | fr
>  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
>  let to-r = addr | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
> +let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
>
>  (* Propagation: Ordering from release operations and strong fences. *)
>  let A-cumul(r) = (rfe ; [Marked])? ; r
>  let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb |
> -	po-unlock-rf-lock-po) ; [Marked]
> +	po-unlock-lock-po) ; [Marked]
>  let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ;
>  	[Marked] ; rfe? ; [Marked]

I'm not a memory model person so I don't really feel comfortable 
reviewing this, but I can follow the non-formal discussion so

Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> # For the RISC-V fallout

So far we've been sticking with the "fastest implementation allowed by 
the spec" mentality, but TBH I think we'd have ended up moving to RCsc 
locks regardless of where LKMM ended up just to be in line with the more 
popular architectures.  With mmiowb gone I think this was the last bit 
of memory model weirdness we'd been carrying around in the RISC-V port, 
so it would have always just been a worry.

We don't really have any hardware to evaluate the performance 
implications of this change, as there are no interestingly aggressive 
implementations of the memory model available today.  Like Dan said,
we've got all the ISA mechanisms in place to adequately describe these
orderings to hardware, so in theory implementations should be able to 
handle this without falling off any performance cliffs.

Happy to take a look at an implementation of this on RISC-V, but if 
nothing arises I'll go sort it out.  It does remind me that we were 
supposed to move over to those generic ticket spinlocks, though...

Thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10 14:20                     ` Boqun Feng
  2021-09-10 15:33                       ` Palmer Dabbelt
@ 2021-09-10 16:36                       ` Alan Stern
  2021-09-10 17:12                         ` Peter Zijlstra
                                           ` (2 more replies)
  1 sibling, 3 replies; 41+ messages in thread
From: Alan Stern @ 2021-09-10 16:36 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul E. McKenney, Dan Lustig, Will Deacon, Peter Zijlstra,
	Linus Torvalds, Alexander Shishkin, Peter Anvin, Andrea Parri,
	Ingo Molnar, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

[-- Attachment #1: Type: text/plain, Size: 5554 bytes --]

On Fri, Sep 10, 2021 at 10:20:13PM +0800, Boqun Feng wrote:
> On Thu, Sep 09, 2021 at 11:00:05AM -0700, Paul E. McKenney wrote:
> [...]
> > 
> > Boqun, I vaguely remember a suggested change from you along these lines,
> > but now I cannot find it.  Could you please send it as a formal patch
> > if you have not already done so or point me at it if you have?
> > 
> 
> Here is a draft patch based on the change I did when I discussed with
> Peter, and I really want to hear Alan's thought first. Ideally, we
> should also have related litmus tests and send to linux-arch list so
> that we know the ordering is provided by every architecture.
> 
> Regards,
> Boqun
> 
> --------------------------------->8
> Subject: [PATCH] tools/memory-model: Provide extra ordering for
>  lock-{release,acquire} on the same CPU
> 
> A recent discussion[1] shows that we are in favor of strengthening the
> ordering of lock-release + lock-acquire on the same CPU: a lock-release
> and a po-after lock-acquire should provide the so-called RCtso ordering,
> that is a memory access S po-before the lock-release should be ordered
> against a memory access R po-after the lock-acquire, unless S is a store
> and R is a load.
> 
> > The strengthening meets programmers' expectation that "sequence of two
> locked regions to be ordered wrt each other" (from Linus), and can
> reduce the mental burden when using locks. Therefore add it in LKMM.
> 
> [1]: https://lore.kernel.org/lkml/20210909185937.GA12379@rowland.harvard.edu/
> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---

The change to linux-kernel.cat looks fine.  However, I didn't like your 
update to explanation.txt.  Instead I wrote my own, given below.

I also wrote a couple of litmus tests which Paul can add to the 
appropriate archive.  They are attached to this email.  As expected, 
they fail (result Sometimes) with the current LKMM and succeed (Never) 
with Boqun's updated model.

Alan


--- usb-devel.orig/tools/memory-model/Documentation/explanation.txt
+++ usb-devel/tools/memory-model/Documentation/explanation.txt
@@ -1813,15 +1813,16 @@ spin_trylock() -- we can call these thin
 lock-acquires -- have two properties beyond those of ordinary releases
 and acquires.
 
-First, when a lock-acquire reads from a lock-release, the LKMM
-requires that every instruction po-before the lock-release must
-execute before any instruction po-after the lock-acquire.  This would
-naturally hold if the release and acquire operations were on different
-CPUs, but the LKMM says it holds even when they are on the same CPU.
-For example:
+First, when a lock-acquire reads from or is po-after a lock-release,
+the LKMM requires that every instruction po-before the lock-release
+must execute before any instruction po-after the lock-acquire.  This
+would naturally hold if the release and acquire operations were on
+different CPUs and accessed the same lock variable, but the LKMM says
+it also holds when they are on the same CPU, even if they access
+different lock variables.  For example:
 
 	int x, y;
-	spinlock_t s;
+	spinlock_t s, t;
 
 	P0()
 	{
@@ -1830,9 +1831,9 @@ For example:
 		spin_lock(&s);
 		r1 = READ_ONCE(x);
 		spin_unlock(&s);
-		spin_lock(&s);
+		spin_lock(&t);
 		r2 = READ_ONCE(y);
-		spin_unlock(&s);
+		spin_unlock(&t);
 	}
 
 	P1()
@@ -1842,10 +1843,10 @@ For example:
 		WRITE_ONCE(x, 1);
 	}
 
-Here the second spin_lock() reads from the first spin_unlock(), and
-therefore the load of x must execute before the load of y.  Thus we
-cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
-MP pattern).
+Here the second spin_lock() is po-after the first spin_unlock(), and
+therefore the load of x must execute before the load of y, even tbough
+the two locking operations use different locks.  Thus we cannot have
+r1 = 1 and r2 = 0 at the end (this is an instance of the MP pattern).
 
 This requirement does not apply to ordinary release and acquire
 fences, only to lock-related operations.  For instance, suppose P0()
@@ -1872,13 +1873,13 @@ instructions in the following order:
 
 and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
 
-Second, when a lock-acquire reads from a lock-release, and some other
-stores W and W' occur po-before the lock-release and po-after the
-lock-acquire respectively, the LKMM requires that W must propagate to
-each CPU before W' does.  For example, consider:
+Second, when a lock-acquire reads from or is po-after a lock-release,
+and some other stores W and W' occur po-before the lock-release and
+po-after the lock-acquire respectively, the LKMM requires that W must
+propagate to each CPU before W' does.  For example, consider:
 
 	int x, y;
-	spinlock_t x;
+	spinlock_t s;
 
 	P0()
 	{
@@ -1908,7 +1909,12 @@ each CPU before W' does.  For example, c
 
 If r1 = 1 at the end then the spin_lock() in P1 must have read from
 the spin_unlock() in P0.  Hence the store to x must propagate to P2
-before the store to y does, so we cannot have r2 = 1 and r3 = 0.
+before the store to y does, so we cannot have r2 = 1 and r3 = 0.  But
+if P1 had used a lock variable different from s, the writes could have
+propagated in either order.  (On the other hand, if the code in P0 and
+P1 had all executed on a single CPU, as in the example before this
+one, then the writes would have propagated in order even if the two
+critical sections used different lock variables.)
 
 These two special requirements for lock-release and lock-acquire do
 not arise from the operational model.  Nevertheless, kernel developers


[-- Attachment #2: ullk-rw.litmus --]
[-- Type: text/plain, Size: 523 bytes --]

C ullk-rw

(*
 * Result: Never
 *
 * If two locked critical sections execute on the same CPU, all accesses
 * in the first must execute before any accesses in the second, even if
 * the critical sections are protected by different locks.
 *)

{}

P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
{
	int r1;

	spin_lock(s);
	r1 = READ_ONCE(*x);
	spin_unlock(s);
	spin_lock(t);
	WRITE_ONCE(*y, 1);
	spin_unlock(t);
}

P1(int *x, int *y)
{
	int r2;

	r2 = smp_load_acquire(y);
	WRITE_ONCE(*x, 1);
}

exists (0:r1=1 /\ 1:r2=1)

[-- Attachment #3: ullk-ww.litmus --]
[-- Type: text/plain, Size: 533 bytes --]

C ullk-ww

(*
 * Result: Never
 *
 * If two locked critical sections execute on the same CPU, stores in the
 * first must propagate to each CPU before stores in the second do, even if
 * the critical sections are protected by different locks.
 *)

{}

P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
{
	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
	spin_lock(t);
	WRITE_ONCE(*y, 1);
	spin_unlock(t);
}

P1(int *x, int *y)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*y);
	smp_rmb();
	r2 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 1:r2=0)
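
As a usage note (assuming a herdtools7 install and running from the
tools/memory-model directory, as described in its README), the tests can
be checked with:

	herd7 -conf linux-kernel.cfg ullk-rw.litmus
	herd7 -conf linux-kernel.cfg ullk-ww.litmus

The "Observation" line of the output should read Sometimes with the
current linux-kernel.cat and Never with Boqun's updated model, matching
the results stated above.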

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10 16:36                       ` Alan Stern
@ 2021-09-10 17:12                         ` Peter Zijlstra
  2021-09-10 17:56                           ` Alan Stern
  2021-09-10 17:17                         ` Peter Zijlstra
  2021-09-12  0:26                         ` Boqun Feng
  2 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-10 17:12 UTC (permalink / raw)
  To: Alan Stern
  Cc: Boqun Feng, Paul E. McKenney, Dan Lustig, Will Deacon,
	Linus Torvalds, Alexander Shishkin, Peter Anvin, Andrea Parri,
	Ingo Molnar, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Fri, Sep 10, 2021 at 12:36:32PM -0400, Alan Stern wrote:
> +Here the second spin_lock() is po-after the first spin_unlock(), and
> +therefore the load of x must execute before the load of y, even tbough

I think that's commonly spelled: though, right?                    ^^^^^^


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10 16:36                       ` Alan Stern
  2021-09-10 17:12                         ` Peter Zijlstra
@ 2021-09-10 17:17                         ` Peter Zijlstra
  2021-09-12  0:26                         ` Boqun Feng
  2 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-10 17:17 UTC (permalink / raw)
  To: Alan Stern
  Cc: Boqun Feng, Paul E. McKenney, Dan Lustig, Will Deacon,
	Linus Torvalds, Alexander Shishkin, Peter Anvin, Andrea Parri,
	Ingo Molnar, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Fri, Sep 10, 2021 at 12:36:32PM -0400, Alan Stern wrote:
> --- usb-devel.orig/tools/memory-model/Documentation/explanation.txt
> +++ usb-devel/tools/memory-model/Documentation/explanation.txt
> @@ -1813,15 +1813,16 @@ spin_trylock() -- we can call these thin
>  lock-acquires -- have two properties beyond those of ordinary releases
>  and acquires.
>  
> +First, when a lock-acquire reads from or is po-after a lock-release,
> +the LKMM requires that every instruction po-before the lock-release
> +must execute before any instruction po-after the lock-acquire.  This
> +would naturally hold if the release and acquire operations were on
> +different CPUs and accessed the same lock variable, but the LKMM says
> +it also holds when they are on the same CPU, even if they access
> +different lock variables.  For example:

Could be I don't understand this right, but the way I'm reading it, it
seems to imply RCsc. Which I don't think we're actually asking at this
time.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10 17:12                         ` Peter Zijlstra
@ 2021-09-10 17:56                           ` Alan Stern
  0 siblings, 0 replies; 41+ messages in thread
From: Alan Stern @ 2021-09-10 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, Paul E. McKenney, Dan Lustig, Will Deacon,
	Linus Torvalds, Alexander Shishkin, Peter Anvin, Andrea Parri,
	Ingo Molnar, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Fri, Sep 10, 2021 at 07:12:21PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 10, 2021 at 12:36:32PM -0400, Alan Stern wrote:
> > +Here the second spin_lock() is po-after the first spin_unlock(), and
> > +therefore the load of x must execute before the load of y, even tbough
> 
> I think that's commonly spelled: though, right?                    ^^^^^^

Oops, yes, I missed that.  Good eye!

> > --- usb-devel.orig/tools/memory-model/Documentation/explanation.txt
> > +++ usb-devel/tools/memory-model/Documentation/explanation.txt
> > @@ -1813,15 +1813,16 @@ spin_trylock() -- we can call these thin
> >  lock-acquires -- have two properties beyond those of ordinary releases
> >  and acquires.
> >  
> > +First, when a lock-acquire reads from or is po-after a lock-release,
> > +the LKMM requires that every instruction po-before the lock-release
> > +must execute before any instruction po-after the lock-acquire.  This
> > +would naturally hold if the release and acquire operations were on
> > +different CPUs and accessed the same lock variable, but the LKMM says
> > +it also holds when they are on the same CPU, even if they access
> > +different lock variables.  For example:
> 
> Could be I don't understand this right, but the way I'm reading it, it
> seems to imply RCsc. Which I don't think we're actually asking at this
> time.

No, it doesn't imply RCsc.  This document makes a distinction between 
when a store executes and when it becomes visible to (or propagates to) 
other CPUs.  Thus, even though write 1 executes before write 2, write 2 
might become visible to a different CPU before write 1 does.  In fact, 
on non-other-multicopy-atomic systems, two writes might become visible 
to different CPUs in different orders (think of the IRIW litmus 
pattern.)

Or to consider a more relevant example, a write can execute before a 
read even though the write doesn't become visible to other CPUs until 
after the read is complete.

If you want, you can read this as saying "execute as seen from its own 
CPU" (although even that isn't entirely right, since a write can be 
visible to a po-later read which nevertheless executes before the write 
does).  Or think of a write as executing when its value gets put into 
the local store buffer, rather than when it gets put into the cache 
line.
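
For concreteness, a litmus-style sketch of that distinction (the test
name and the expected result are mine, not something verified with
herd7 here): even with the updated model, the SB-shaped outcome below
should remain allowed, because each write can still sit in its CPU's
store buffer while the po-later read completes; if unlock+lock acted
like smp_mb() instead, the outcome would be forbidden.

	C ullk-sb

	{}

	P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
	{
		int r1;

		spin_lock(s);
		WRITE_ONCE(*x, 1);
		spin_unlock(s);
		spin_lock(t);
		r1 = READ_ONCE(*y);
		spin_unlock(t);
	}

	P1(spinlock_t *u, spinlock_t *v, int *x, int *y)
	{
		int r2;

		spin_lock(u);
		WRITE_ONCE(*y, 1);
		spin_unlock(u);
		spin_lock(v);
		r2 = READ_ONCE(*x);
		spin_unlock(v);
	}

	exists (0:r1=0 /\ 1:r2=0)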

Alan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10 16:36                       ` Alan Stern
  2021-09-10 17:12                         ` Peter Zijlstra
  2021-09-10 17:17                         ` Peter Zijlstra
@ 2021-09-12  0:26                         ` Boqun Feng
  2 siblings, 0 replies; 41+ messages in thread
From: Boqun Feng @ 2021-09-12  0:26 UTC (permalink / raw)
  To: Alan Stern
  Cc: Paul E. McKenney, Dan Lustig, Will Deacon, Peter Zijlstra,
	Linus Torvalds, Alexander Shishkin, Peter Anvin, Andrea Parri,
	Ingo Molnar, Vince Weaver, Thomas Gleixner, Jiri Olsa,
	Arnaldo Carvalho de Melo, Linux Kernel Mailing List,
	Stephane Eranian, linux-tip-commits, palmer, paul.walmsley, mpe

On Fri, Sep 10, 2021 at 12:36:32PM -0400, Alan Stern wrote:
> On Fri, Sep 10, 2021 at 10:20:13PM +0800, Boqun Feng wrote:
> > On Thu, Sep 09, 2021 at 11:00:05AM -0700, Paul E. McKenney wrote:
> > [...]
> > > 
> > > Boqun, I vaguely remember a suggested change from you along these lines,
> > > but now I cannot find it.  Could you please send it as a formal patch
> > > if you have not already done so or point me at it if you have?
> > > 
> > 
> > Here is a draft patch based on the change I did when I discussed with
> > Peter, and I really want to hear Alan's thought first. Ideally, we
> > should also have related litmus tests and send to linux-arch list so
> > that we know the ordering is provided by every architecture.
> > 
> > Regards,
> > Boqun
> > 
> > --------------------------------->8
> > Subject: [PATCH] tools/memory-model: Provide extra ordering for
> >  lock-{release,acquire} on the same CPU
> > 
> > A recent discussion[1] shows that we are in favor of strengthening the
> > ordering of lock-release + lock-acquire on the same CPU: a lock-release
> > and a po-after lock-acquire should provide the so-called RCtso ordering,
> > that is a memory access S po-before the lock-release should be ordered
> > against a memory access R po-after the lock-acquire, unless S is a store
> > and R is a load.
> > 
> > > The strengthening meets programmers' expectation that "sequence of two
> > locked regions to be ordered wrt each other" (from Linus), and can
> > reduce the mental burden when using locks. Therefore add it in LKMM.
> > 
> > [1]: https://lore.kernel.org/lkml/20210909185937.GA12379@rowland.harvard.edu/
> > 
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > ---
> 
> The change to linux-kernel.cat looks fine.  However, I didn't like your 
> update to explanation.txt.  Instead I wrote my own, given below.
> 

Thanks. Indeed your changes to the explanation are better.

> I also wrote a couple of litmus tests which Paul can add to the 
> appropriate archive.  They are attached to this email.  As expected, 
> they fail (result Sometimes) with the current LKMM and succeed (Never) 
> with Boqun's updated model.
> 

Appreciate it. I will put together your change to explanation.txt (with
the typo fixed), my change to the cat file, and the litmus tests, and
send a proper patch next Monday.

Regards,
Boqun

> Alan
> 
> 
> --- usb-devel.orig/tools/memory-model/Documentation/explanation.txt
> +++ usb-devel/tools/memory-model/Documentation/explanation.txt
> @@ -1813,15 +1813,16 @@ spin_trylock() -- we can call these thin
>  lock-acquires -- have two properties beyond those of ordinary releases
>  and acquires.
>  
> -First, when a lock-acquire reads from a lock-release, the LKMM
> -requires that every instruction po-before the lock-release must
> -execute before any instruction po-after the lock-acquire.  This would
> -naturally hold if the release and acquire operations were on different
> -CPUs, but the LKMM says it holds even when they are on the same CPU.
> -For example:
> +First, when a lock-acquire reads from or is po-after a lock-release,
> +the LKMM requires that every instruction po-before the lock-release
> +must execute before any instruction po-after the lock-acquire.  This
> +would naturally hold if the release and acquire operations were on
> +different CPUs and accessed the same lock variable, but the LKMM says
> +it also holds when they are on the same CPU, even if they access
> +different lock variables.  For example:
>  
>  	int x, y;
> -	spinlock_t s;
> +	spinlock_t s, t;
>  
>  	P0()
>  	{
> @@ -1830,9 +1831,9 @@ For example:
>  		spin_lock(&s);
>  		r1 = READ_ONCE(x);
>  		spin_unlock(&s);
> -		spin_lock(&s);
> +		spin_lock(&t);
>  		r2 = READ_ONCE(y);
> -		spin_unlock(&s);
> +		spin_unlock(&t);
>  	}
>  
>  	P1()
> @@ -1842,10 +1843,10 @@ For example:
>  		WRITE_ONCE(x, 1);
>  	}
>  
> -Here the second spin_lock() reads from the first spin_unlock(), and
> -therefore the load of x must execute before the load of y.  Thus we
> -cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
> -MP pattern).
> +Here the second spin_lock() is po-after the first spin_unlock(), and
> +therefore the load of x must execute before the load of y, even tbough
> +the two locking operations use different locks.  Thus we cannot have
> +r1 = 1 and r2 = 0 at the end (this is an instance of the MP pattern).
>  
>  This requirement does not apply to ordinary release and acquire
>  fences, only to lock-related operations.  For instance, suppose P0()
> @@ -1872,13 +1873,13 @@ instructions in the following order:
>  
>  and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
>  
> -Second, when a lock-acquire reads from a lock-release, and some other
> -stores W and W' occur po-before the lock-release and po-after the
> -lock-acquire respectively, the LKMM requires that W must propagate to
> -each CPU before W' does.  For example, consider:
> +Second, when a lock-acquire reads from or is po-after a lock-release,
> +and some other stores W and W' occur po-before the lock-release and
> +po-after the lock-acquire respectively, the LKMM requires that W must
> +propagate to each CPU before W' does.  For example, consider:
>  
>  	int x, y;
> -	spinlock_t x;
> +	spinlock_t s;
>  
>  	P0()
>  	{
> @@ -1908,7 +1909,12 @@ each CPU before W' does.  For example, c
>  
>  If r1 = 1 at the end then the spin_lock() in P1 must have read from
>  the spin_unlock() in P0.  Hence the store to x must propagate to P2
> -before the store to y does, so we cannot have r2 = 1 and r3 = 0.
> +before the store to y does, so we cannot have r2 = 1 and r3 = 0.  But
> +if P1 had used a lock variable different from s, the writes could have
> +propagated in either order.  (On the other hand, if the code in P0 and
> +P1 had all executed on a single CPU, as in the example before this
> +one, then the writes would have propagated in order even if the two
> +critical sections used different lock variables.)
>  
>  These two special requirements for lock-release and lock-acquire do
>  not arise from the operational model.  Nevertheless, kernel developers
> 

> C ullk-rw
> 
> (*
>  * Result: Never
>  *
>  * If two locked critical sections execute on the same CPU, all accesses
>  * in the first must execute before any accesses in the second, even if
>  * the critical sections are protected by different locks.
>  *)
> 
> {}
> 
> P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
> {
> 	int r1;
> 
> 	spin_lock(s);
> 	r1 = READ_ONCE(*x);
> 	spin_unlock(s);
> 	spin_lock(t);
> 	WRITE_ONCE(*y, 1);
> 	spin_unlock(t);
> }
> 
> P1(int *x, int *y)
> {
> 	int r2;
> 
> 	r2 = smp_load_acquire(y);
> 	WRITE_ONCE(*x, 1);
> }
> 
> exists (0:r1=1 /\ 1:r2=1)

> C ullk-ww
> 
> (*
>  * Result: Never
>  *
>  * If two locked critical sections execute on the same CPU, stores in the
>  * first must propagate to each CPU before stores in the second do, even if
>  * the critical sections are protected by different locks.
>  *)
> 
> {}
> 
> P0(spinlock_t *s, spinlock_t *t, int *x, int *y)
> {
> 	spin_lock(s);
> 	WRITE_ONCE(*x, 1);
> 	spin_unlock(s);
> 	spin_lock(t);
> 	WRITE_ONCE(*y, 1);
> 	spin_unlock(t);
> }
> 
> P1(int *x, int *y)
> {
> 	int r1;
> 	int r2;
> 
> 	r1 = READ_ONCE(*y);
> 	smp_rmb();
> 	r2 = READ_ONCE(*x);
> }
> 
> exists (1:r1=1 /\ 1:r2=0)
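
For completeness, assuming these two files are dropped into
tools/memory-model/litmus-tests/, they can be checked from the
tools/memory-model/ directory with the usual invocation, for example:

	herd7 -conf linux-kernel.cfg litmus-tests/ullk-rw.litmus

which should report Never with the updated model and Sometimes with the
current one.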


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10 11:08                   ` Will Deacon
@ 2021-09-17  3:21                     ` Nicholas Piggin
  2021-09-17  5:31                       ` Nicholas Piggin
  2021-09-17 14:36                     ` Michael Ellerman
  1 sibling, 1 reply; 41+ messages in thread
From: Nicholas Piggin @ 2021-09-17  3:21 UTC (permalink / raw)
  To: Paul E. McKenney, Will Deacon
  Cc: Arnaldo Carvalho de Melo, Alexander Shishkin, dlustig,
	Stephane Eranian, Peter Anvin, Jiri Olsa,
	Linux Kernel Mailing List, linux-tip-commits, Ingo Molnar, mpe,
	palmer, Andrea Parri, paul.walmsley, Peter Zijlstra, Alan Stern,
	Thomas Gleixner, Linus Torvalds, Vince Weaver

Excerpts from Will Deacon's message of September 10, 2021 9:08 pm:
> Hi Paul,
> 
> On Thu, Sep 09, 2021 at 10:46:35AM -0700, Paul E. McKenney wrote:
>> On Thu, Sep 09, 2021 at 02:35:36PM +0100, Will Deacon wrote:
>> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>> > > On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
>> > > > then I think it's entirely reasonable to
>> > > > 
>> > > >         spin_unlock(&r);
>> > > >         spin_lock(&s);
>> > > > 
>> > > > cannot be reordered.
>> > > 
>> > > I'm obviously completely in favour of that :-)
>> > 
>> > I don't think we should require the accesses to the actual lockwords to
>> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
>> > architectures where you'd end up with an extra barrier either after the
>> > unlock() or before the lock() operation. However, I remain absolutely in
>> > favour of strengthening the ordering of the _critical sections_ guarded by
>> > the locks to be RCsc.
>> 
>> If by this you mean the critical sections when observed only by other
>> critical sections for a given lock, then everyone is already there.
> 
> No, I mean the case where somebody without the lock (but using memory
> barriers) can observe the critical sections out of order (i.e. W -> R
> order is not maintained).

This is a sincere question: why is this important? I mean, _any_
restriction on reordering makes things easier by definition, and I can't
argue with that, but why is this one in particular seen as a problem?
It just seems disproportionate.

We naturally think of the accesses within a locked region as atomic as a
whole (provided the other parties are doing the proper locking too). So,
just as atomic operations, aligned stores, etc. can be reordered, I don't
see why these should have any particular ordering either, or why an
unlock() should pair with a later lock() of an unrelated lock to provide
some ordering.

It gives the idea that individual lock operations in isolation should be
or do something special, but I think that's the wrong way to think about
it: the lock and the unlock operate on a specific lock word and protect
specific data against other processors that access the same data under
the same locks.

If you don't know what you're doing or don't want to think about
ordering, perform accesses under locks. If you don't lock, you get to
think about ordering. At which point, sure, two sets of operations from
different critical sections could go out of order, but so can any two
stores. Or two stores from inside the one critical section, if you are
not holding the correct lock.
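
One way to read that last point, as a quick sketch (made-up name): an
observer that takes no lock at all can see two stores from a single
critical section out of order, because nothing orders the two
WRITE_ONCE()s with respect to it.

C one-cs-two-stores-sketch

(*
 * Expected result: Sometimes
 *
 * Nothing orders the two stores inside the critical section as seen by
 * an observer that does not take the lock.
 *)

{}

P0(spinlock_t *s, int *x, int *y)
{
	spin_lock(s);
	WRITE_ONCE(*x, 1);
	WRITE_ONCE(*y, 1);
	spin_unlock(s);
}

P1(int *x, int *y)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*y);
	smp_rmb();
	r2 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 1:r2=0)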

Thanks,
Nick

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-17  3:21                     ` Nicholas Piggin
@ 2021-09-17  5:31                       ` Nicholas Piggin
  0 siblings, 0 replies; 41+ messages in thread
From: Nicholas Piggin @ 2021-09-17  5:31 UTC (permalink / raw)
  To: Paul E. McKenney, Will Deacon
  Cc: Arnaldo Carvalho de Melo, Alexander Shishkin, dlustig,
	Stephane Eranian, Peter Anvin, Jiri Olsa,
	Linux Kernel Mailing List, linux-tip-commits, Ingo Molnar, mpe,
	palmer, Andrea Parri, paul.walmsley, Peter Zijlstra, Alan Stern,
	Thomas Gleixner, Linus Torvalds, Vince Weaver

Excerpts from Nicholas Piggin's message of September 17, 2021 1:21 pm:
> Excerpts from Will Deacon's message of September 10, 2021 9:08 pm:
>> Hi Paul,
>> 
>> On Thu, Sep 09, 2021 at 10:46:35AM -0700, Paul E. McKenney wrote:
>>> On Thu, Sep 09, 2021 at 02:35:36PM +0100, Will Deacon wrote:
>>> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>>> > > On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
>>> > > > then I think it's entirely reasonable to
>>> > > > 
>>> > > >         spin_unlock(&r);
>>> > > >         spin_lock(&s);
>>> > > > 
>>> > > > cannot be reordered.
>>> > > 
>>> > > I'm obviously completely in favour of that :-)
>>> > 
>>> > I don't think we should require the accesses to the actual lockwords to
>>> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
>>> > architectures where you'd end up with an extra barrier either after the
>>> > unlock() or before the lock() operation. However, I remain absolutely in
>>> > favour of strengthening the ordering of the _critical sections_ guarded by
>>> > the locks to be RCsc.
>>> 
>>> If by this you mean the critical sections when observed only by other
>>> critical sections for a given lock, then everyone is already there.
>> 
>> No, I mean the case where somebody without the lock (but using memory
>> barriers) can observe the critical sections out of order (i.e. W -> R
>> order is not maintained).
> 
> This is a sincere question: why is this important? I mean, _any_
> restriction on reordering makes things easier by definition, and I can't
> argue with that, but why is this one in particular seen as a problem?
> It just seems disproportionate.
> 
> We naturally think of the accesses within a locked region as atomic as a
> whole (provided the other parties are doing the proper locking too). So,
> just as atomic operations, aligned stores, etc. can be reordered, I don't
> see why these should have any particular ordering either, or why an
> unlock() should pair with a later lock() of an unrelated lock to provide
> some ordering.
> 
> It gives the idea that individual lock operations in isolation should be
> or do something special, but I think that's the wrong way to think about
> it: the lock and the unlock operate on a specific lock word and protect
> specific data against other processors that access the same data under
> the same locks.
> 
> If you don't know what you're doing or don't want to think about
> ordering, perform accesses under locks. If you don't lock, you get to
> think about ordering. At which point, sure, two sets of operations from
> different critical sections could go out of order, but so can any two
> stores. Or two stores from inside the one critical section, if you are
> not holding the correct lock.

It doesn't actually relieve the burden of thinking about barriers much
at all, come to think of it.

spin_lock(&foo);
x = 1;
spin_unlock(&foo);
spin_lock(&bar);
y = 1;
spin_unlock(&bar);

vs

if (READ_ONCE(y) == 1) {
    // spin_unlock(&foo)+spin_lock(&bar) provides store ordering
    smp_rmb();
    BUG_ON(READ_ONCE(x) == 0);
}

Then, if you didn't comment the store ordering requirement in the first
snippet and you later patch things to simplify or add functionality:

spin_lock(&foo);
x = 1;
spin_lock(&bar);
y = 1;
spin_unlock(&bar);
z = 1;
spin_unlock(&foo);

or

spin_lock(&baz);
x = 1;
y = 1;
spin_unlock(&baz);

Then you have broken it, because you thought that being clever and
avoiding thinking or talking about ordering in code which performs
memory accesses outside of locks was improving the situation.

It's not good practice. If there is *any* unlocked memory access, you 
should always think about and comment *all* memory orderings. Locking
should not ever be relied on to give you some kind of implicit semantics
that you think should be obvious. All the accesses always need thought
and they always need comments.

spin_lock(&foo);
// The spin_unlock+spin_lock orders this store before the store to y;
// the corresponding smp_rmb() is in function().
x = 1;
spin_unlock(&foo);
spin_lock(&bar);
// See x
y = 1;
spin_unlock(&bar);

Once you do that, does it *really* break the bank to add a line of code?

spin_lock(&foo);
// See smp_unlock_lock_mb() below
x = 1;
spin_unlock(&foo);
// This orders the store to x above before the
// store to y below.
smp_unlock_lock_mb();
spin_lock(&bar);
// See smp_unlock_lock_mb() above
y = 1;
spin_unlock(&bar);

I can't see how that line of code creates a fundamentally bigger
problem.
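
For concreteness, a minimal sketch of how such a helper could look;
smp_unlock_lock_mb() is a made-up name that does not exist in the tree,
and the portable fallback would simply be a full barrier:

/*
 * Hypothetical helper: order a preceding unlock's critical section
 * against a following lock's critical section, including the W->R
 * case.  Architectures whose unlock+lock sequence is already strong
 * enough could override this with a no-op.
 */
#ifndef smp_unlock_lock_mb
#define smp_unlock_lock_mb()	smp_mb()
#endif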

If you don't need the performance and don't want to deal with ordering,
*always use locks*. If you absolutely can't always use locks, *always
document any important memory-access ordering that affects or can affect
unlocked memory accesses*. We are all agreed on this rule, right? So
what am I missing?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [tip:locking/core] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire
  2021-09-10 11:08                   ` Will Deacon
  2021-09-17  3:21                     ` Nicholas Piggin
@ 2021-09-17 14:36                     ` Michael Ellerman
  1 sibling, 0 replies; 41+ messages in thread
From: Michael Ellerman @ 2021-09-17 14:36 UTC (permalink / raw)
  To: Will Deacon, Paul E. McKenney
  Cc: Peter Zijlstra, Linus Torvalds, Alan Stern, Alexander Shishkin,
	Peter Anvin, Andrea Parri, Ingo Molnar, Vince Weaver,
	Thomas Gleixner, Jiri Olsa, Arnaldo Carvalho de Melo,
	Linux Kernel Mailing List, Stephane Eranian, linux-tip-commits,
	palmer, paul.walmsley, dlustig, npiggin

Will Deacon <will@kernel.org> writes:
> On Thu, Sep 09, 2021 at 10:46:35AM -0700, Paul E. McKenney wrote:
>> On Thu, Sep 09, 2021 at 02:35:36PM +0100, Will Deacon wrote:
>> > On Thu, Sep 09, 2021 at 09:25:30AM +0200, Peter Zijlstra wrote:
>> > > On Wed, Sep 08, 2021 at 09:08:33AM -0700, Linus Torvalds wrote:
>> > > > then I think it's entirely reasonable to
>> > > > 
>> > > >         spin_unlock(&r);
>> > > >         spin_lock(&s);
>> > > > 
>> > > > cannot be reordered.
>> > > 
>> > > I'm obviously completely in favour of that :-)
>> > 
>> > I don't think we should require the accesses to the actual lockwords to
>> > be ordered here, as it becomes pretty onerous for relaxed LL/SC
>> > architectures where you'd end up with an extra barrier either after the
>> > unlock() or before the lock() operation. However, I remain absolutely in
>> > favour of strengthening the ordering of the _critical sections_ guarded by
>> > the locks to be RCsc.
>> 
>> If by this you mean the critical sections when observed only by other
>> critical sections for a given lock, then everyone is already there.
>
> No, I mean the case where somebody without the lock (but using memory
> barriers) can observe the critical sections out of order (i.e. W -> R
> order is not maintained).
>
>> However...
>> 
>> > Last time this came up, I think the RISC-V folks were generally happy to
>> > implement whatever was necessary for Linux [1]. The thing that was stopping
>> > us was Power (see CONFIG_ARCH_WEAK_RELEASE_ACQUIRE), wasn't it? I think
>> > Michael saw quite a bit of variety in the impact on benchmarks [2] across
>> > different machines. So the question is whether newer Power machines are less
>> > affected to the degree that we could consider making this change again.
>> 
>> Last I knew, on Power a pair of critical sections for a given lock could
>> be observed out of order (writes from the earlier critical section vs.
>> reads from the later critical section), but only by CPUs not holding
>> that lock.  Also last I knew, tightening this would require upgrading
>> some of the locking primitives' lwsync instructions to sync instructions.
>> But I know very little about Power 10.
>
> Yup, that's the one. This is the primary reason why we have the confusing
> "RCtso" model today so this is my periodic "Do we still need this?" poking
> for the Power folks :)
>
> If the SYNC is a disaster for Power, then I'll ask again in another ~3 years
> time in the hope that newer micro-architectures can swallow the instruction
> more easily, but the results last time weren't hugely compelling and so _if_
> there's an opportunity to make locking more "obvious" then I'm all for it.

I haven't had time to do the full set of numbers like I did last time,
but a quick test shows it's still about a 20-25% drop switching to sync.

So on that basis we'd definitely rather not :)

I'll try and get some more numbers next week.
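
For anyone not steeped in Power, "switching to sync" refers, roughly, to
the release barrier on the unlock side; an illustrative sketch only, not
the actual arch/powerpc code:

/*
 * Illustrative sketch, not the real arch_spin_unlock().  The cost being
 * measured is the difference between lwsync, which does not order an
 * earlier store against a later load, and a full sync, which does.
 */
static inline void sketch_spin_unlock(unsigned int *lock)
{
	asm volatile("lwsync" : : : "memory");	/* today's release barrier */
	/* asm volatile("sync" : : : "memory");	   the RCsc alternative */
	WRITE_ONCE(*lock, 0);
}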

cheers

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-09-17 14:36 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-26 18:28 [PATCH memory-model 0/5] Updates to the formal memory model Paul E. McKenney
2018-09-26 18:29 ` [PATCH memory-model 1/5] tools/memory-model: Add litmus-test naming scheme Paul E. McKenney
2018-10-02 10:10   ` [tip:locking/core] " tip-bot for Paul E. McKenney
2018-09-26 18:29 ` [PATCH memory-model 2/5] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire Paul E. McKenney
2018-10-02 10:11   ` [tip:locking/core] " tip-bot for Alan Stern
2021-09-08 11:00     ` Peter Zijlstra
2021-09-08 11:44       ` Peter Zijlstra
2021-09-08 14:42         ` Alan Stern
2021-09-08 15:12           ` Peter Zijlstra
2021-09-08 16:08           ` Linus Torvalds
2021-09-09  7:25             ` Peter Zijlstra
2021-09-09 13:35               ` Will Deacon
2021-09-09 17:02                 ` Linus Torvalds
2021-09-09 18:59                   ` Alan Stern
2021-09-09 17:03                 ` Dan Lustig
2021-09-09 18:00                   ` Paul E. McKenney
2021-09-10 14:20                     ` Boqun Feng
2021-09-10 15:33                       ` Palmer Dabbelt
2021-09-10 16:36                       ` Alan Stern
2021-09-10 17:12                         ` Peter Zijlstra
2021-09-10 17:56                           ` Alan Stern
2021-09-10 17:17                         ` Peter Zijlstra
2021-09-12  0:26                         ` Boqun Feng
2021-09-10  0:01                   ` Boqun Feng
2021-09-10  5:37                     ` Boqun Feng
2021-09-10  9:33                     ` Peter Zijlstra
2021-09-10 10:04                       ` Boqun Feng
2021-09-10 13:48                         ` Dan Lustig
2021-09-10 14:15                           ` Boqun Feng
2021-09-09 17:46                 ` Paul E. McKenney
2021-09-10 11:08                   ` Will Deacon
2021-09-17  3:21                     ` Nicholas Piggin
2021-09-17  5:31                       ` Nicholas Piggin
2021-09-17 14:36                     ` Michael Ellerman
2018-09-26 18:29 ` [PATCH memory-model 3/5] tools/memory-model: Fix a README typo Paul E. McKenney
2018-10-02 10:11   ` [tip:locking/core] " tip-bot for SeongJae Park
2018-09-26 18:29 ` [PATCH memory-model 4/5] tools/memory-model: Add more LKMM limitations Paul E. McKenney
2018-10-02 10:12   ` [tip:locking/core] " tip-bot for Paul E. McKenney
2018-09-26 18:29 ` [PATCH memory-model 5/5] doc: Replace smp_cond_acquire() with smp_cond_load_acquire() Paul E. McKenney
2018-10-02 10:12   ` [tip:locking/core] locking/memory-barriers: " tip-bot for Andrea Parri
2018-10-02  8:28 ` [PATCH memory-model 0/5] Updates to the formal memory model Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.