linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
@ 2021-09-28 21:15 Mathieu Desnoyers
  2021-09-29 12:06 ` Marco Elver
                   ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2021-09-28 21:15 UTC (permalink / raw)
  To: will, paulmck, Peter Zijlstra, Segher Boessenkool
  Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, stern,
	parri.andrea, boqun.feng, npiggin, dhowells, j.alglave,
	luc.maranget, akiyks, linux-toolchains, linux-arch

The control dependency ordering currently documented in
Documentation/memory-barriers.txt is fragile and can be broken by
various compiler optimizations.

The goal here is to prevent the compiler from being able to optimize a
conditional branch into something which lacks the control dependency,
while letting the compiler choose the best conditional branch in each
case.

Prevent the compiler from considering the two legs of a conditional
branch as identical by adding a distinct volatile asm in each leg of the
branch. Those asm do not emit any instruction nor data into the
resulting executable, and do not have any clobbers.

GNU describes asm volatile statements as having side-effects. [1]

C99 describes that accessing volatile objects are side-effects, and that
"at certain specified points in the execution sequence called sequence
points, all side effects of previous evaluations shall be complete
and no side effects of subsequent evaluations shall have taken
place". [2]

This ensures that the program order of READ_ONCE(), asm volatile in both
legs of the branch, and following WRITE_ONCE() and after_ctrl_dep()
barriers are preserved.

With this approach, the following code now keeps the control dependency:

        z = READ_ONCE(var1);
        if (ctrl_dep(z))
                WRITE_ONCE(var2, 5);
        else
                WRITE_ONCE(var2, 5);

And the ctrl_dep_eval() checking the constant triggers a build error
for:

        y = READ_ONCE(var1);
        if (ctrl_dep(y % 1))
                WRITE_ONCE(var2, 5);
        else
                WRITE_ONCE(var2, 6);

Which is good to have to ensure the compiler don't end up removing the
conditional branch because the it evaluates a constant.

Introduce the ctrl_dep macro in the generic headers, and use it
everywhere it appears relevant.  The approach taken is simply to
look for smp_acquire__after_ctrl_dep and "control dependency" across the
kernel sources, so a few other uses may have been missed.

The kernel documentation is also updated to reflect the need to use
the ctrl_dep macro and simplify the control dependencies section of
the documentation based on the additional guarantees provided by the
ctrl_dep macro.

In the btrfs code, change a smp_rmb() for a smp_acquire__after_ctrl_dep(),
which should be sufficient to guarantee LOAD-LOAD ordering.

I have validated the following code generation impacts on x86-64:

This patch does not alter the generated binary executables in ipc/sem.c,
ipc/mqueue.c, ipc/shm.c.

The presence of BUILD_BUG_ON() slightly changes the code layout in
kernel/events/ring_buffer.c, kernel/smp.c, and kernel/sched/core.c.

The asm volatile on the else leg of the control dependency further
restricts the compiler's ability to optimize code across the condition
for kernel/events/ring_buffer.c, kernel/smp.c, kernel/locking/rwsem.c.

The asm volatile on the then leg of the control dependency further
restricts the compiler's ability to optimize code across the condition
for kernel/sched/core.c, because its documented control dependency is on
the else leg of the branch.

Link: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html [1]
Link: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf [2]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: will@kernel.org
Cc: paulmck@kernel.org
Cc: stern@rowland.harvard.edu
Cc: parri.andrea@gmail.com
Cc: boqun.feng@gmail.com
Cc: npiggin@gmail.com
Cc: dhowells@redhat.com
Cc: j.alglave@ucl.ac.uk
Cc: luc.maranget@inria.fr
Cc: akiyks@gmail.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-toolchains@vger.kernel.org
Cc: linux-arch@vger.kernel.org
---
 Documentation/memory-barriers.txt | 205 ++++++++----------------------
 fs/btrfs/volumes.c                |   6 +-
 include/asm-generic/barrier.h     |  58 ++++++++-
 include/linux/refcount.h          |   2 +-
 ipc/mqueue.c                      |   2 +-
 ipc/msg.c                         |   2 +-
 ipc/sem.c                         |   2 +-
 kernel/events/ring_buffer.c       |   6 +-
 kernel/locking/rwsem.c            |   4 +-
 kernel/sched/core.c               |   2 +-
 kernel/smp.c                      |   2 +-
 11 files changed, 126 insertions(+), 165 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 7367ada13208..466f72fc4d3c 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -666,12 +666,18 @@ Control dependencies can be a bit tricky because current compilers do
 not understand them.  The purpose of this section is to help you prevent
 the compiler's ignorance from breaking your code.
 
+Control dependencies should be made explicit by using ctrl_dep() around
+the condition expression of an if () statement, for () loop, or while ()
+loop.  This ensures the compiler is prohibited from optimizing away the
+conditional branch, and from lifting stores and memory barriers outside
+the selection statement.
+
 A load-load control dependency requires a full read memory barrier, not
 simply a data dependency barrier to make it work correctly.  Consider the
 following bit of code:
 
 	q = READ_ONCE(a);
-	if (q) {
+	if (ctrl_dep(q)) {
 		<data dependency barrier>  /* BUG: No data dependency!!! */
 		p = READ_ONCE(b);
 	}
@@ -683,8 +689,8 @@ the load from b as having happened before the load from a.  In such a
 case what's actually required is:
 
 	q = READ_ONCE(a);
-	if (q) {
-		<read barrier>
+	if (ctrl_dep(q)) {
+		<smp_acquire__after_ctrl_dep>
 		p = READ_ONCE(b);
 	}
 
@@ -692,92 +698,45 @@ However, stores are not speculated.  This means that ordering -is- provided
 for load-store control dependencies, as in the following example:
 
 	q = READ_ONCE(a);
-	if (q) {
+	if (ctrl_dep(q)) {
 		WRITE_ONCE(b, 1);
 	}
 
 Control dependencies pair normally with other types of barriers.
 That said, please note that neither READ_ONCE() nor WRITE_ONCE()
-are optional! Without the READ_ONCE(), the compiler might combine the
+are optional!  Without the READ_ONCE(), the compiler might combine the
 load from 'a' with other loads from 'a'.  Without the WRITE_ONCE(),
 the compiler might combine the store to 'b' with other stores to 'b'.
 Either can result in highly counterintuitive effects on ordering.
 
-Worse yet, if the compiler is able to prove (say) that the value of
-variable 'a' is always non-zero, it would be well within its rights
-to optimize the original example by eliminating the "if" statement
-as follows:
-
-	q = a;
-	b = 1;  /* BUG: Compiler and CPU can both reorder!!! */
-
-So don't leave out the READ_ONCE().
+If the compiler is able to prove (say) that the value of variable 'a' is
+always non-zero, it would be well within its rights to optimize the
+original example by eliminating the "if" statement as follows.
+ctrl_dep ensures that a build error is generated if the condition
+evaluates to a true/false constant.
 
-It is tempting to try to enforce ordering on identical stores on both
+Using ctrl_dep allows enforcing ordering on identical stores on both
 branches of the "if" statement as follows:
 
 	q = READ_ONCE(a);
-	if (q) {
-		barrier();
+	if (ctrl_dep(q)) {
 		WRITE_ONCE(b, 1);
 		do_something();
 	} else {
-		barrier();
 		WRITE_ONCE(b, 1);
 		do_something_else();
 	}
 
-Unfortunately, current compilers will transform this as follows at high
-optimization levels:
+ctrl_dep emits distinct asm volatile within each leg of the branch, thus
+preventing the compiler from lifting both WRITE_ONCE outside of the
+selection statement.
 
-	q = READ_ONCE(a);
-	barrier();
-	WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
-	if (q) {
-		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
-		do_something();
-	} else {
-		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
-		do_something_else();
-	}
-
-Now there is no conditional between the load from 'a' and the store to
-'b', which means that the CPU is within its rights to reorder them:
-The conditional is absolutely required, and must be present in the
-assembly code even after all compiler optimizations have been applied.
-Therefore, if you need ordering in this example, you need explicit
-memory barriers, for example, smp_store_release():
+In addition, if operations applied to the condition variable result in
+the compiler proving it to be a constant (true/false), ctrl_dep will
+cause a compile-time error.  For example:
 
 	q = READ_ONCE(a);
-	if (q) {
-		smp_store_release(&b, 1);
-		do_something();
-	} else {
-		smp_store_release(&b, 1);
-		do_something_else();
-	}
-
-In contrast, without explicit memory barriers, two-legged-if control
-ordering is guaranteed only when the stores differ, for example:
-
-	q = READ_ONCE(a);
-	if (q) {
-		WRITE_ONCE(b, 1);
-		do_something();
-	} else {
-		WRITE_ONCE(b, 2);
-		do_something_else();
-	}
-
-The initial READ_ONCE() is still required to prevent the compiler from
-proving the value of 'a'.
-
-In addition, you need to be careful what you do with the local variable 'q',
-otherwise the compiler might be able to guess the value and again remove
-the needed conditional.  For example:
-
-	q = READ_ONCE(a);
-	if (q % MAX) {
+	if (ctrl_dep(q % MAX)) {
 		WRITE_ONCE(b, 1);
 		do_something();
 	} else {
@@ -786,86 +745,41 @@ the needed conditional.  For example:
 	}
 
 If MAX is defined to be 1, then the compiler knows that (q % MAX) is
-equal to zero, in which case the compiler is within its rights to
-transform the above code into the following:
-
-	q = READ_ONCE(a);
-	WRITE_ONCE(b, 2);
-	do_something_else();
-
-Given this transformation, the CPU is not required to respect the ordering
-between the load from variable 'a' and the store to variable 'b'.  It is
-tempting to add a barrier(), but this does not help.  The conditional
-is gone, and the barrier won't bring it back.  Therefore, if you are
-relying on this ordering, you should make sure that MAX is greater than
-one, perhaps as follows:
-
-	q = READ_ONCE(a);
-	BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
-	if (q % MAX) {
-		WRITE_ONCE(b, 1);
-		do_something();
-	} else {
-		WRITE_ONCE(b, 2);
-		do_something_else();
-	}
-
-Please note once again that the stores to 'b' differ.  If they were
-identical, as noted earlier, the compiler could pull this store outside
-of the 'if' statement.
+equal to zero, which would allow it to remove the conditional branch
+altogether. The ctrl_dep macro prevents this by causing a compile-time
+error.
 
-You must also be careful not to rely too much on boolean short-circuit
-evaluation.  Consider this example:
+The same compiler error will be generated by ctrl_dep if boolean
+short-circuit evaluation ends up with a constant condition variable.
+Consider this example:
 
 	q = READ_ONCE(a);
-	if (q || 1 > 0)
+	if (ctrl_dep(q || 1 > 0))
 		WRITE_ONCE(b, 1);
 
 Because the first condition cannot fault and the second condition is
-always true, the compiler can transform this example as following,
-defeating control dependency:
-
-	q = READ_ONCE(a);
-	WRITE_ONCE(b, 1);
+always true, the compiler considers the condition as a constant (true).
+ctrl_dep causes a compiler error in this case.
 
-This example underscores the need to ensure that the compiler cannot
-out-guess your code.  More generally, although READ_ONCE() does force
-the compiler to actually emit code for a given load, it does not force
-the compiler to use the results.
+By using ctrl_dep, control dependencies apply to both the then-clause
+and else-clause of the if-statement in question, as well as to code
+following the if-statement:
 
-In addition, control dependencies apply only to the then-clause and
-else-clause of the if-statement in question.  In particular, it does
-not necessarily apply to code following the if-statement:
 
 	q = READ_ONCE(a);
-	if (q) {
+	if (ctrl_dep(q)) {
 		WRITE_ONCE(b, 1);
 	} else {
 		WRITE_ONCE(b, 2);
 	}
-	WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from 'a'. */
-
-It is tempting to argue that there in fact is ordering because the
-compiler cannot reorder volatile accesses and also cannot reorder
-the writes to 'b' with the condition.  Unfortunately for this line
-of reasoning, the compiler might compile the two writes to 'b' as
-conditional-move instructions, as in this fanciful pseudo-assembly
-language:
-
-	ld r1,a
-	cmp r1,$0
-	cmov,ne r4,$1
-	cmov,eq r4,$2
-	st r4,b
-	st $1,c
-
-A weakly ordered CPU would have no dependency of any sort between the load
-from 'a' and the store to 'c'.  The control dependencies would extend
-only to the pair of cmov instructions and the store depending on them.
-In short, control dependencies apply only to the stores in the then-clause
-and else-clause of the if-statement in question (including functions
-invoked by those two clauses), not to code following that if-statement.
+	WRITE_ONCE(c, 1);
 
+Because ctrl_dep emits distinct asm volatile within each leg of the if
+statement, the compiler cannot transform the two writes to 'b' into a
+conditional-move (cmov) instruction, thus ensuring the presence of a
+conditional branch.  Also because the ctrl_dep emits asm volatile within
+each leg of the if statement, the compiler cannot move the write to 'c'
+before the conditional branch.
 
 Note well that the ordering provided by a control dependency is local
 to the CPU containing it.  See the section on "Multicopy atomicity"
@@ -881,40 +795,33 @@ In summary:
       use smp_rmb(), smp_wmb(), or, in the case of prior stores and
       later loads, smp_mb().
 
-  (*) If both legs of the "if" statement begin with identical stores to
-      the same variable, then those stores must be ordered, either by
-      preceding both of them with smp_mb() or by using smp_store_release()
-      to carry out the stores.  Please note that it is -not- sufficient
-      to use barrier() at beginning of each leg of the "if" statement
-      because, as shown by the example above, optimizing compilers can
-      destroy the control dependency while respecting the letter of the
-      barrier() law.
+  (*) Control dependencies should be made explicit using the ctrl_dep
+      macro.
 
   (*) Control dependencies require at least one run-time conditional
       between the prior load and the subsequent store, and this
-      conditional must involve the prior load.  If the compiler is able
-      to optimize the conditional away, it will have also optimized
-      away the ordering.  Careful use of READ_ONCE() and WRITE_ONCE()
-      can help to preserve the needed conditional.
+      conditional must involve the prior load.  Careful use of
+      READ_ONCE() and WRITE_ONCE() can help to preserve the needed
+      conditional.
 
   (*) Control dependencies require that the compiler avoid reordering the
       dependency into nonexistence.  Careful use of READ_ONCE() or
       atomic{,64}_read() can help to preserve your control dependency.
       Please see the COMPILER BARRIER section for more information.
 
-  (*) Control dependencies apply only to the then-clause and else-clause
+  (*) Control dependencies apply to the then-clause and else-clause
       of the if-statement containing the control dependency, including
-      any functions that these two clauses call.  Control dependencies
-      do -not- apply to code following the if-statement containing the
-      control dependency.
+      any functions that these two clauses call, as well as code
+      following the if-statement containing the control dependency.
 
   (*) Control dependencies pair normally with other types of barriers.
 
   (*) Control dependencies do -not- provide multicopy atomicity.  If you
       need all the CPUs to see a given store at the same time, use smp_mb().
 
-  (*) Compilers do not understand control dependencies.  It is therefore
-      your job to ensure that they do not break your code.
+  (*) Compilers do not understand control dependencies.  Use of
+      ctrl_dep is required to prevent the compiler optimizations
+      from breaking your code.
 
 
 SMP BARRIER PAIRING
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2ec3b8ac8fa3..182cd06d7d37 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7772,7 +7772,7 @@ int btrfs_run_dev_stats(struct btrfs_trans_handle *trans)
 	mutex_lock(&fs_devices->device_list_mutex);
 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
 		stats_cnt = atomic_read(&device->dev_stats_ccnt);
-		if (!device->dev_stats_valid || stats_cnt == 0)
+		if (ctrl_dep(!device->dev_stats_valid || stats_cnt == 0))
 			continue;
 
 
@@ -7780,14 +7780,14 @@ int btrfs_run_dev_stats(struct btrfs_trans_handle *trans)
 		 * There is a LOAD-LOAD control dependency between the value of
 		 * dev_stats_ccnt and updating the on-disk values which requires
 		 * reading the in-memory counters. Such control dependencies
-		 * require explicit read memory barriers.
+		 * require explicit acquire barriers.
 		 *
 		 * This memory barriers pairs with smp_mb__before_atomic in
 		 * btrfs_dev_stat_inc/btrfs_dev_stat_set and with the full
 		 * barrier implied by atomic_xchg in
 		 * btrfs_dev_stats_read_and_reset
 		 */
-		smp_rmb();
+		smp_acquire__after_ctrl_dep();
 
 		ret = update_dev_stat_item(trans, device);
 		if (!ret)
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 640f09479bdf..7b32c5f67256 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -187,6 +187,60 @@ do {									\
 #define virt_store_release(p, v) __smp_store_release(p, v)
 #define virt_load_acquire(p) __smp_load_acquire(p)
 
+#ifndef ctrl_dep_eval
+#define ctrl_dep_eval(x)        ({ BUILD_BUG_ON(__builtin_constant_p(!!(x))); (x); })
+#endif
+
+#ifndef ctrl_dep_asm_volatile
+/*
+ * Emit a distinct asm volatile for each leg of the branch. Their
+ * purpose is to prevent the compiler from optimizing away the
+ * conditional branch, and prevent it from lifting following volatile
+ * stores and *_after_ctrl_dep() barriers out of the selection
+ * statement.
+ *
+ * Those asm volatile do not generate any code; they only affect
+ * compiler optimizations.  They respectively emit an unused "0:"
+ * and "1:" label in each leg of the branch.
+ */
+#define ctrl_dep_asm_volatile(x)	({ asm volatile (__stringify(x) ":\n\t"); (x); })
+#endif
+
+
+/**
+ * ctrl_dep() - Provide a control-dependency
+ *
+ * if (ctrl_dep(READ_ONCE(A))
+ *	WRITE_ONCE(B, 1);
+ *
+ * and
+ *
+ * do {
+ *   [...]
+ * } while (ctrl_dep(READ_ONCE(A)));
+ * WRITE_ONCE(B, 1);
+ *
+ * will ensure that the STORE to B happens after the LOAD of A. Normally a
+ * control dependency relies on a conditional branch having a data dependency
+ * on the LOAD and an architecture's inability to speculate STOREs. IOW, this
+ * provides a LOAD->STORE order.
+ *
+ * Due to optimizing compilers, extra care is needed; as per the example above
+ * the LOAD must be 'volatile' qualified in order to ensure the compiler
+ * actually emits the load, such that the data-dependency to the conditional
+ * branch can be formed.
+ *
+ * Secondly, the compiler must be prohibited from lifting anything out of the
+ * selection statement, as this would obviously also break the ordering.
+ *
+ * Thirdly, architectures that allow the LOAD->STORE reorder must ensure
+ * the compiler actually emits the conditional branch instruction.
+ */
+
+#ifndef ctrl_dep
+#define ctrl_dep(x)          ((ctrl_dep_eval(x) && ctrl_dep_asm_volatile(1)) || ctrl_dep_asm_volatile(0))
+#endif
+
 /**
  * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
  *
@@ -201,7 +255,7 @@ do {									\
 #endif
 
 /**
- * smp_cond_load_relaxed() - (Spin) wait for cond with no ordering guarantees
+ * smp_cond_load_relaxed() - (Spin) wait for cond with control dependency
  * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
  *
@@ -216,7 +270,7 @@ do {									\
 	__unqual_scalar_typeof(*ptr) VAL;			\
 	for (;;) {						\
 		VAL = READ_ONCE(*__PTR);			\
-		if (cond_expr)					\
+		if (ctrl_dep(cond_expr))			\
 			break;					\
 		cpu_relax();					\
 	}							\
diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index b8a6e387f8f9..afc7e08648bd 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -274,7 +274,7 @@ static inline __must_check bool __refcount_sub_and_test(int i, refcount_t *r, in
 	if (oldp)
 		*oldp = old;
 
-	if (old == i) {
+	if (ctrl_dep(old == i)) {
 		smp_acquire__after_ctrl_dep();
 		return true;
 	}
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 5becca9be867..5246dec231ac 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -715,7 +715,7 @@ static int wq_sleep(struct mqueue_inode_info *info, int sr,
 		time = schedule_hrtimeout_range_clock(timeout, 0,
 			HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-		if (READ_ONCE(ewp->state) == STATE_READY) {
+		if (ctrl_dep(READ_ONCE(ewp->state) == STATE_READY)) {
 			/* see MQ_BARRIER for purpose/pairing */
 			smp_acquire__after_ctrl_dep();
 			retval = 0;
diff --git a/ipc/msg.c b/ipc/msg.c
index a0d05775af2c..046c54a0c526 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -1213,7 +1213,7 @@ static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, in
 		 * signal) it will either see the message and continue ...
 		 */
 		msg = READ_ONCE(msr_d.r_msg);
-		if (msg != ERR_PTR(-EAGAIN)) {
+		if (ctrl_dep(msg != ERR_PTR(-EAGAIN))) {
 			/* see MSG_BARRIER for purpose/pairing */
 			smp_acquire__after_ctrl_dep();
 
diff --git a/ipc/sem.c b/ipc/sem.c
index 6693daf4fe11..6d65948731e6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2183,7 +2183,7 @@ long __do_semtimedop(int semid, struct sembuf *sops,
 		 * window between wake_q_add() and wake_up_q().
 		 */
 		error = READ_ONCE(queue.status);
-		if (error != -EINTR) {
+		if (ctrl_dep(error != -EINTR)) {
 			/* see SEM_BARRIER_2 for purpose/pairing */
 			smp_acquire__after_ctrl_dep();
 			goto out;
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 52868716ec35..243c65a6d008 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -193,9 +193,9 @@ __perf_output_begin(struct perf_output_handle *handle,
 		tail = READ_ONCE(rb->user_page->data_tail);
 		offset = head = local_read(&rb->head);
 		if (!rb->overwrite) {
-			if (unlikely(!ring_buffer_has_space(head, tail,
+			if (ctrl_dep(unlikely(!ring_buffer_has_space(head, tail,
 							    perf_data_size(rb),
-							    size, backward)))
+							    size, backward))))
 				goto fail;
 		}
 
@@ -429,7 +429,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 		 * control dependency barrier separating aux_tail load from aux data
 		 * store that will be enabled on successful return
 		 */
-		if (!handle->size) { /* A, matches D */
+		if (ctrl_dep(!handle->size)) { /* A, matches D */
 			event->pending_disable = smp_processor_id();
 			perf_output_wakeup(handle);
 			WRITE_ONCE(rb->aux_nest, 0);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 000e8d5a2884..9ea81d1f4c37 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -942,8 +942,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 		 * exit the slowpath and return immediately as its
 		 * RWSEM_READER_BIAS has already been set in the count.
 		 */
-		if (!(atomic_long_read(&sem->count) &
-		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		if (ctrl_dep(!(atomic_long_read(&sem->count) &
+		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF)))) {
 			/* Provide lock ACQUIRE */
 			smp_acquire__after_ctrl_dep();
 			raw_spin_unlock_irq(&sem->wait_lock);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bba4128a3e6..17525063d5c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4008,7 +4008,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * A similar smb_rmb() lives in try_invoke_on_locked_down_task().
 	 */
 	smp_rmb();
-	if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
+	if (ctrl_dep(READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags)))
 		goto unlock;
 
 #ifdef CONFIG_SMP
diff --git a/kernel/smp.c b/kernel/smp.c
index f43ede0ab183..a284110ed7f6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -394,7 +394,7 @@ static void __csd_lock_wait(struct __call_single_data *csd)
 
 	ts1 = ts0 = sched_clock();
 	for (;;) {
-		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id))
+		if (ctrl_dep(csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id)))
 			break;
 		cpu_relax();
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-28 21:15 [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency Mathieu Desnoyers
@ 2021-09-29 12:06 ` Marco Elver
  2021-10-01 15:45   ` Mathieu Desnoyers
  2021-09-29 12:28 ` Florian Weimer
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Marco Elver @ 2021-09-29 12:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: will, paulmck, Peter Zijlstra, Segher Boessenkool, linux-kernel,
	Linus Torvalds, stern, parri.andrea, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, linux-toolchains,
	linux-arch

On Tue, Sep 28, 2021 at 05:15PM -0400, Mathieu Desnoyers wrote:
> The control dependency ordering currently documented in
> Documentation/memory-barriers.txt is fragile and can be broken by
> various compiler optimizations.
> 
> The goal here is to prevent the compiler from being able to optimize a
> conditional branch into something which lacks the control dependency,
> while letting the compiler choose the best conditional branch in each
> case.
> 
> Prevent the compiler from considering the two legs of a conditional
> branch as identical by adding a distinct volatile asm in each leg of the
> branch. Those asm do not emit any instruction nor data into the
> resulting executable, and do not have any clobbers.
> 
> GNU describes asm volatile statements as having side-effects. [1]
> 
> C99 describes that accessing volatile objects are side-effects, and that
> "at certain specified points in the execution sequence called sequence
> points, all side effects of previous evaluations shall be complete
> and no side effects of subsequent evaluations shall have taken
> place". [2]
> 
> This ensures that the program order of READ_ONCE(), asm volatile in both
> legs of the branch, and following WRITE_ONCE() and after_ctrl_dep()
> barriers are preserved.
> 
> With this approach, the following code now keeps the control dependency:
> 
>         z = READ_ONCE(var1);
>         if (ctrl_dep(z))
>                 WRITE_ONCE(var2, 5);
>         else
>                 WRITE_ONCE(var2, 5);
> 
> And the ctrl_dep_eval() checking the constant triggers a build error
> for:
> 
>         y = READ_ONCE(var1);
>         if (ctrl_dep(y % 1))
>                 WRITE_ONCE(var2, 5);
>         else
>                 WRITE_ONCE(var2, 6);
> 
> Which is good to have to ensure the compiler don't end up removing the
> conditional branch because the it evaluates a constant.
> 
> Introduce the ctrl_dep macro in the generic headers, and use it
> everywhere it appears relevant.  The approach taken is simply to
> look for smp_acquire__after_ctrl_dep and "control dependency" across the
> kernel sources, so a few other uses may have been missed.

It would be nice to know where and on which arch things are currently
broken of course, which might then also help raise confidence that this
implementation of ctrl_dep() works.

Because it's still hard to prove that the compiler will always do the
right thing with that implementation. The only concrete option I see
here is creating tests with known or potential breakage.

In an ideal world we could add such tests to the compiler's test-suites
themselves, assuming the behaviour your ctrl_dep() implementation relies
on is supposed to be guaranteed (and the compiler folks agree..).

Beyond the above trivial test case with 2 identical branches, here's
another one that breaks on arm64 with clang 12 (taken from
https://reviews.llvm.org/D103958):

 | int x, y;
 | void noinline test_ctrl_dep_broken1(void)
 | {
 | 	/* ARM: do NOT expect: cinc | expect: cbz */
 | 	if (ctrl_dep(READ_ONCE(x))) {
 | 		y = 1;
 | 	} else {
 | 		y = 2;
 | 	}
 | }

Without ctrl_dep():

 | <test_ctrl_dep_broken1>:
 |        d00042a8        adrp    x8, ffffffc010868000 <initcall_debug>
 |        b9400508        ldr     w8, [x8, #4]
 |        52800029        mov     w9, #0x1                        // #1
 |        7100011f        cmp     w8, #0x0
 |        1a891528        cinc    w8, w9, eq  // eq = none
 |        d00042a9        adrp    x9, ffffffc010868000 <initcall_debug>
 |        b9000928        str     w8, [x9, #8]
 |        d65f03c0        ret

			^^ no branch, compiler replaced branch with cinc!

with ctrl_dep():

 | <test_ctrl_dep_broken1>:
 |        d00042a8        adrp    x8, ffffffc010868000 <initcall_debug>
 |        b9400508        ldr     w8, [x8, #4]
 |        34000068        cbz     w8, ffffffc0100124b4 <test_ctrl_dep_broken1+0x14>
 |        52800028        mov     w8, #0x1                        // #1
 |        14000002        b       ffffffc0100124b8 <test_ctrl_dep_broken1+0x18>
 |        52800048        mov     w8, #0x2                        // #2
 |        d00042a9        adrp    x9, ffffffc010868000 <initcall_debug>
 |        b9000928        str     w8, [x9, #8]
 |        d65f03c0        ret

			^^ has cbz (and no cinc)

Which is good -- empirically, this seems to work for this case at least.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-28 21:15 [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency Mathieu Desnoyers
  2021-09-29 12:06 ` Marco Elver
@ 2021-09-29 12:28 ` Florian Weimer
  2021-09-29 17:41   ` Segher Boessenkool
  2021-09-30 13:28   ` Mathieu Desnoyers
  2021-09-29 14:47 ` Linus Torvalds
  2021-09-29 21:47 ` Segher Boessenkool
  3 siblings, 2 replies; 35+ messages in thread
From: Florian Weimer @ 2021-09-29 12:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: will, paulmck, Peter Zijlstra, Segher Boessenkool, linux-kernel,
	Linus Torvalds, stern, parri.andrea, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, linux-toolchains,
	linux-arch

* Mathieu Desnoyers:

> + * will ensure that the STORE to B happens after the LOAD of A. Normally a
> + * control dependency relies on a conditional branch having a data dependency
> + * on the LOAD and an architecture's inability to speculate STOREs. IOW, this
> + * provides a LOAD->STORE order.
> + *
> + * Due to optimizing compilers, extra care is needed; as per the example above
> + * the LOAD must be 'volatile' qualified in order to ensure the compiler
> + * actually emits the load, such that the data-dependency to the conditional
> + * branch can be formed.
> + *
> + * Secondly, the compiler must be prohibited from lifting anything out of the
> + * selection statement, as this would obviously also break the ordering.
> + *
> + * Thirdly, architectures that allow the LOAD->STORE reorder must ensure
> + * the compiler actually emits the conditional branch instruction.

If you need a specific instruction emitted, you need a compiler
intrinsic or inline assembly.

So something like this:

#define control_dep(x)                          \
  ({                                            \
    __typeof(x) x__ = (x);                      \
    __asm__("test $0, %0\n\t"                   \
            "jnz 1f\n\t"                        \
            "1:"                                \
            :: "r"(x__) : "cc");                \
  })

with an appropriate instruction sequence for each architecture.

I don't think it's possible to piggy-back this on something else.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-28 21:15 [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency Mathieu Desnoyers
  2021-09-29 12:06 ` Marco Elver
  2021-09-29 12:28 ` Florian Weimer
@ 2021-09-29 14:47 ` Linus Torvalds
  2021-09-29 14:54   ` Linus Torvalds
  2021-09-29 19:27   ` Mathieu Desnoyers
  2021-09-29 21:47 ` Segher Boessenkool
  3 siblings, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2021-09-29 14:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Will Deacon, Paul E. McKenney, Peter Zijlstra,
	Segher Boessenkool, Linux Kernel Mailing List, Alan Stern,
	Andrea Parri, Boqun Feng, Nick Piggin, David Howells,
	Jade Alglave, Luc Maranget, Akira Yokosawa, linux-toolchains,
	linux-arch

On Tue, Sep 28, 2021 at 2:15 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Introduce the ctrl_dep macro in the generic headers, and use it
> everywhere it appears relevant.

The control dependency is so subtle - just see our discussions - that
I really think every single use of it needs to have a comment about
why it's needed.

Right now, that patch seems to just sprinkle them more or less
randomly. That's absolutely not what I want. It will just mean that
other people start sprinkling them randomly even more, and nobody will
dare remove them.

So I'd literally want a comment about "this needs a control
dependency, because otherwise the compiler could merge the two
identical stores X and Y".

When you have a READ_ONCE() in the conditional, and a WRITE_ONCE() in
the statement protected by the conditional, there is *no* need to
randomly sprinkle noise that doesn't matter.

And if there *is* need ("look, we have that same store in both the if-
and the else-statement" or whatever), then say so, and state that
thing.

               Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 14:47 ` Linus Torvalds
@ 2021-09-29 14:54   ` Linus Torvalds
  2021-09-29 19:50     ` Mathieu Desnoyers
  2021-09-29 19:27   ` Mathieu Desnoyers
  1 sibling, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2021-09-29 14:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Will Deacon, Paul E. McKenney, Peter Zijlstra,
	Segher Boessenkool, Linux Kernel Mailing List, Alan Stern,
	Andrea Parri, Boqun Feng, Nick Piggin, David Howells,
	Jade Alglave, Luc Maranget, Akira Yokosawa, linux-toolchains,
	linux-arch

On Wed, Sep 29, 2021 at 7:47 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And if there *is* need ("look, we have that same store in both the if-
> and the else-statement" or whatever), then say so, and state that
> thing.

Side note: I'd also like the commit that introduces this to talk very
explicitly about the particular case that it is used doe and that it
fixes. No "this can happen". A "this happened, here's the _actual_
wrong code generation, and look how this new ctrl_dep() macro fixes
it".

When it's this subtle, I don't want theoretical arguments. I want
actual outstanding and real bugs.

Because I get the feeling that there were very few actual realistic
examples of this, only made-up theoretical cases that wouldn't ever
really be found in real code.

                Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 12:28 ` Florian Weimer
@ 2021-09-29 17:41   ` Segher Boessenkool
  2021-09-29 19:46     ` Florian Weimer
  2021-10-01 16:13     ` Mathieu Desnoyers
  2021-09-30 13:28   ` Mathieu Desnoyers
  1 sibling, 2 replies; 35+ messages in thread
From: Segher Boessenkool @ 2021-09-29 17:41 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Mathieu Desnoyers, will, paulmck, Peter Zijlstra, linux-kernel,
	Linus Torvalds, stern, parri.andrea, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, linux-toolchains,
	linux-arch

Hi!

On Wed, Sep 29, 2021 at 02:28:37PM +0200, Florian Weimer wrote:
> If you need a specific instruction emitted, you need a compiler
> intrinsic or inline assembly.

Not an intrinsic.  Builtins (like almost all other code) do not say
"generate this particular machine code", they say "generate code that
does <this>".  That is one reason why builtins are more powerful than
inline assembler (another related reason is that they tell the compiler
exactly what behaviour is expected).

> I don't think it's possible to piggy-back this on something else.

Unless we get a description of what this does in term of language
semantics (instead of generated machine code), there is no hope, even.


Segher

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 14:47 ` Linus Torvalds
  2021-09-29 14:54   ` Linus Torvalds
@ 2021-09-29 19:27   ` Mathieu Desnoyers
  2021-09-29 22:14     ` Linus Torvalds
  1 sibling, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2021-09-29 19:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, paulmck, Peter Zijlstra, Segher Boessenkool,
	linux-kernel, Alan Stern, Andrea Parri, Boqun Feng,
	Nicholas Piggin, David Howells, j alglave, luc maranget, akiyks,
	linux-toolchains, linux-arch

----- On Sep 29, 2021, at 10:47 AM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Tue, Sep 28, 2021 at 2:15 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> Introduce the ctrl_dep macro in the generic headers, and use it
>> everywhere it appears relevant.
> 
> The control dependency is so subtle - just see our discussions - that
> I really think every single use of it needs to have a comment about
> why it's needed.

I agree with you on thorough documentation of each control dependency,
perhaps just not about documentation of all compiler optimizations
affecting each of them.

> 
> Right now, that patch seems to just sprinkle them more or less
> randomly. That's absolutely not what I want. It will just mean that
> other people start sprinkling them randomly even more, and nobody will
> dare remove them.

Note that I have not found that many uses of control dependencies in the
kernel tree. When they are used, this happens to be code where speed
really matters though.

> So I'd literally want a comment about "this needs a control
> dependency, because otherwise the compiler could merge the two
> identical stores X and Y".

My hope with this ctrl_dep() macro is to remove at least some of
the caveats to keep in mind when using control dependency ordering.
Requiring to keep track of all relevant compiler optimizations on all
architectures while reasoning about memory barriers is error-prone.

> When you have a READ_ONCE() in the conditional, and a WRITE_ONCE() in
> the statement protected by the conditional, there is *no* need to
> randomly sprinkle noise that doesn't matter.

The main advantage in doing so would be documentation, both in terms of
letting the compiler know that this control dependency matters for
ordering, and in terms of simplifying the set of caveats to document
in Documentation/memory-barriers.txt.

> And if there *is* need ("look, we have that same store in both the if-
> and the else-statement" or whatever), then say so, and state that
> thing.

If we go for only using ctrl_dep() for scenarios which require it for
documented reasons, then we would need to leave in place all the
caveats details in Documentation/memory-barriers.txt, and document
that in those scenarios ctrl_dep() should be used. This would be a
starting point I guess.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 17:41   ` Segher Boessenkool
@ 2021-09-29 19:46     ` Florian Weimer
  2021-10-01 16:13     ` Mathieu Desnoyers
  1 sibling, 0 replies; 35+ messages in thread
From: Florian Weimer @ 2021-09-29 19:46 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Mathieu Desnoyers, will, paulmck, Peter Zijlstra, linux-kernel,
	Linus Torvalds, stern, parri.andrea, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, linux-toolchains,
	linux-arch

* Segher Boessenkool:

> Hi!
>
> On Wed, Sep 29, 2021 at 02:28:37PM +0200, Florian Weimer wrote:
>> If you need a specific instruction emitted, you need a compiler
>> intrinsic or inline assembly.
>
> Not an intrinsic.  Builtins (like almost all other code) do not say
> "generate this particular machine code", they say "generate code that
> does <this>".  That is one reason why builtins are more powerful than
> inline assembler (another related reason is that they tell the compiler
> exactly what behaviour is expected).

I meant that if the object code has to contain a specific instruction
sequence involving a conditional, it needs some form of compiler
support.  Adding some volatile here and some form of a compiler barrier
there is very brittle.

>> I don't think it's possible to piggy-back this on something else.
>
> Unless we get a description of what this does in term of language
> semantics (instead of generated machine code), there is no hope, even.

True.  For example, if the argument contains a sequence point, what does
that even mean?

Thanks,
Florian


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 14:54   ` Linus Torvalds
@ 2021-09-29 19:50     ` Mathieu Desnoyers
  2021-09-29 20:13       ` Mathieu Desnoyers
  0 siblings, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2021-09-29 19:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, paulmck, Peter Zijlstra, Segher Boessenkool,
	linux-kernel, Alan Stern, Andrea Parri, Boqun Feng,
	Nicholas Piggin, David Howells, j alglave, luc maranget, akiyks,
	linux-toolchains, linux-arch

----- On Sep 29, 2021, at 10:54 AM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Wed, Sep 29, 2021 at 7:47 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> And if there *is* need ("look, we have that same store in both the if-
>> and the else-statement" or whatever), then say so, and state that
>> thing.
> 
> Side note: I'd also like the commit that introduces this to talk very
> explicitly about the particular case that it is used doe and that it
> fixes. No "this can happen". A "this happened, here's the _actual_
> wrong code generation, and look how this new ctrl_dep() macro fixes
> it".
> 
> When it's this subtle, I don't want theoretical arguments. I want
> actual outstanding and real bugs.
> 
> Because I get the feeling that there were very few actual realistic
> examples of this, only made-up theoretical cases that wouldn't ever
> really be found in real code.

There is one particular scenario which concerns me in refcount_dec_and_test().
It relies on smp_acquire__after_ctrl_dep() to promote the control
dependency to an acquire barrier on success. Because it is exposed
within a static inline function, it hides the fact that control dependencies
are being used under the hood.

I have not identified a specific instance of oddly generated code, but this is
an accident waiting to happen. If we take this simplified refcount code
into godbolt.org and compile it for RISC-V rv64gc clang 12.0.1:

#define RISCV_FENCE(p, s) \
        __asm__ __volatile__ ("fence " #p "," #s : : : "memory")
#define __smp_rmb()     RISCV_FENCE(r,r)

volatile int var1;
volatile int refcount;

static inline bool refcount_dec_and_test(void)
{
    refcount--;
    if (refcount == 0) {
        __smp_rmb();    /* acquire after ctrl_dep */
        return true;
    }
    return false;
}

void fct(void)
{
    int x;

    if (refcount_dec_and_test()) {
        var1 = 0;
        return;
    }
    __smp_rmb();
    var1 = 1;
}

We end up with:

fct():                               # @fct()
        lui     a0, %hi(refcount)
        lw      a1, %lo(refcount)(a0)
        addiw   a1, a1, -1
        sw      a1, %lo(refcount)(a0)
        lw      a0, %lo(refcount)(a0)
        fence   r, r
        snez    a0, a0
        lui     a1, %hi(var1)
        sw      a0, %lo(var1)(a1)
        ret

Which lacks the conditional branch, and where the "fence r,r" instruction
does not properly order following stores after the refcount load.

Adding ctrl_dep() around the refcount == 0 check fixes this:

fct():                                # @fct()
        lui     a0, %hi(refcount)
        lw      a1, %lo(refcount)(a0)
        addiw   a1, a1, -1
        sw      a1, %lo(refcount)(a0)
        lw      a0, %lo(refcount)(a0)
        beqz    a0, .LBB0_2

        fence   r, r
        addi    a0, zero, 1
        j       .LBB0_3
.LBB0_2:

        fence   r, r
        mv      a0, zero
.LBB0_3:
        lui     a1, %hi(var1)
        sw      a0, %lo(var1)(a1)
        ret

I admit that this is still a "made up" example, although it is similar to the actual
implementation of refcount_dec_and_check(). But if we need to audit every user of this
API for wrongly generated code, we may be looking for a needle in a haystack.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 19:50     ` Mathieu Desnoyers
@ 2021-09-29 20:13       ` Mathieu Desnoyers
  0 siblings, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2021-09-29 20:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, paulmck, Peter Zijlstra, Segher Boessenkool,
	linux-kernel, Alan Stern, Andrea Parri, Boqun Feng,
	Nicholas Piggin, David Howells, j alglave, luc maranget, akiyks,
	linux-toolchains, linux-arch

----- On Sep 29, 2021, at 3:50 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

[...]

> void fct(void)
> {
>    int x;
> 
>    if (refcount_dec_and_test()) {
>        var1 = 0;

in this example, this should be "var1 = 1;", so both legs are similar,
otherwise we end up with a dependency on the load.

Thanks,

Mathieu

>        return;
>    }
>    __smp_rmb();
>    var1 = 1;
> }
-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-28 21:15 [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2021-09-29 14:47 ` Linus Torvalds
@ 2021-09-29 21:47 ` Segher Boessenkool
  2021-09-29 23:57   ` Paul E. McKenney
  3 siblings, 1 reply; 35+ messages in thread
From: Segher Boessenkool @ 2021-09-29 21:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: will, paulmck, Peter Zijlstra, linux-kernel, Linus Torvalds,
	stern, parri.andrea, boqun.feng, npiggin, dhowells, j.alglave,
	luc.maranget, akiyks, linux-toolchains, linux-arch

Hi!

On Tue, Sep 28, 2021 at 05:15:07PM -0400, Mathieu Desnoyers wrote:
> C99 describes that accessing volatile objects are side-effects, and that
> "at certain specified points in the execution sequence called sequence
> points, all side effects of previous evaluations shall be complete
> and no side effects of subsequent evaluations shall have taken
> place". [2]

But note that the kernel explicitly uses C89 (with GNU extensions).
Side effects are largely equal there though.

Also note that there may no place in the generated machine code that
corresponds exactly to some sequence point.  Sequence points are a
concept that applies to the source program and how that executes on the
abstract machine.

> +Because ctrl_dep emits distinct asm volatile within each leg of the if
> +statement, the compiler cannot transform the two writes to 'b' into a
> +conditional-move (cmov) instruction, thus ensuring the presence of a
> +conditional branch.  Also because the ctrl_dep emits asm volatile within
> +each leg of the if statement, the compiler cannot move the write to 'c'
> +before the conditional branch.

I think your reasoning here misses some things.  So many that I don't
know where to start to list them, every "because" and "thus" here does
not follow, and even the statements of fact are not a given.

Why do you want a conditional branch insn at all, anyway?  You really
want something else as far as I can see.

It is essential here that there is a READ_ONCE and the WRITE_ONCE.
Those things might make it work the way you want, but as Linus says this
is all way too subtle.  Can you include the *_ONCE into the primitive
itself somehow?


Segher

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 19:27   ` Mathieu Desnoyers
@ 2021-09-29 22:14     ` Linus Torvalds
  0 siblings, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2021-09-29 22:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Will Deacon, paulmck, Peter Zijlstra, Segher Boessenkool,
	linux-kernel, Alan Stern, Andrea Parri, Boqun Feng,
	Nicholas Piggin, David Howells, j alglave, luc maranget, akiyks,
	linux-toolchains, linux-arch

On Wed, Sep 29, 2021 at 12:27 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> If we go for only using ctrl_dep() for scenarios which require it for
> documented reasons, then we would need to leave in place all the
> caveats details in Documentation/memory-barriers.txt, and document
> that in those scenarios ctrl_dep() should be used. This would be a
> starting point I guess.

So to me, it's really that starting point  that I feel needs to truly
explain the whole concept.

I'm ok with people adding more cases later (but would still want to
see a comment about exactly why that ctrl_dep() is needed), but the
initial commit is the one that I want to hold up to much higher
standards.

Those higher standards being: "there's an actual bug here" along with
documenting what exactly is going on in that particular case.

Because I do *not* want to introduce this as "ctrl_dep() documents the
control dependency".

If it's _only_ documentation, then a pure comment will do.

So to me, the only reason to actually have a ctrl_dep() macro is that
we have an actual and existing real true bug.

If the only reason for ctrl_dep() is made-up code that doesn't
actually exist, ie

        if (READ_ONCE(x))
                WRITE_ONCE(y,1);
        else
                WRITE_ONCE(y,1);

and the "READ_ONCE()" and "WRITE_ONCE()" being ordered in the face of
made-up examples like this, then ctrl_dep() shouldn't exist.

(The alternative being some "if the compiler can statically know the
direction of the 'if()'" which is I think _equally_ made up, since the
whole point of a control dependency is that it's dynamic, and no
compiler can ever statiaclly determine the direction).

See?

This is why I want to have a real actual live example for that first commit.

If we then in *other* cases add a "ctrl_dep()" for documentation
reasons, and because somebody is unsure about what the "if/else" sides
can contain and wants to make sure they cannot be merged, that's a
separate thing.

But if we can't find a single case where this truly matters and the
particular actual present bug can be shown, then it really makes me go
"is this just all theoretical for purely made up examples that aren't
realistic"?

I mean - just look at the above example of "could be done without the
'if()', and then re-ordered by the hardware". It really isn't very
realistic.

              Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 21:47 ` Segher Boessenkool
@ 2021-09-29 23:57   ` Paul E. McKenney
  2021-10-01 15:28     ` Mathieu Desnoyers
  2021-10-01 19:10     ` Segher Boessenkool
  0 siblings, 2 replies; 35+ messages in thread
From: Paul E. McKenney @ 2021-09-29 23:57 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Mathieu Desnoyers, will, Peter Zijlstra, linux-kernel,
	Linus Torvalds, stern, parri.andrea, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, linux-toolchains,
	linux-arch

On Wed, Sep 29, 2021 at 04:47:03PM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Sep 28, 2021 at 05:15:07PM -0400, Mathieu Desnoyers wrote:
> > C99 describes that accessing volatile objects are side-effects, and that
> > "at certain specified points in the execution sequence called sequence
> > points, all side effects of previous evaluations shall be complete
> > and no side effects of subsequent evaluations shall have taken
> > place". [2]
> 
> But note that the kernel explicitly uses C89 (with GNU extensions).
> Side effects are largely equal there though.
> 
> Also note that there may no place in the generated machine code that
> corresponds exactly to some sequence point.  Sequence points are a
> concept that applies to the source program and how that executes on the
> abstract machine.

Plus the "as if" rule rears its ugly head in many of these situations.

> > +Because ctrl_dep emits distinct asm volatile within each leg of the if
> > +statement, the compiler cannot transform the two writes to 'b' into a
> > +conditional-move (cmov) instruction, thus ensuring the presence of a
> > +conditional branch.  Also because the ctrl_dep emits asm volatile within
> > +each leg of the if statement, the compiler cannot move the write to 'c'
> > +before the conditional branch.
> 
> I think your reasoning here misses some things.  So many that I don't
> know where to start to list them, every "because" and "thus" here does
> not follow, and even the statements of fact are not a given.
> 
> Why do you want a conditional branch insn at all, anyway?  You really
> want something else as far as I can see.

Because at the assembly language level on some architectures, a
conditional branch instruction provides weak but very real and very
useful memory-ordering properties.  Such a branch orders all loads
whose return values feed into the branch condition before any stores
that execute after the branch does (regardless of whether or not the
branch was taken).  And this is all the ordering that is required for
the use cases that Mathieu is worried about.

Yes, you can use explicit memory-barrier or acquire-load instructions,
but those incur more overhead on some types of hardware.  The code in
question is on a hotpath and is thus performance-critical.

It would be nice to be able to somehow tell the compiler exactly
what the ordering constraints are ("this particular load must be
ordered before these particular stores") and then let it (1) figure
out that a conditional branch will do the trick and (2) generate the
code accordingly.  But last I checked, this was not going to happen any
time soon.  So for the time being, we have to live within the current
capability of the tools that are available to us.

Linus points out that in all the actual control-dependent code in
the Linux kernel, the compiler is going to be hard-pressed to fail
to emit the required branch.  (Or in the case of ARMv8, the required
conditional-move instruction.)

Mathieu, for his part, recently read the relevant portions of
memory-barriers.txt (reproduced below) and would like to simplify these
coding guidlines, which, speaking as the author of those guidelines,
would be an extremely good thing.  His patches are attempting to move
us in that direction.

Alternatives include: (1) Using acquire loads or memory barriers
and accepting the loss in performance, but giving the compiler much
less leeway, (2) Ripping all of the two-legged "if" examples from
memory-barriers.txt and restricting control dependencies to else-less
"if" statements, again giving the compiler less leeway, and (3) Your
ideas here.

Does that help, or am I missing your point?

> It is essential here that there is a READ_ONCE and the WRITE_ONCE.
> Those things might make it work the way you want, but as Linus says this
> is all way too subtle.  Can you include the *_ONCE into the primitive
> itself somehow?

Actually, if the store is not involved in a data race, the WRITE_ONCE()
is not needed.  And in that case, the compiler is much less able to
fail to provide the needed ordering.  (No, the current documentation
does not reflect this.)  But if there is a data race, then your point
is right on the mark -- that WRITE_ONCE() cannot be safely omitted.

But you are absolutely right that the READ_ONCE() or equivalent is not
at all optional.  An example of an acceptable equivalent is an atomic
read-modify-write operation such as atomic_xchg_relaxed().

The question about whether the READ_ONCE() and WRITE_ONCE() can be
incorporated into the macro I leave to Mathieu.  I can certainly see
serious benefits from this approach, at least from a compiler viewpoint.
I must reserve judgment on usability until I see a proposal.

							Thanx, Paul

------------------------------------------------------------------------
Relevant excerpt from memory-barriers.txt
------------------------------------------------------------------------

CONTROL DEPENDENCIES
--------------------

Control dependencies can be a bit tricky because current compilers do
not understand them.  The purpose of this section is to help you prevent
the compiler's ignorance from breaking your code.

A load-load control dependency requires a full read memory barrier, not
simply a data dependency barrier to make it work correctly.  Consider the
following bit of code:

	q = READ_ONCE(a);
	if (q) {
		<data dependency barrier>  /* BUG: No data dependency!!! */
		p = READ_ONCE(b);
	}

This will not have the desired effect because there is no actual data
dependency, but rather a control dependency that the CPU may short-circuit
by attempting to predict the outcome in advance, so that other CPUs see
the load from b as having happened before the load from a.  In such a
case what's actually required is:

	q = READ_ONCE(a);
	if (q) {
		<read barrier>
		p = READ_ONCE(b);
	}

However, stores are not speculated.  This means that ordering -is- provided
for load-store control dependencies, as in the following example:

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
	}

Control dependencies pair normally with other types of barriers.
That said, please note that neither READ_ONCE() nor WRITE_ONCE()
are optional! Without the READ_ONCE(), the compiler might combine the
load from 'a' with other loads from 'a'.  Without the WRITE_ONCE(),
the compiler might combine the store to 'b' with other stores to 'b'.
Either can result in highly counterintuitive effects on ordering.

Worse yet, if the compiler is able to prove (say) that the value of
variable 'a' is always non-zero, it would be well within its rights
to optimize the original example by eliminating the "if" statement
as follows:

	q = a;
	b = 1;  /* BUG: Compiler and CPU can both reorder!!! */

So don't leave out the READ_ONCE().

It is tempting to try to enforce ordering on identical stores on both
branches of the "if" statement as follows:

	q = READ_ONCE(a);
	if (q) {
		barrier();
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		barrier();
		WRITE_ONCE(b, 1);
		do_something_else();
	}

Unfortunately, current compilers will transform this as follows at high
optimization levels:

	q = READ_ONCE(a);
	barrier();
	WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
	if (q) {
		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
		do_something();
	} else {
		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
		do_something_else();
	}

Now there is no conditional between the load from 'a' and the store to
'b', which means that the CPU is within its rights to reorder them:
The conditional is absolutely required, and must be present in the
assembly code even after all compiler optimizations have been applied.
Therefore, if you need ordering in this example, you need explicit
memory barriers, for example, smp_store_release():

	q = READ_ONCE(a);
	if (q) {
		smp_store_release(&b, 1);
		do_something();
	} else {
		smp_store_release(&b, 1);
		do_something_else();
	}

In contrast, without explicit memory barriers, two-legged-if control
ordering is guaranteed only when the stores differ, for example:

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}

The initial READ_ONCE() is still required to prevent the compiler from
proving the value of 'a'.

In addition, you need to be careful what you do with the local variable 'q',
otherwise the compiler might be able to guess the value and again remove
the needed conditional.  For example:

	q = READ_ONCE(a);
	if (q % MAX) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}

If MAX is defined to be 1, then the compiler knows that (q % MAX) is
equal to zero, in which case the compiler is within its rights to
transform the above code into the following:

	q = READ_ONCE(a);
	WRITE_ONCE(b, 2);
	do_something_else();

Given this transformation, the CPU is not required to respect the ordering
between the load from variable 'a' and the store to variable 'b'.  It is
tempting to add a barrier(), but this does not help.  The conditional
is gone, and the barrier won't bring it back.  Therefore, if you are
relying on this ordering, you should make sure that MAX is greater than
one, perhaps as follows:

	q = READ_ONCE(a);
	BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
	if (q % MAX) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}

Please note once again that the stores to 'b' differ.  If they were
identical, as noted earlier, the compiler could pull this store outside
of the 'if' statement.

You must also be careful not to rely too much on boolean short-circuit
evaluation.  Consider this example:

	q = READ_ONCE(a);
	if (q || 1 > 0)
		WRITE_ONCE(b, 1);

Because the first condition cannot fault and the second condition is
always true, the compiler can transform this example as following,
defeating control dependency:

	q = READ_ONCE(a);
	WRITE_ONCE(b, 1);

This example underscores the need to ensure that the compiler cannot
out-guess your code.  More generally, although READ_ONCE() does force
the compiler to actually emit code for a given load, it does not force
the compiler to use the results.

In addition, control dependencies apply only to the then-clause and
else-clause of the if-statement in question.  In particular, it does
not necessarily apply to code following the if-statement:

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
	} else {
		WRITE_ONCE(b, 2);
	}
	WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from 'a'. */

It is tempting to argue that there in fact is ordering because the
compiler cannot reorder volatile accesses and also cannot reorder
the writes to 'b' with the condition.  Unfortunately for this line
of reasoning, the compiler might compile the two writes to 'b' as
conditional-move instructions, as in this fanciful pseudo-assembly
language:

	ld r1,a
	cmp r1,$0
	cmov,ne r4,$1
	cmov,eq r4,$2
	st r4,b
	st $1,c

A weakly ordered CPU would have no dependency of any sort between the load
from 'a' and the store to 'c'.  The control dependencies would extend
only to the pair of cmov instructions and the store depending on them.
In short, control dependencies apply only to the stores in the then-clause
and else-clause of the if-statement in question (including functions
invoked by those two clauses), not to code following that if-statement.


Note well that the ordering provided by a control dependency is local
to the CPU containing it.  See the section on "Multicopy atomicity"
for more information.


In summary:

  (*) Control dependencies can order prior loads against later stores.
      However, they do -not- guarantee any other sort of ordering:
      Not prior loads against later loads, nor prior stores against
      later anything.  If you need these other forms of ordering,
      use smp_rmb(), smp_wmb(), or, in the case of prior stores and
      later loads, smp_mb().

  (*) If both legs of the "if" statement begin with identical stores to
      the same variable, then those stores must be ordered, either by
      preceding both of them with smp_mb() or by using smp_store_release()
      to carry out the stores.  Please note that it is -not- sufficient
      to use barrier() at beginning of each leg of the "if" statement
      because, as shown by the example above, optimizing compilers can
      destroy the control dependency while respecting the letter of the
      barrier() law.

  (*) Control dependencies require at least one run-time conditional
      between the prior load and the subsequent store, and this
      conditional must involve the prior load.  If the compiler is able
      to optimize the conditional away, it will have also optimized
      away the ordering.  Careful use of READ_ONCE() and WRITE_ONCE()
      can help to preserve the needed conditional.

  (*) Control dependencies require that the compiler avoid reordering the
      dependency into nonexistence.  Careful use of READ_ONCE() or
      atomic{,64}_read() can help to preserve your control dependency.
      Please see the COMPILER BARRIER section for more information.

  (*) Control dependencies apply only to the then-clause and else-clause
      of the if-statement containing the control dependency, including
      any functions that these two clauses call.  Control dependencies
      do -not- apply to code following the if-statement containing the
      control dependency.

  (*) Control dependencies pair normally with other types of barriers.

  (*) Control dependencies do -not- provide multicopy atomicity.  If you
      need all the CPUs to see a given store at the same time, use smp_mb().

  (*) Compilers do not understand control dependencies.  It is therefore
      your job to ensure that they do not break your code.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 12:28 ` Florian Weimer
  2021-09-29 17:41   ` Segher Boessenkool
@ 2021-09-30 13:28   ` Mathieu Desnoyers
  1 sibling, 0 replies; 35+ messages in thread
From: Mathieu Desnoyers @ 2021-09-30 13:28 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Will Deacon, paulmck, Peter Zijlstra, Segher Boessenkool,
	linux-kernel, Linus Torvalds, Alan Stern, Andrea Parri,
	Boqun Feng, Nicholas Piggin, David Howells, j alglave,
	luc maranget, akiyks, linux-toolchains, linux-arch

----- On Sep 29, 2021, at 8:28 AM, Florian Weimer fweimer@redhat.com wrote:

> * Mathieu Desnoyers:
> 
>> + * will ensure that the STORE to B happens after the LOAD of A. Normally a
>> + * control dependency relies on a conditional branch having a data dependency
>> + * on the LOAD and an architecture's inability to speculate STOREs. IOW, this
>> + * provides a LOAD->STORE order.
>> + *
>> + * Due to optimizing compilers, extra care is needed; as per the example above
>> + * the LOAD must be 'volatile' qualified in order to ensure the compiler
>> + * actually emits the load, such that the data-dependency to the conditional
>> + * branch can be formed.
>> + *
>> + * Secondly, the compiler must be prohibited from lifting anything out of the
>> + * selection statement, as this would obviously also break the ordering.
>> + *
>> + * Thirdly, architectures that allow the LOAD->STORE reorder must ensure
>> + * the compiler actually emits the conditional branch instruction.
> 
> If you need a specific instruction emitted, you need a compiler
> intrinsic or inline assembly.
> 
> So something like this:
> 
> #define control_dep(x)                          \
>  ({                                            \
>    __typeof(x) x__ = (x);                      \
>    __asm__("test $0, %0\n\t"                   \
>            "jnz 1f\n\t"                        \
>            "1:"                                \
>            :: "r"(x__) : "cc");                \
>  })
> 
> with an appropriate instruction sequence for each architecture.
> 
> I don't think it's possible to piggy-back this on something else.

The previous patch set from Peter Zijlstra proposed using asm goto to achieve this,
but it was turned down in part because it prevented the compiler from choosing the
most appropriate instruction for the conditional branch:

https://lore.kernel.org/lkml/YLn8dzbNwvqrqqp5@hirez.programming.kicks-ass.net/

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 23:57   ` Paul E. McKenney
@ 2021-10-01 15:28     ` Mathieu Desnoyers
  2021-10-01 22:53       ` Paul E. McKenney
  2021-10-01 19:10     ` Segher Boessenkool
  1 sibling, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2021-10-01 15:28 UTC (permalink / raw)
  To: paulmck
  Cc: Segher Boessenkool, Will Deacon, Peter Zijlstra, linux-kernel,
	Linus Torvalds, Alan Stern, Andrea Parri, Boqun Feng,
	Nicholas Piggin, David Howells, j alglave, luc maranget, akiyks,
	linux-toolchains, linux-arch

----- On Sep 29, 2021, at 7:57 PM, paulmck paulmck@kernel.org wrote:

> On Wed, Sep 29, 2021 at 04:47:03PM -0500, Segher Boessenkool wrote:
>> Hi!
>> 
>> On Tue, Sep 28, 2021 at 05:15:07PM -0400, Mathieu Desnoyers wrote:
>> > C99 describes that accessing volatile objects are side-effects, and that
>> > "at certain specified points in the execution sequence called sequence
>> > points, all side effects of previous evaluations shall be complete
>> > and no side effects of subsequent evaluations shall have taken
>> > place". [2]
>> 
>> But note that the kernel explicitly uses C89 (with GNU extensions).
>> Side effects are largely equal there though.
>> 
>> Also note that there may no place in the generated machine code that
>> corresponds exactly to some sequence point.  Sequence points are a
>> concept that applies to the source program and how that executes on the
>> abstract machine.
> 
> Plus the "as if" rule rears its ugly head in many of these situations.
> 
>> > +Because ctrl_dep emits distinct asm volatile within each leg of the if
>> > +statement, the compiler cannot transform the two writes to 'b' into a
>> > +conditional-move (cmov) instruction, thus ensuring the presence of a
>> > +conditional branch.  Also because the ctrl_dep emits asm volatile within
>> > +each leg of the if statement, the compiler cannot move the write to 'c'
>> > +before the conditional branch.
>> 
>> I think your reasoning here misses some things.  So many that I don't
>> know where to start to list them, every "because" and "thus" here does
>> not follow, and even the statements of fact are not a given.
>> 
>> Why do you want a conditional branch insn at all, anyway?  You really
>> want something else as far as I can see.
> 
> Because at the assembly language level on some architectures, a
> conditional branch instruction provides weak but very real and very
> useful memory-ordering properties.  Such a branch orders all loads
> whose return values feed into the branch condition before any stores
> that execute after the branch does (regardless of whether or not the
> branch was taken).  And this is all the ordering that is required for
> the use cases that Mathieu is worried about.
> 
> Yes, you can use explicit memory-barrier or acquire-load instructions,
> but those incur more overhead on some types of hardware.  The code in
> question is on a hotpath and is thus performance-critical.
> 
> It would be nice to be able to somehow tell the compiler exactly
> what the ordering constraints are ("this particular load must be
> ordered before these particular stores") and then let it (1) figure
> out that a conditional branch will do the trick and (2) generate the
> code accordingly.  But last I checked, this was not going to happen any
> time soon.  So for the time being, we have to live within the current
> capability of the tools that are available to us.
> 
> Linus points out that in all the actual control-dependent code in
> the Linux kernel, the compiler is going to be hard-pressed to fail
> to emit the required branch.  (Or in the case of ARMv8, the required
> conditional-move instruction.)
> 
> Mathieu, for his part, recently read the relevant portions of
> memory-barriers.txt (reproduced below) and would like to simplify these
> coding guidlines, which, speaking as the author of those guidelines,
> would be an extremely good thing.  His patches are attempting to move
> us in that direction.
> 
> Alternatives include: (1) Using acquire loads or memory barriers
> and accepting the loss in performance, but giving the compiler much
> less leeway, (2) Ripping all of the two-legged "if" examples from
> memory-barriers.txt and restricting control dependencies to else-less
> "if" statements, again giving the compiler less leeway, and (3) Your
> ideas here.
> 
> Does that help, or am I missing your point?

Thanks Paul, it does help explaining the motivation for relying on
control dependencies for some fast-path memory ordering in the kernel.

And yes, my main goal is to simplify the coding guide lines, but I have
not found any example of bad generated code in the tree kernel at this
point. In some cases (e.g. uses of smp_acquire__after_ctrl_dep()) it's
mainly thanks to luck though.

There is another alternative we could list here: implement ctrl_dep_true(),
ctrl_dep_false() and ctrl_dep(), which would respectively ensure a
control dependency on the then leg, on the else leg, or on both legs of
a conditional expression evaluation.

> 
>> It is essential here that there is a READ_ONCE and the WRITE_ONCE.
>> Those things might make it work the way you want, but as Linus says this
>> is all way too subtle.  Can you include the *_ONCE into the primitive
>> itself somehow?
> 
> Actually, if the store is not involved in a data race, the WRITE_ONCE()
> is not needed.  And in that case, the compiler is much less able to
> fail to provide the needed ordering.  (No, the current documentation
> does not reflect this.)  But if there is a data race, then your point
> is right on the mark -- that WRITE_ONCE() cannot be safely omitted.
> 
> But you are absolutely right that the READ_ONCE() or equivalent is not
> at all optional.  An example of an acceptable equivalent is an atomic
> read-modify-write operation such as atomic_xchg_relaxed().
> 
> The question about whether the READ_ONCE() and WRITE_ONCE() can be
> incorporated into the macro I leave to Mathieu.  I can certainly see
> serious benefits from this approach, at least from a compiler viewpoint.
> I must reserve judgment on usability until I see a proposal.

[...]

After having audited thoroughly all obviously documented control dependencies
in the kernel tree, I'm not sure that including the READ_ONCE() and WRITE_ONCE()
with the ctrl_dep() macro is a good idea, because in some cases there is
calculation to be done on the result of the READ_ONCE() (e.g. through
a static inline) before handing it over to the conditional expression.
In other cases many stores are being done after the control dependency, e.g.:

kernel/events/ring_buffer.c:__perf_output_begin()

        do {
                tail = READ_ONCE(rb->user_page->data_tail);
                offset = head = local_read(&rb->head);
                if (!rb->overwrite) {
                        if (unlikely(!ring_buffer_has_space(head, tail,
                                                            perf_data_size(rb),
                                                            size, backward)))
                                goto fail;
                }

                /*
                 * The above forms a control dependency barrier separating the
                 * @tail load above from the data stores below. Since the @tail
                 * load is required to compute the branch to fail below.
                 *
                 * A, matches D; the full memory barrier userspace SHOULD issue
                 * after reading the data and before storing the new tail
                 * position.
                 *
                 * See perf_output_put_handle().
                 */

                if (!backward)
                        head += size;
                else
                        head -= size;
        } while (local_cmpxchg(&rb->head, offset, head) != offset);

        if (backward) {
                offset = head;
                head = (u64)(-head);
        }

        /*
         * We rely on the implied barrier() by local_cmpxchg() to ensure
         * none of the data stores below can be lifted up by the compiler.
         */

[...]

Note that I suspect that this control dependency documentation could be
improved to state that the first control dependency is with the following
local_cmpxchg store, which itself has a control dependency (when it evaluates
to false) with the following stores to the ring buffer. Those are not volatile
stores, but the "memory" clobber with the local_cmpxchg should ensure that
following stores are after the local_cmpxchg in program order.

One other thing we could do to improve things slightly would be to turn
smp_acquire__after_ctrl_dep() into something which really is an acquire
in all cases, which may not currently be true if the compiler finds a
matching barrier()/smp_rmb() in the other leg after the conditional
expression.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 12:06 ` Marco Elver
@ 2021-10-01 15:45   ` Mathieu Desnoyers
  2021-10-01 16:20     ` Linus Torvalds
  0 siblings, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2021-10-01 15:45 UTC (permalink / raw)
  To: Marco Elver
  Cc: Will Deacon, paulmck, Peter Zijlstra, Segher Boessenkool,
	linux-kernel, Linus Torvalds, Alan Stern, Andrea Parri,
	Boqun Feng, Nicholas Piggin, David Howells, j alglave,
	luc maranget, akiyks, linux-toolchains, linux-arch

----- On Sep 29, 2021, at 8:06 AM, Marco Elver elver@google.com wrote:

> On Tue, Sep 28, 2021 at 05:15PM -0400, Mathieu Desnoyers wrote:
>> The control dependency ordering currently documented in
>> Documentation/memory-barriers.txt is fragile and can be broken by
>> various compiler optimizations.
>> 
>> The goal here is to prevent the compiler from being able to optimize a
>> conditional branch into something which lacks the control dependency,
>> while letting the compiler choose the best conditional branch in each
>> case.
>> 
>> Prevent the compiler from considering the two legs of a conditional
>> branch as identical by adding a distinct volatile asm in each leg of the
>> branch. Those asm do not emit any instruction nor data into the
>> resulting executable, and do not have any clobbers.
>> 
>> GNU describes asm volatile statements as having side-effects. [1]
>> 
>> C99 describes that accessing volatile objects are side-effects, and that
>> "at certain specified points in the execution sequence called sequence
>> points, all side effects of previous evaluations shall be complete
>> and no side effects of subsequent evaluations shall have taken
>> place". [2]
>> 
>> This ensures that the program order of READ_ONCE(), asm volatile in both
>> legs of the branch, and following WRITE_ONCE() and after_ctrl_dep()
>> barriers are preserved.
>> 
>> With this approach, the following code now keeps the control dependency:
>> 
>>         z = READ_ONCE(var1);
>>         if (ctrl_dep(z))
>>                 WRITE_ONCE(var2, 5);
>>         else
>>                 WRITE_ONCE(var2, 5);
>> 
>> And the ctrl_dep_eval() checking the constant triggers a build error
>> for:
>> 
>>         y = READ_ONCE(var1);
>>         if (ctrl_dep(y % 1))
>>                 WRITE_ONCE(var2, 5);
>>         else
>>                 WRITE_ONCE(var2, 6);
>> 
>> Which is good to have to ensure the compiler don't end up removing the
>> conditional branch because the it evaluates a constant.
>> 
>> Introduce the ctrl_dep macro in the generic headers, and use it
>> everywhere it appears relevant.  The approach taken is simply to
>> look for smp_acquire__after_ctrl_dep and "control dependency" across the
>> kernel sources, so a few other uses may have been missed.
> 
> It would be nice to know where and on which arch things are currently
> broken of course, which might then also help raise confidence that this
> implementation of ctrl_dep() works.
> 
> Because it's still hard to prove that the compiler will always do the
> right thing with that implementation. The only concrete option I see
> here is creating tests with known or potential breakage.
> 
> In an ideal world we could add such tests to the compiler's test-suites
> themselves, assuming the behaviour your ctrl_dep() implementation relies
> on is supposed to be guaranteed (and the compiler folks agree..).

Indeed, if we end up agreeing on the need for a compiler builtin, it should
be added to the compiler test-suites with the known problematic scenarios
for each architecture.

> 
> Beyond the above trivial test case with 2 identical branches, here's
> another one that breaks on arm64 with clang 12 (taken from
> https://reviews.llvm.org/D103958):
> 
> | int x, y;
> | void noinline test_ctrl_dep_broken1(void)
> | {
> | 	/* ARM: do NOT expect: cinc | expect: cbz */
> | 	if (ctrl_dep(READ_ONCE(x))) {
> | 		y = 1;
> | 	} else {
> | 		y = 2;
> | 	}
> | }
> 
> Without ctrl_dep():
> 
> | <test_ctrl_dep_broken1>:
> |        d00042a8        adrp    x8, ffffffc010868000 <initcall_debug>
> |        b9400508        ldr     w8, [x8, #4]
> |        52800029        mov     w9, #0x1                        // #1
> |        7100011f        cmp     w8, #0x0
> |        1a891528        cinc    w8, w9, eq  // eq = none
> |        d00042a9        adrp    x9, ffffffc010868000 <initcall_debug>
> |        b9000928        str     w8, [x9, #8]
> |        d65f03c0        ret
> 
>			^^ no branch, compiler replaced branch with cinc!
> 
> with ctrl_dep():
> 
> | <test_ctrl_dep_broken1>:
> |        d00042a8        adrp    x8, ffffffc010868000 <initcall_debug>
> |        b9400508        ldr     w8, [x8, #4]
> |        34000068        cbz     w8, ffffffc0100124b4 <test_ctrl_dep_broken1+0x14>
> |        52800028        mov     w8, #0x1                        // #1
> |        14000002        b       ffffffc0100124b8 <test_ctrl_dep_broken1+0x18>
> |        52800048        mov     w8, #0x2                        // #2
> |        d00042a9        adrp    x9, ffffffc010868000 <initcall_debug>
> |        b9000928        str     w8, [x9, #8]
> |        d65f03c0        ret
> 
>			^^ has cbz (and no cinc)
> 
> Which is good -- empirically, this seems to work for this case at least.

Well AFAIU, this example with cinc does guarantee the control dependency for the store
to "y". The issue arises if we have additional stores which are also expected to be
ordered by the control dependency, e.g.:

 	if (READ_ONCE(x)) {
 		WRITE_ONCE(y, 1);
 	} else {
 		WRITE_ONCE(y, 2);
 	}
        WRITE_ONCE(z, 3);

Here the store to "z" would not necessarily be ordered by the control dependency.

Likewise with clang if we store the same value to different memory locations, e.g.:

 	if (READ_ONCE(x)) {
 		WRITE_ONCE(a, 0);
 	} else {
 		WRITE_ONCE(b, 0);
 	}
        WRITE_ONCE(z, 3);

With armv8, the csel instruction is done on the address being written to, which also
removes the conditional branch. I think this last example is missing from the kernel
documentation.

Another case which should perhaps be documented is the ability of the compiler to
match assembler, e.g.:

if (READ_ONCE(x)) {
   smp_acquire__after_ctrl_dep();
   WRITE_ONCE(a, 0);
} else {
   smp_rmb();
   WRITE_ONCE(b, 0);
}
WRITE_ONCE(z, 3);

In the example above, the compiler can match and lift the inline asm, and use
csel to select the address to write to, thus causing the store to z to lack
the control dependency with load x.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 17:41   ` Segher Boessenkool
  2021-09-29 19:46     ` Florian Weimer
@ 2021-10-01 16:13     ` Mathieu Desnoyers
  2021-10-01 16:26       ` Florian Weimer
  1 sibling, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2021-10-01 16:13 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Florian Weimer, Will Deacon, paulmck, Peter Zijlstra,
	linux-kernel, Linus Torvalds, Alan Stern, Andrea Parri,
	Boqun Feng, Nicholas Piggin, David Howells, j alglave,
	luc maranget, akiyks, linux-toolchains, linux-arch

----- On Sep 29, 2021, at 1:41 PM, Segher Boessenkool segher@kernel.crashing.org wrote:

> Hi!
> 
> On Wed, Sep 29, 2021 at 02:28:37PM +0200, Florian Weimer wrote:
>> If you need a specific instruction emitted, you need a compiler
>> intrinsic or inline assembly.
> 
> Not an intrinsic.  Builtins (like almost all other code) do not say
> "generate this particular machine code", they say "generate code that
> does <this>".  That is one reason why builtins are more powerful than
> inline assembler (another related reason is that they tell the compiler
> exactly what behaviour is expected).
> 
>> I don't think it's possible to piggy-back this on something else.
> 
> Unless we get a description of what this does in term of language
> semantics (instead of generated machine code), there is no hope, even.

Hi Segher,

Let me try a slightly improved attempt at describing what I am looking
for in terms of language semantics.

First, let's suppose we define two new compiler builtins, e.g.
__sync_ctrl_dep_rw() and __sync_ctrl_dep_acquire().

Their task would be to ensure that a R->W or R->RW (acquire) dependency between the
volatile loads used as input of the evaluated expression and following volatile
stores, volatile loads for R->RW, volatile asm, memory clobbers, is present in the
following situations:

When the builtin is used around evaluation of the left operand of the && (logical
AND) and || (logical OR) expression, the R->W or R->RW dependency should be
present before evaluating the right operand.

When the builtin is used around evaluation of the first operand of the ternary
"question-mark" operator, the R->W or R->RW dependency should be present before
evaluating the second or third operands.

When the builtin is used around evaluation of the controlling expressions of
if, switch, while, and do-while statements, as well as of the second operand of
the for statement, the R->W or R->RW dependency should be present before the
next sequence point is evaluated.

One cheap way to achieve said R->W dependency (as well as R->RW on architectures which
to not reorder R->R) is to ensure that the generated assembly contains a conditional
branch. Other ways to ensure this include more heavy-weight approaches such as explicit
barriers.

Hopefully my description above is slightly closer to the expected language
semantics.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-01 15:45   ` Mathieu Desnoyers
@ 2021-10-01 16:20     ` Linus Torvalds
  2021-10-01 17:28       ` Mathieu Desnoyers
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2021-10-01 16:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Marco Elver, Will Deacon, paulmck, Peter Zijlstra,
	Segher Boessenkool, linux-kernel, Alan Stern, Andrea Parri,
	Boqun Feng, Nicholas Piggin, David Howells, j alglave,
	luc maranget, akiyks, linux-toolchains, linux-arch

On Fri, Oct 1, 2021 at 8:45 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Well AFAIU, this example with cinc does guarantee the control dependency for the store
> to "y". The issue arises if we have additional stores which are also expected to be
> ordered by the control dependency, e.g.:
>
>         if (READ_ONCE(x)) {
>                 WRITE_ONCE(y, 1);
>         } else {
>                 WRITE_ONCE(y, 2);
>         }
>         WRITE_ONCE(z, 3);
>
> Here the store to "z" would not necessarily be ordered by the control dependency.

Actually, it is ordered as far as the *compiler* is concerned.

It's just that the two writes to 'y' might become a data dependency
(ie using a cmov or arithmetic tricks like 'adc'), then the hardware
might end up considering the write to 'z' to not have any dependencies
and be done early.

> Likewise with clang if we store the same value to different memory locations, e.g.:
>
>         if (READ_ONCE(x)) {
>                 WRITE_ONCE(a, 0);
>         } else {
>                 WRITE_ONCE(b, 0);
>         }
>         WRITE_ONCE(z, 3);
>
> With armv8, the csel instruction is done on the address being written to, which also
> removes the conditional branch. I think this last example is missing from the kernel
> documentation.

Note that the important part isn't necessarily the "control" part of
the dependency.

A *data* dependency is equally strong and valid, and orders the write
wrt the read too.

IOW, this chain is ordered:

     WRITE_ONCE(a, READ_ONCE(b));

without any control dependencies at all. The CPU fundamentally cannot
do the write before it has done the read.

So turning a control dependency into a data dependency DOES NOT remove
the ordering. It's fine. And it doesn't matter whether the data
dependency is on the actual stored data, or on the stored address. In
both cases it's a dependency, and the store cannot be done before the
load.

(NOTE! The CPU _internally_ might speculate the store address or store
data, and thus "do the store first". But it cannot become _visible_ to
anybody else before the speculation has been validated, so from a
memory ordering standpoint, the load always happens first - even if
the CPU internally might have done parts of the store before. All that
matters is the _effective_ memory ordering visible externally, not the
order in which the CPU did things).

Of course, the issue with a data dependency is that it's then "local
to that data". The example above with the write to 'z' is probably a
good example. If the "if ()" statement ends up visible to the CPU as
control flow, then the READ_ONCE(x) is ordered wrt the WRITE_ONCE(z).

But if the conditional WRITE_ONCE(a/b) ends up being done as a data
dependency on the address (or the store data), then the WRITE_ONCE(z)
is ordered in the instruction stream (because those are the C volatile
semantics), but could be visible out of order thanks to CPU memory
ordering.

But again - a lot of these made-up examples are exactly that: made up.
For us to have a ctrl_dep() macro, I really want to see an actual
honest-to-goodness case of this that we can point to.

               Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-01 16:13     ` Mathieu Desnoyers
@ 2021-10-01 16:26       ` Florian Weimer
  2021-10-01 16:35         ` Linus Torvalds
  0 siblings, 1 reply; 35+ messages in thread
From: Florian Weimer @ 2021-10-01 16:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Segher Boessenkool, Will Deacon, paulmck, Peter Zijlstra,
	linux-kernel, Linus Torvalds, Alan Stern, Andrea Parri,
	Boqun Feng, Nicholas Piggin, David Howells, j alglave,
	luc maranget, akiyks, linux-toolchains, linux-arch

* Mathieu Desnoyers:

> One cheap way to achieve said R->W dependency (as well as R->RW on
> architectures which to not reorder R->R) is to ensure that the
> generated assembly contains a conditional branch.

Will any conditional branch do, or is it necessary that it depends in
some way on the data read?

Thanks,
Florian


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-01 16:26       ` Florian Weimer
@ 2021-10-01 16:35         ` Linus Torvalds
  2021-10-10 14:02           ` Florian Weimer
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2021-10-01 16:35 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Mathieu Desnoyers, Segher Boessenkool, Will Deacon, paulmck,
	Peter Zijlstra, linux-kernel, Alan Stern, Andrea Parri,
	Boqun Feng, Nicholas Piggin, David Howells, j alglave,
	luc maranget, akiyks, linux-toolchains, linux-arch

On Fri, Oct 1, 2021 at 9:26 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> Will any conditional branch do, or is it necessary that it depends in
> some way on the data read?

The condition needs to be dependent on the read.

(Easy way to see it: if the read isn't related to the conditional or
write data/address, the read could just be delayed to after the
condition and the store had been done).

               Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-01 16:20     ` Linus Torvalds
@ 2021-10-01 17:28       ` Mathieu Desnoyers
  2021-10-01 18:18         ` Linus Torvalds
  0 siblings, 1 reply; 35+ messages in thread
From: Mathieu Desnoyers @ 2021-10-01 17:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marco Elver, Will Deacon, paulmck, Peter Zijlstra,
	Segher Boessenkool, linux-kernel, Alan Stern, Andrea Parri,
	Boqun Feng, Nicholas Piggin, David Howells, j alglave,
	luc maranget, akiyks, linux-toolchains, linux-arch

----- On Oct 1, 2021, at 12:20 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
[...]
> But again - a lot of these made-up examples are exactly that: made up.
> For us to have a ctrl_dep() macro, I really want to see an actual
> honest-to-goodness case of this that we can point to.

I've spent some quality time staring at generated assembler diff in the past
days, and looking for code patterns of refcount_dec_and_test users, without
much success. There are some cases which end up working by chance, e.g. in
cases where the if leg has a smp_acquire__after_ctrl_dep and the else leg has
code that emits a barrier(), but I did not find any buggy generated
code per se. In order to observe those issues in real life, we would
really need to have identical then/else legs to the branch.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-01 17:28       ` Mathieu Desnoyers
@ 2021-10-01 18:18         ` Linus Torvalds
  0 siblings, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2021-10-01 18:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Marco Elver, Will Deacon, paulmck, Peter Zijlstra,
	Segher Boessenkool, linux-kernel, Alan Stern, Andrea Parri,
	Boqun Feng, Nicholas Piggin, David Howells, j alglave,
	luc maranget, akiyks, linux-toolchains, linux-arch

On Fri, Oct 1, 2021 at 10:28 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> I've spent some quality time staring at generated assembler diff in the past
> days, and looking for code patterns of refcount_dec_and_test users, without
> much success. There are some cases which end up working by chance, e.g. in
> cases where the if leg has a smp_acquire__after_ctrl_dep and the else leg has
> code that emits a barrier(), but I did not find any buggy generated
> code per se. In order to observe those issues in real life, we would
> really need to have identical then/else legs to the branch.

Yeah, that's been very much my feeling too during this whole
discussion (including, very much, earlier threads).

All the examples about this being a problem are those kinds of
"identical or near-identical if/else statements" and they just don't
seem to be all that realistic.

Because immediately when the if-statement actually does something
_meaningful_, it just turns into a branch. And when people use atomics
- even the weak READ/WRITE_ONCE() kinds of things, never mind anything
stronger - it really doesn't give the compiler the option to move
things around all that much.

There's a reason the source code uses an if-statement, after all: that
is literally the logical code flow, and people write a very particular
dependency chain that is just very fundamental.

Having essentially the same thing on both sides just isn't a realistic
thing to do, and if it were - and you cared about performance in that
case, which is what this is all about, after all - you'd write it very
differently.

                Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-09-29 23:57   ` Paul E. McKenney
  2021-10-01 15:28     ` Mathieu Desnoyers
@ 2021-10-01 19:10     ` Segher Boessenkool
  2021-10-01 22:50       ` Paul E. McKenney
  2021-10-02 14:29       ` Alan Stern
  1 sibling, 2 replies; 35+ messages in thread
From: Segher Boessenkool @ 2021-10-01 19:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mathieu Desnoyers, will, Peter Zijlstra, linux-kernel,
	Linus Torvalds, stern, parri.andrea, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, linux-toolchains,
	linux-arch

Hi Paul,

On Wed, Sep 29, 2021 at 04:57:00PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 29, 2021 at 04:47:03PM -0500, Segher Boessenkool wrote:
> > On Tue, Sep 28, 2021 at 05:15:07PM -0400, Mathieu Desnoyers wrote:
> > > C99 describes that accessing volatile objects are side-effects, and that
> > > "at certain specified points in the execution sequence called sequence
> > > points, all side effects of previous evaluations shall be complete
> > > and no side effects of subsequent evaluations shall have taken
> > > place". [2]
> > 
> > But note that the kernel explicitly uses C89 (with GNU extensions).
> > Side effects are largely equal there though.
> > 
> > Also note that there may no place in the generated machine code that
> > corresponds exactly to some sequence point.  Sequence points are a
> > concept that applies to the source program and how that executes on the
> > abstract machine.
> 
> Plus the "as if" rule rears its ugly head in many of these situations.

Do you mean 5.1.2.3 (especially /4 and /6) here, or something more?

People are easily fooled into thinking that your C code somehow
corresponds directly to the generated machine code.  Partly that is
because very old compilers did work that way (they did for every
language, not just C, certainly for all imperative languages -- but
people do think C is a one-to-one translation, more than for other
languages).

> > > +Because ctrl_dep emits distinct asm volatile within each leg of the if
> > > +statement, the compiler cannot transform the two writes to 'b' into a
> > > +conditional-move (cmov) instruction, thus ensuring the presence of a
> > > +conditional branch.  Also because the ctrl_dep emits asm volatile within
> > > +each leg of the if statement, the compiler cannot move the write to 'c'
> > > +before the conditional branch.
> > 
> > I think your reasoning here misses some things.  So many that I don't
> > know where to start to list them, every "because" and "thus" here does
> > not follow, and even the statements of fact are not a given.
> > 
> > Why do you want a conditional branch insn at all, anyway?  You really
> > want something else as far as I can see.
> 
> Because at the assembly language level on some architectures, a
> conditional branch instruction provides weak but very real and very
> useful memory-ordering properties.

On some archs, yes.  So what you really want is that effect, not a
conditional branch itself, right?

> Such a branch orders all loads
> whose return values feed into the branch condition before any stores
> that execute after the branch does (regardless of whether or not the
> branch was taken).  And this is all the ordering that is required for
> the use cases that Mathieu is worried about.

Okay.  So you want to order some set of loads A before some stores B,
with a conditional expression C, where A dominates C, and C dominates B.

Is that exactly what is wanted here?  (You can also take B to be all
stores dominated by C, which may be the only case we care about).

> Yes, you can use explicit memory-barrier or acquire-load instructions,
> but those incur more overhead on some types of hardware.  The code in
> question is on a hotpath and is thus performance-critical.

Yes.

> It would be nice to be able to somehow tell the compiler exactly
> what the ordering constraints are ("this particular load must be
> ordered before these particular stores") and then let it (1) figure
> out that a conditional branch will do the trick and (2) generate the
> code accordingly.  But last I checked, this was not going to happen any
> time soon.  So for the time being, we have to live within the current
> capability of the tools that are available to us.

But a conditional branch is not enough for all architectures (most
architectures even!), and for all implementations of the architectures.

> Linus points out that in all the actual control-dependent code in
> the Linux kernel, the compiler is going to be hard-pressed to fail
> to emit the required branch.  (Or in the case of ARMv8, the required
> conditional-move instruction.)

Yes.  But as a compiler person I read all "the compiler is hard-pressed
to do X" as "the compiler will certainly do X in some cases, maybe once
in a million only, but still; and the compiler is perfectly free to do
so, and that may even be a good decision as well".

Wishy-washy code is fine in some places, if you cannot do better, and
you can verify the machine code generated executes as wanted.  This is
not really feasible for a more generic building block that is used in
many disparate places.

> Mathieu, for his part, recently read the relevant portions of
> memory-barriers.txt (reproduced below) and would like to simplify these
> coding guidlines, which, speaking as the author of those guidelines,
> would be an extremely good thing.  His patches are attempting to move
> us in that direction.

Yes.

In general, rules and guidelines should make it easy to use some
feature, and hard to make mistakes in how you use it.  Ideally the
design lf the feature gets you 90% there already.

> Alternatives include: (1) Using acquire loads or memory barriers
> and accepting the loss in performance, but giving the compiler much
> less leeway,

Yes, that is very expensive.  So expensive that it isn't an option for
any case where you would think about using a control dependency, imo.

> (2) Ripping all of the two-legged "if" examples from
> memory-barriers.txt and restricting control dependencies to else-less
> "if" statements, again giving the compiler less leeway, and (3) Your
> ideas here.

In my opinion we should not define this barrier in terms of "if" (and/or
"else") at all.

> Does that help, or am I missing your point?

See just above.

Your comments help reduce impedance mismatch, thanks :-)

> > It is essential here that there is a READ_ONCE and the WRITE_ONCE.
> > Those things might make it work the way you want, but as Linus says this
> > is all way too subtle.  Can you include the *_ONCE into the primitive
> > itself somehow?
> 
> Actually, if the store is not involved in a data race, the WRITE_ONCE()
> is not needed.

True.

> And in that case, the compiler is much less able to
> fail to provide the needed ordering.

GCC will always do stores only where the source code said it should.
This is needed to make anything asynchronous (threads, signals) work.

> (No, the current documentation
> does not reflect this.)  But if there is a data race, then your point
> is right on the mark -- that WRITE_ONCE() cannot be safely omitted.

But the store you care about might not be right there, it may be in some
called function for example, so haing the WRITE_ONCE as part of the
macro is maybe not a good idea.

> But you are absolutely right that the READ_ONCE() or equivalent is not
> at all optional.  An example of an acceptable equivalent is an atomic
> read-modify-write operation such as atomic_xchg_relaxed().
> 
> The question about whether the READ_ONCE() and WRITE_ONCE() can be
> incorporated into the macro I leave to Mathieu.  I can certainly see
> serious benefits from this approach, at least from a compiler viewpoint.
> I must reserve judgment on usability until I see a proposal.

Yeah.  And easy of use (and that means *correct* use) is key.

> However, stores are not speculated.  This means that ordering -is- provided
> for load-store control dependencies, as in the following example:
> 
> 	q = READ_ONCE(a);
> 	if (q) {
> 		WRITE_ONCE(b, 1);
> 	}

Yes, and you do not need a conditional branch (in the generated machine
code) to have such a dependency -- if the READ_ONCE itself is part of
the condition (it is a read of a volatile object after all).

> Control dependencies pair normally with other types of barriers.
> That said, please note that neither READ_ONCE() nor WRITE_ONCE()
> are optional! Without the READ_ONCE(), the compiler might combine the
> load from 'a' with other loads from 'a'.  Without the WRITE_ONCE(),
> the compiler might combine the store to 'b' with other stores to 'b'.

This part only shows some things that could go wrong.  It also assumes
things that aren't true in general, like, the compiler cannot combine
a WRITE_ONCE with another store.

> Either can result in highly counterintuitive effects on ordering.

Yes, and part of that is that the expectations are wrong, are based on
fundamental untruths.

> Worse yet, if the compiler is able to prove (say) that the value of
> variable 'a' is always non-zero, it would be well within its rights
> to optimize the original example by eliminating the "if" statement
> as follows:
> 
> 	q = a;
> 	b = 1;  /* BUG: Compiler and CPU can both reorder!!! */
> 
> So don't leave out the READ_ONCE().

If you do have the READ_ONCE, the compiler has to do the read -- but if
the compiler still knows "a" contains a non-zero value (yes, this can
still happen, if a zero would lead to undefined behaviour for example),
you do not get the dependency between the read and write you may want.

If you *do* want that dependency.  Often that WRITE_ONCE is all you want
in such a case.

[ snip a lot ]

> More generally, although READ_ONCE() does force
> the compiler to actually emit code for a given load, it does not force
> the compiler to use the results.

Exactly.  And it does tell the compiler the result could be anything
(because it is a read of volatile memory), but other things in the
program, or caused by compiler transforms, can change this.

If your program says

	X;

somewhere, the compiler is perfectly within its right to change that to

	if (q)
		X;
	else
		X;

and then optimise the two occurences of X separately.  Now if X contains
some conditional based on "q" itself (which btw makes such a compiler
transform more likely!), things you may not want or expect can happen.

Lastly:

>   (*) Compilers do not understand control dependencies.  It is therefore
>       your job to ensure that they do not break your code.

Compilers understand you want exactly what you wrote.  If you write
something other than what you want, you only will get what you want by
pure luck.


Segher

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-01 19:10     ` Segher Boessenkool
@ 2021-10-01 22:50       ` Paul E. McKenney
  2021-10-02 14:29       ` Alan Stern
  1 sibling, 0 replies; 35+ messages in thread
From: Paul E. McKenney @ 2021-10-01 22:50 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Mathieu Desnoyers, will, Peter Zijlstra, linux-kernel,
	Linus Torvalds, stern, parri.andrea, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, linux-toolchains,
	linux-arch

On Fri, Oct 01, 2021 at 02:10:08PM -0500, Segher Boessenkool wrote:
> Hi Paul,
> 
> On Wed, Sep 29, 2021 at 04:57:00PM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 29, 2021 at 04:47:03PM -0500, Segher Boessenkool wrote:
> > > On Tue, Sep 28, 2021 at 05:15:07PM -0400, Mathieu Desnoyers wrote:
> > > > C99 describes that accessing volatile objects are side-effects, and that
> > > > "at certain specified points in the execution sequence called sequence
> > > > points, all side effects of previous evaluations shall be complete
> > > > and no side effects of subsequent evaluations shall have taken
> > > > place". [2]
> > > 
> > > But note that the kernel explicitly uses C89 (with GNU extensions).
> > > Side effects are largely equal there though.
> > > 
> > > Also note that there may no place in the generated machine code that
> > > corresponds exactly to some sequence point.  Sequence points are a
> > > concept that applies to the source program and how that executes on the
> > > abstract machine.
> > 
> > Plus the "as if" rule rears its ugly head in many of these situations.
> 
> Do you mean 5.1.2.3 (especially /4 and /6) here, or something more?

Oddly enough, my C standard was open to exactly these paragraphs.  ;-)

I won't claim that /4 and /6 cover everything that I am worried about
(I do not have the standard committed to memory), but those are big ones.

> People are easily fooled into thinking that your C code somehow
> corresponds directly to the generated machine code.  Partly that is
> because very old compilers did work that way (they did for every
> language, not just C, certainly for all imperative languages -- but
> people do think C is a one-to-one translation, more than for other
> languages).

I remember those very old days well, and lost my C-assembler innocence
in the early 1990s.  ;-)

> > > > +Because ctrl_dep emits distinct asm volatile within each leg of the if
> > > > +statement, the compiler cannot transform the two writes to 'b' into a
> > > > +conditional-move (cmov) instruction, thus ensuring the presence of a
> > > > +conditional branch.  Also because the ctrl_dep emits asm volatile within
> > > > +each leg of the if statement, the compiler cannot move the write to 'c'
> > > > +before the conditional branch.
> > > 
> > > I think your reasoning here misses some things.  So many that I don't
> > > know where to start to list them, every "because" and "thus" here does
> > > not follow, and even the statements of fact are not a given.
> > > 
> > > Why do you want a conditional branch insn at all, anyway?  You really
> > > want something else as far as I can see.
> > 
> > Because at the assembly language level on some architectures, a
> > conditional branch instruction provides weak but very real and very
> > useful memory-ordering properties.
> 
> On some archs, yes.  So what you really want is that effect, not a
> conditional branch itself, right?

Yes.

> > Such a branch orders all loads
> > whose return values feed into the branch condition before any stores
> > that execute after the branch does (regardless of whether or not the
> > branch was taken).  And this is all the ordering that is required for
> > the use cases that Mathieu is worried about.
> 
> Okay.  So you want to order some set of loads A before some stores B,
> with a conditional expression C, where A dominates C, and C dominates B.
> 
> Is that exactly what is wanted here?  (You can also take B to be all
> stores dominated by C, which may be the only case we care about).

If I understand your nomenclature, pretty much, but I suspect that
"dominates" does not necessarily say "the values from loads in A are
used to compute C".  Yes, the compiler can kick out a memory-barrier
instruction to make things work otherwise, but our performance-sensitive
developer would want to know about that performance bug.

> > Yes, you can use explicit memory-barrier or acquire-load instructions,
> > but those incur more overhead on some types of hardware.  The code in
> > question is on a hotpath and is thus performance-critical.
> 
> Yes.
> 
> > It would be nice to be able to somehow tell the compiler exactly
> > what the ordering constraints are ("this particular load must be
> > ordered before these particular stores") and then let it (1) figure
> > out that a conditional branch will do the trick and (2) generate the
> > code accordingly.  But last I checked, this was not going to happen any
> > time soon.  So for the time being, we have to live within the current
> > capability of the tools that are available to us.
> 
> But a conditional branch is not enough for all architectures (most
> architectures even!), and for all implementations of the architectures.

For the architectures that the Linux kernel supports, it does suffice.
For at-least-TSO architectures, load-store ordering suffices.  ARM,
PowerPC, Alpha, and Itanium respect control dependencies, that is,
conditional branches act as lightweight memory-barrier instructions on
those architectures.

If a weaker-than-TSO architecture appeared and wanted to run Linux, it
would need to provide stronger ordering, either with its own version of
something like the volatile_if() macro or with some compiler feature
knowing that an explicit memory-barrier instruction was required in
that case.

Or am I missing your point?

> > Linus points out that in all the actual control-dependent code in
> > the Linux kernel, the compiler is going to be hard-pressed to fail
> > to emit the required branch.  (Or in the case of ARMv8, the required
> > conditional-move instruction.)
> 
> Yes.  But as a compiler person I read all "the compiler is hard-pressed
> to do X" as "the compiler will certainly do X in some cases, maybe once
> in a million only, but still; and the compiler is perfectly free to do
> so, and that may even be a good decision as well".
> 
> Wishy-washy code is fine in some places, if you cannot do better, and
> you can verify the machine code generated executes as wanted.  This is
> not really feasible for a more generic building block that is used in
> many disparate places.

Agreed, which is why Mathieu would like to come up with something better.

And for my part, I wrote the control-dependencies section of the infamous
memory-barriers.txt document to discourage their use.

> > Mathieu, for his part, recently read the relevant portions of
> > memory-barriers.txt (reproduced below) and would like to simplify these
> > coding guidlines, which, speaking as the author of those guidelines,
> > would be an extremely good thing.  His patches are attempting to move
> > us in that direction.
> 
> Yes.
> 
> In general, rules and guidelines should make it easy to use some
> feature, and hard to make mistakes in how you use it.  Ideally the
> design lf the feature gets you 90% there already.

Agreed.  And the kernel does have to play fast and loose with the
compiler in a number of places, and likely always will.  (Context
switches, anyone?)

> > Alternatives include: (1) Using acquire loads or memory barriers
> > and accepting the loss in performance, but giving the compiler much
> > less leeway,
> 
> Yes, that is very expensive.  So expensive that it isn't an option for
> any case where you would think about using a control dependency, imo.

Agreed, and thus there are a few places in the Linux kernel that do
use control dependencies, despite all the scary words in memory-barriers.txt.

> > (2) Ripping all of the two-legged "if" examples from
> > memory-barriers.txt and restricting control dependencies to else-less
> > "if" statements, again giving the compiler less leeway, and (3) Your
> > ideas here.
> 
> In my opinion we should not define this barrier in terms of "if" (and/or
> "else") at all.

It certainly would be nice to be living in a world where we didn't
have to.  ;-)

> > Does that help, or am I missing your point?
> 
> See just above.
> 
> Your comments help reduce impedance mismatch, thanks :-)

Glad it helped!  ;-)

> > > It is essential here that there is a READ_ONCE and the WRITE_ONCE.
> > > Those things might make it work the way you want, but as Linus says this
> > > is all way too subtle.  Can you include the *_ONCE into the primitive
> > > itself somehow?
> > 
> > Actually, if the store is not involved in a data race, the WRITE_ONCE()
> > is not needed.
> 
> True.
> 
> > And in that case, the compiler is much less able to
> > fail to provide the needed ordering.
> 
> GCC will always do stores only where the source code said it should.
> This is needed to make anything asynchronous (threads, signals) work.

Good to know!  I have been worried about the compiler using a variable
as temporary storage just before that variable is stored to.  :-/

> > (No, the current documentation
> > does not reflect this.)  But if there is a data race, then your point
> > is right on the mark -- that WRITE_ONCE() cannot be safely omitted.
> 
> But the store you care about might not be right there, it may be in some
> called function for example, so haing the WRITE_ONCE as part of the
> macro is maybe not a good idea.

The Linux kernel does have address and data dependencies that span many
functions and even multiple translation units.  But I believe that all
of the control dependencies are local.  That said, I get your point --
it would be annoying to have a compiler memory-ordering feature that
worked only within the confines of a single function.

Which was why I have been pessimistic about being able to specify
all this to the compiler.

> > But you are absolutely right that the READ_ONCE() or equivalent is not
> > at all optional.  An example of an acceptable equivalent is an atomic
> > read-modify-write operation such as atomic_xchg_relaxed().
> > 
> > The question about whether the READ_ONCE() and WRITE_ONCE() can be
> > incorporated into the macro I leave to Mathieu.  I can certainly see
> > serious benefits from this approach, at least from a compiler viewpoint.
> > I must reserve judgment on usability until I see a proposal.
> 
> Yeah.  And easy of use (and that means *correct* use) is key.

Agreed.

> > However, stores are not speculated.  This means that ordering -is- provided
> > for load-store control dependencies, as in the following example:
> > 
> > 	q = READ_ONCE(a);
> > 	if (q) {
> > 		WRITE_ONCE(b, 1);
> > 	}
> 
> Yes, and you do not need a conditional branch (in the generated machine
> code) to have such a dependency -- if the READ_ONCE itself is part of
> the condition (it is a read of a volatile object after all).

Agreed, for example, ARMv8 carries control dependencies through the
CSEL instruction.

> > Control dependencies pair normally with other types of barriers.
> > That said, please note that neither READ_ONCE() nor WRITE_ONCE()
> > are optional! Without the READ_ONCE(), the compiler might combine the
> > load from 'a' with other loads from 'a'.  Without the WRITE_ONCE(),
> > the compiler might combine the store to 'b' with other stores to 'b'.
> 
> This part only shows some things that could go wrong.  It also assumes
> things that aren't true in general, like, the compiler cannot combine
> a WRITE_ONCE with another store.

Please give me an example of a situation where the compiler can combine
a WRITE_ONCE() volatile store with some other store.

> > Either can result in highly counterintuitive effects on ordering.
> 
> Yes, and part of that is that the expectations are wrong, are based on
> fundamental untruths.

Yes, as used in the Linux kernel, control dependencies are tricky
and fragile.  They depend on adhering to strict coding standards and
they also depend on particular compiler implementations.  Not exactly
an optimal situation, but it is sadly the world we currently live in.

> > Worse yet, if the compiler is able to prove (say) that the value of
> > variable 'a' is always non-zero, it would be well within its rights
> > to optimize the original example by eliminating the "if" statement
> > as follows:
> > 
> > 	q = a;
> > 	b = 1;  /* BUG: Compiler and CPU can both reorder!!! */
> > 
> > So don't leave out the READ_ONCE().
> 
> If you do have the READ_ONCE, the compiler has to do the read -- but if
> the compiler still knows "a" contains a non-zero value (yes, this can
> still happen, if a zero would lead to undefined behaviour for example),
> you do not get the dependency between the read and write you may want.

Good point, and I should add that.  Maybe something as silly as the
following?

	q = READ_ONCE(a);
	if (q)
		WRITE_ONCE(b, 1);
	c = 5 / q;

Then the divide tells the compiler that q is non-zero, so it is within
its rights to transform to the following:

	q = READ_ONCE(a);
	WRITE_ONCE(b, 1);  // Which the hardware might reorder.
	c = 5 / q;

Or is there a better example?

> If you *do* want that dependency.  Often that WRITE_ONCE is all you want
> in such a case.

I may have lost you on this one.

> [ snip a lot ]
> 
> > More generally, although READ_ONCE() does force
> > the compiler to actually emit code for a given load, it does not force
> > the compiler to use the results.
> 
> Exactly.  And it does tell the compiler the result could be anything
> (because it is a read of volatile memory), but other things in the
> program, or caused by compiler transforms, can change this.
> 
> If your program says
> 
> 	X;
> 
> somewhere, the compiler is perfectly within its right to change that to
> 
> 	if (q)
> 		X;
> 	else
> 		X;
> 
> and then optimise the two occurences of X separately.  Now if X contains
> some conditional based on "q" itself (which btw makes such a compiler
> transform more likely!), things you may not want or expect can happen.

Agreed, and this sort of thing is one of the reasons why I discourage
carrying address/data dependencies through integers.  (Yes, you
can construct similar things for pointers, but you have to work a
bit harder.  And I am pushing for ways to tell the compiler about the
carried dependency, though this is going quite a bit more slowly than
I would like.)

> Lastly:
> 
> >   (*) Compilers do not understand control dependencies.  It is therefore
> >       your job to ensure that they do not break your code.
> 
> Compilers understand you want exactly what you wrote.  If you write
> something other than what you want, you only will get what you want by
> pure luck.

But we have to make things work with the tools that we have.  It would
be nice to have a compiler that could be told about control dependencies,
but in the meantime, we have what we have.

But with the increasing leveraging of undefined behavior to enable
optimizations in advance of the undefined behavior, compilers are less
and less likely to give you exactly what you wrote.  Which certainly
does not ease the task of debugging!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-01 15:28     ` Mathieu Desnoyers
@ 2021-10-01 22:53       ` Paul E. McKenney
  0 siblings, 0 replies; 35+ messages in thread
From: Paul E. McKenney @ 2021-10-01 22:53 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Segher Boessenkool, Will Deacon, Peter Zijlstra, linux-kernel,
	Linus Torvalds, Alan Stern, Andrea Parri, Boqun Feng,
	Nicholas Piggin, David Howells, j alglave, luc maranget, akiyks,
	linux-toolchains, linux-arch

On Fri, Oct 01, 2021 at 11:28:58AM -0400, Mathieu Desnoyers wrote:
> ----- On Sep 29, 2021, at 7:57 PM, paulmck paulmck@kernel.org wrote:
> 
> > On Wed, Sep 29, 2021 at 04:47:03PM -0500, Segher Boessenkool wrote:
> >> Hi!
> >> 
> >> On Tue, Sep 28, 2021 at 05:15:07PM -0400, Mathieu Desnoyers wrote:
> >> > C99 describes that accessing volatile objects are side-effects, and that
> >> > "at certain specified points in the execution sequence called sequence
> >> > points, all side effects of previous evaluations shall be complete
> >> > and no side effects of subsequent evaluations shall have taken
> >> > place". [2]
> >> 
> >> But note that the kernel explicitly uses C89 (with GNU extensions).
> >> Side effects are largely equal there though.
> >> 
> >> Also note that there may no place in the generated machine code that
> >> corresponds exactly to some sequence point.  Sequence points are a
> >> concept that applies to the source program and how that executes on the
> >> abstract machine.
> > 
> > Plus the "as if" rule rears its ugly head in many of these situations.
> > 
> >> > +Because ctrl_dep emits distinct asm volatile within each leg of the if
> >> > +statement, the compiler cannot transform the two writes to 'b' into a
> >> > +conditional-move (cmov) instruction, thus ensuring the presence of a
> >> > +conditional branch.  Also because the ctrl_dep emits asm volatile within
> >> > +each leg of the if statement, the compiler cannot move the write to 'c'
> >> > +before the conditional branch.
> >> 
> >> I think your reasoning here misses some things.  So many that I don't
> >> know where to start to list them, every "because" and "thus" here does
> >> not follow, and even the statements of fact are not a given.
> >> 
> >> Why do you want a conditional branch insn at all, anyway?  You really
> >> want something else as far as I can see.
> > 
> > Because at the assembly language level on some architectures, a
> > conditional branch instruction provides weak but very real and very
> > useful memory-ordering properties.  Such a branch orders all loads
> > whose return values feed into the branch condition before any stores
> > that execute after the branch does (regardless of whether or not the
> > branch was taken).  And this is all the ordering that is required for
> > the use cases that Mathieu is worried about.
> > 
> > Yes, you can use explicit memory-barrier or acquire-load instructions,
> > but those incur more overhead on some types of hardware.  The code in
> > question is on a hotpath and is thus performance-critical.
> > 
> > It would be nice to be able to somehow tell the compiler exactly
> > what the ordering constraints are ("this particular load must be
> > ordered before these particular stores") and then let it (1) figure
> > out that a conditional branch will do the trick and (2) generate the
> > code accordingly.  But last I checked, this was not going to happen any
> > time soon.  So for the time being, we have to live within the current
> > capability of the tools that are available to us.
> > 
> > Linus points out that in all the actual control-dependent code in
> > the Linux kernel, the compiler is going to be hard-pressed to fail
> > to emit the required branch.  (Or in the case of ARMv8, the required
> > conditional-move instruction.)
> > 
> > Mathieu, for his part, recently read the relevant portions of
> > memory-barriers.txt (reproduced below) and would like to simplify these
> > coding guidlines, which, speaking as the author of those guidelines,
> > would be an extremely good thing.  His patches are attempting to move
> > us in that direction.
> > 
> > Alternatives include: (1) Using acquire loads or memory barriers
> > and accepting the loss in performance, but giving the compiler much
> > less leeway, (2) Ripping all of the two-legged "if" examples from
> > memory-barriers.txt and restricting control dependencies to else-less
> > "if" statements, again giving the compiler less leeway, and (3) Your
> > ideas here.
> > 
> > Does that help, or am I missing your point?
> 
> Thanks Paul, it does help explaining the motivation for relying on
> control dependencies for some fast-path memory ordering in the kernel.
> 
> And yes, my main goal is to simplify the coding guide lines, but I have
> not found any example of bad generated code in the tree kernel at this
> point. In some cases (e.g. uses of smp_acquire__after_ctrl_dep()) it's
> mainly thanks to luck though.
> 
> There is another alternative we could list here: implement ctrl_dep_true(),
> ctrl_dep_false() and ctrl_dep(), which would respectively ensure a
> control dependency on the then leg, on the else leg, or on both legs of
> a conditional expression evaluation.
> 
> > 
> >> It is essential here that there is a READ_ONCE and the WRITE_ONCE.
> >> Those things might make it work the way you want, but as Linus says this
> >> is all way too subtle.  Can you include the *_ONCE into the primitive
> >> itself somehow?
> > 
> > Actually, if the store is not involved in a data race, the WRITE_ONCE()
> > is not needed.  And in that case, the compiler is much less able to
> > fail to provide the needed ordering.  (No, the current documentation
> > does not reflect this.)  But if there is a data race, then your point
> > is right on the mark -- that WRITE_ONCE() cannot be safely omitted.
> > 
> > But you are absolutely right that the READ_ONCE() or equivalent is not
> > at all optional.  An example of an acceptable equivalent is an atomic
> > read-modify-write operation such as atomic_xchg_relaxed().
> > 
> > The question about whether the READ_ONCE() and WRITE_ONCE() can be
> > incorporated into the macro I leave to Mathieu.  I can certainly see
> > serious benefits from this approach, at least from a compiler viewpoint.
> > I must reserve judgment on usability until I see a proposal.
> 
> [...]
> 
> After having audited thoroughly all obviously documented control dependencies
> in the kernel tree, I'm not sure that including the READ_ONCE() and WRITE_ONCE()
> with the ctrl_dep() macro is a good idea, because in some cases there is
> calculation to be done on the result of the READ_ONCE() (e.g. through
> a static inline) before handing it over to the conditional expression.
> In other cases many stores are being done after the control dependency, e.g.:
> 
> kernel/events/ring_buffer.c:__perf_output_begin()
> 
>         do {
>                 tail = READ_ONCE(rb->user_page->data_tail);
>                 offset = head = local_read(&rb->head);
>                 if (!rb->overwrite) {
>                         if (unlikely(!ring_buffer_has_space(head, tail,
>                                                             perf_data_size(rb),
>                                                             size, backward)))
>                                 goto fail;
>                 }
> 
>                 /*
>                  * The above forms a control dependency barrier separating the
>                  * @tail load above from the data stores below. Since the @tail
>                  * load is required to compute the branch to fail below.
>                  *
>                  * A, matches D; the full memory barrier userspace SHOULD issue
>                  * after reading the data and before storing the new tail
>                  * position.
>                  *
>                  * See perf_output_put_handle().
>                  */
> 
>                 if (!backward)
>                         head += size;
>                 else
>                         head -= size;
>         } while (local_cmpxchg(&rb->head, offset, head) != offset);
> 
>         if (backward) {
>                 offset = head;
>                 head = (u64)(-head);
>         }
> 
>         /*
>          * We rely on the implied barrier() by local_cmpxchg() to ensure
>          * none of the data stores below can be lifted up by the compiler.
>          */
> 
> [...]

Indeed, these things do not necessarily fit a nice simple pattern.

> Note that I suspect that this control dependency documentation could be
> improved to state that the first control dependency is with the following
> local_cmpxchg store, which itself has a control dependency (when it evaluates
> to false) with the following stores to the ring buffer. Those are not volatile
> stores, but the "memory" clobber with the local_cmpxchg should ensure that
> following stores are after the local_cmpxchg in program order.
> 
> One other thing we could do to improve things slightly would be to turn
> smp_acquire__after_ctrl_dep() into something which really is an acquire
> in all cases, which may not currently be true if the compiler finds a
> matching barrier()/smp_rmb() in the other leg after the conditional
> expression.

Tools to post-process binaries have been suggested in other venues,
but there would still need to be a way to tell the tool what the
constraints are.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-01 19:10     ` Segher Boessenkool
  2021-10-01 22:50       ` Paul E. McKenney
@ 2021-10-02 14:29       ` Alan Stern
  1 sibling, 0 replies; 35+ messages in thread
From: Alan Stern @ 2021-10-02 14:29 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Paul E. McKenney, Mathieu Desnoyers, will, Peter Zijlstra,
	linux-kernel, Linus Torvalds, parri.andrea, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, linux-toolchains,
	linux-arch

On Fri, Oct 01, 2021 at 02:10:08PM -0500, Segher Boessenkool wrote:
> Compilers understand you want exactly what you wrote.  If you write
> something other than what you want, you only will get what you want by
> pure luck.

The problem is that at times you _can't_ write what you want because the
language offers no way to express it.

Alan

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-01 16:35         ` Linus Torvalds
@ 2021-10-10 14:02           ` Florian Weimer
  2021-10-14  0:01             ` Paul E. McKenney
  0 siblings, 1 reply; 35+ messages in thread
From: Florian Weimer @ 2021-10-10 14:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Segher Boessenkool, Will Deacon, paulmck,
	Peter Zijlstra, linux-kernel, Alan Stern, Andrea Parri,
	Boqun Feng, Nicholas Piggin, David Howells, j alglave,
	luc maranget, akiyks, linux-toolchains, linux-arch

* Linus Torvalds:

> On Fri, Oct 1, 2021 at 9:26 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> Will any conditional branch do, or is it necessary that it depends in
>> some way on the data read?
>
> The condition needs to be dependent on the read.
>
> (Easy way to see it: if the read isn't related to the conditional or
> write data/address, the read could just be delayed to after the
> condition and the store had been done).

That entirely depends on how the hardware is specified to work.  And
the hardware could recognize certain patterns as always producing the
same condition codes, e.g., AND with zero.  Do such tests still count?
It depends on what the specification says.

What I really dislike about this: Operators like & and < now have side
effects, and is no longer possible to reason about arithmetic
expressions in isolation.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-10 14:02           ` Florian Weimer
@ 2021-10-14  0:01             ` Paul E. McKenney
  2021-10-14  2:14               ` Alan Stern
  2021-10-14 15:58               ` Florian Weimer
  0 siblings, 2 replies; 35+ messages in thread
From: Paul E. McKenney @ 2021-10-14  0:01 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Linus Torvalds, Mathieu Desnoyers, Segher Boessenkool,
	Will Deacon, Peter Zijlstra, linux-kernel, Alan Stern,
	Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
	j alglave, luc maranget, akiyks, linux-toolchains, linux-arch

On Sun, Oct 10, 2021 at 04:02:02PM +0200, Florian Weimer wrote:
> * Linus Torvalds:
> 
> > On Fri, Oct 1, 2021 at 9:26 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> Will any conditional branch do, or is it necessary that it depends in
> >> some way on the data read?
> >
> > The condition needs to be dependent on the read.
> >
> > (Easy way to see it: if the read isn't related to the conditional or
> > write data/address, the read could just be delayed to after the
> > condition and the store had been done).
> 
> That entirely depends on how the hardware is specified to work.  And
> the hardware could recognize certain patterns as always producing the
> same condition codes, e.g., AND with zero.  Do such tests still count?
> It depends on what the specification says.
> 
> What I really dislike about this: Operators like & and < now have side
> effects, and is no longer possible to reason about arithmetic
> expressions in isolation.

Is there a reasonable syntax that might help with these issues?

Yes, I know, we for sure have conflicting constraints on "reasonable"
on copy on this email.  What else is new?  ;-)

I could imagine a tag of some sort on the load and store, linking the
operations that needed to be ordered.  You would also want that same
tag on any conditional operators along the way?  Or would the presence
of the tags on the load and store suffice?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-14  0:01             ` Paul E. McKenney
@ 2021-10-14  2:14               ` Alan Stern
  2021-10-14 16:14                 ` Paul E. McKenney
  2021-10-14 15:58               ` Florian Weimer
  1 sibling, 1 reply; 35+ messages in thread
From: Alan Stern @ 2021-10-14  2:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Florian Weimer, Linus Torvalds, Mathieu Desnoyers,
	Segher Boessenkool, Will Deacon, Peter Zijlstra, linux-kernel,
	Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
	j alglave, luc maranget, akiyks, linux-toolchains, linux-arch

On Wed, Oct 13, 2021 at 05:01:04PM -0700, Paul E. McKenney wrote:
> On Sun, Oct 10, 2021 at 04:02:02PM +0200, Florian Weimer wrote:
> > * Linus Torvalds:
> > 
> > > On Fri, Oct 1, 2021 at 9:26 AM Florian Weimer <fweimer@redhat.com> wrote:
> > >>
> > >> Will any conditional branch do, or is it necessary that it depends in
> > >> some way on the data read?
> > >
> > > The condition needs to be dependent on the read.
> > >
> > > (Easy way to see it: if the read isn't related to the conditional or
> > > write data/address, the read could just be delayed to after the
> > > condition and the store had been done).
> > 
> > That entirely depends on how the hardware is specified to work.  And
> > the hardware could recognize certain patterns as always producing the
> > same condition codes, e.g., AND with zero.  Do such tests still count?
> > It depends on what the specification says.
> > 
> > What I really dislike about this: Operators like & and < now have side
> > effects, and is no longer possible to reason about arithmetic
> > expressions in isolation.
> 
> Is there a reasonable syntax that might help with these issues?
> 
> Yes, I know, we for sure have conflicting constraints on "reasonable"
> on copy on this email.  What else is new?  ;-)
> 
> I could imagine a tag of some sort on the load and store, linking the
> operations that needed to be ordered.  You would also want that same
> tag on any conditional operators along the way?  Or would the presence
> of the tags on the load and store suffice?

Here's a easy cop-out.  Imagine a version of READ_ONCE that is 
equivalent to:

	a normal READ_ONCE on TSO architectures,

	a load-acquire on more weakly ordered architectures.

Call it READ_ONCE_FOR_COND, for the sake of argument.  Then as long as 
people are careful to use READ_ONCE_FOR_COND when loading the values 
that a conditional expression depends on, and WRITE_ONCE for the 
important stores in the branches of the "if" statement, all 
architectures will have the desired ordering.  (In fact, if there are 
multiple loads involved in the condition then only the last one has to 
be READ_ONCE_FOR_COND; the others can just be READ_ONCE.)

Of course, this is not optimal on non-TSO archictecture.  That's why I 
called it a cop-out.  But at least it is simple and easy.

Alan Stern

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-14  0:01             ` Paul E. McKenney
  2021-10-14  2:14               ` Alan Stern
@ 2021-10-14 15:58               ` Florian Weimer
  2021-10-14 16:23                 ` Paul E. McKenney
  1 sibling, 1 reply; 35+ messages in thread
From: Florian Weimer @ 2021-10-14 15:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Mathieu Desnoyers, Segher Boessenkool,
	Will Deacon, Peter Zijlstra, linux-kernel, Alan Stern,
	Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
	j alglave, luc maranget, akiyks, linux-toolchains, linux-arch

* Paul E. McKenney:

> On Sun, Oct 10, 2021 at 04:02:02PM +0200, Florian Weimer wrote:
>> * Linus Torvalds:
>> 
>> > On Fri, Oct 1, 2021 at 9:26 AM Florian Weimer <fweimer@redhat.com> wrote:
>> >>
>> >> Will any conditional branch do, or is it necessary that it depends in
>> >> some way on the data read?
>> >
>> > The condition needs to be dependent on the read.
>> >
>> > (Easy way to see it: if the read isn't related to the conditional or
>> > write data/address, the read could just be delayed to after the
>> > condition and the store had been done).
>> 
>> That entirely depends on how the hardware is specified to work.  And
>> the hardware could recognize certain patterns as always producing the
>> same condition codes, e.g., AND with zero.  Do such tests still count?
>> It depends on what the specification says.
>> 
>> What I really dislike about this: Operators like & and < now have side
>> effects, and is no longer possible to reason about arithmetic
>> expressions in isolation.
>
> Is there a reasonable syntax that might help with these issues?

Is this really a problem of syntax?

> Yes, I know, we for sure have conflicting constraints on "reasonable"
> on copy on this email.  What else is new?  ;-)
>
> I could imagine a tag of some sort on the load and store, linking the
> operations that needed to be ordered.  You would also want that same
> tag on any conditional operators along the way?  Or would the presence
> of the tags on the load and store suffice?

If the load is assigned to a local variable whose address is not taken
and which is only assigned this once, it could be used to label the
store.  Then the compiler checks if all paths from the load to the
store feature a condition that depends on the local variable (where
qualifying conditions probably depend on the architecture).  If it
can't prove that is the case, it emits a fake no-op condition that
triggers the hardware barrier.  This formulation has the advantage
that it does not add side effects to operators like <.  It even
generalizes to different barrier-implying instructions besides
conditional branches.

But I'm not sure if all this complexity will be a tangible improvement
over just using that no-op condition all the time (whether implied by
READ_ONCE, or in a separate ctrl_dep macro).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-14  2:14               ` Alan Stern
@ 2021-10-14 16:14                 ` Paul E. McKenney
  0 siblings, 0 replies; 35+ messages in thread
From: Paul E. McKenney @ 2021-10-14 16:14 UTC (permalink / raw)
  To: Alan Stern
  Cc: Florian Weimer, Linus Torvalds, Mathieu Desnoyers,
	Segher Boessenkool, Will Deacon, Peter Zijlstra, linux-kernel,
	Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
	j alglave, luc maranget, akiyks, linux-toolchains, linux-arch

On Wed, Oct 13, 2021 at 10:14:31PM -0400, Alan Stern wrote:
> On Wed, Oct 13, 2021 at 05:01:04PM -0700, Paul E. McKenney wrote:
> > On Sun, Oct 10, 2021 at 04:02:02PM +0200, Florian Weimer wrote:
> > > * Linus Torvalds:
> > > 
> > > > On Fri, Oct 1, 2021 at 9:26 AM Florian Weimer <fweimer@redhat.com> wrote:
> > > >>
> > > >> Will any conditional branch do, or is it necessary that it depends in
> > > >> some way on the data read?
> > > >
> > > > The condition needs to be dependent on the read.
> > > >
> > > > (Easy way to see it: if the read isn't related to the conditional or
> > > > write data/address, the read could just be delayed to after the
> > > > condition and the store had been done).
> > > 
> > > That entirely depends on how the hardware is specified to work.  And
> > > the hardware could recognize certain patterns as always producing the
> > > same condition codes, e.g., AND with zero.  Do such tests still count?
> > > It depends on what the specification says.
> > > 
> > > What I really dislike about this: Operators like & and < now have side
> > > effects, and is no longer possible to reason about arithmetic
> > > expressions in isolation.
> > 
> > Is there a reasonable syntax that might help with these issues?
> > 
> > Yes, I know, we for sure have conflicting constraints on "reasonable"
> > on copy on this email.  What else is new?  ;-)
> > 
> > I could imagine a tag of some sort on the load and store, linking the
> > operations that needed to be ordered.  You would also want that same
> > tag on any conditional operators along the way?  Or would the presence
> > of the tags on the load and store suffice?
> 
> Here's a easy cop-out.  Imagine a version of READ_ONCE that is 
> equivalent to:
> 
> 	a normal READ_ONCE on TSO architectures,
> 
> 	a load-acquire on more weakly ordered architectures.
> 
> Call it READ_ONCE_FOR_COND, for the sake of argument.  Then as long as 
> people are careful to use READ_ONCE_FOR_COND when loading the values 
> that a conditional expression depends on, and WRITE_ONCE for the 
> important stores in the branches of the "if" statement, all 
> architectures will have the desired ordering.  (In fact, if there are 
> multiple loads involved in the condition then only the last one has to 
> be READ_ONCE_FOR_COND; the others can just be READ_ONCE.)
> 
> Of course, this is not optimal on non-TSO archictecture.  That's why I 
> called it a cop-out.  But at least it is simple and easy.

That is the ARMv8 approach in CONFIG_LTO=y kernels.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-14 15:58               ` Florian Weimer
@ 2021-10-14 16:23                 ` Paul E. McKenney
  2021-10-14 18:19                   ` Florian Weimer
  0 siblings, 1 reply; 35+ messages in thread
From: Paul E. McKenney @ 2021-10-14 16:23 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Linus Torvalds, Mathieu Desnoyers, Segher Boessenkool,
	Will Deacon, Peter Zijlstra, linux-kernel, Alan Stern,
	Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
	j alglave, luc maranget, akiyks, linux-toolchains, linux-arch

On Thu, Oct 14, 2021 at 05:58:16PM +0200, Florian Weimer wrote:
> * Paul E. McKenney:
> 
> > On Sun, Oct 10, 2021 at 04:02:02PM +0200, Florian Weimer wrote:
> >> * Linus Torvalds:
> >> 
> >> > On Fri, Oct 1, 2021 at 9:26 AM Florian Weimer <fweimer@redhat.com> wrote:
> >> >>
> >> >> Will any conditional branch do, or is it necessary that it depends in
> >> >> some way on the data read?
> >> >
> >> > The condition needs to be dependent on the read.
> >> >
> >> > (Easy way to see it: if the read isn't related to the conditional or
> >> > write data/address, the read could just be delayed to after the
> >> > condition and the store had been done).
> >> 
> >> That entirely depends on how the hardware is specified to work.  And
> >> the hardware could recognize certain patterns as always producing the
> >> same condition codes, e.g., AND with zero.  Do such tests still count?
> >> It depends on what the specification says.
> >> 
> >> What I really dislike about this: Operators like & and < now have side
> >> effects, and is no longer possible to reason about arithmetic
> >> expressions in isolation.
> >
> > Is there a reasonable syntax that might help with these issues?
> 
> Is this really a problem of syntax?

No, but we seem to need some way to communicate the control-dependency's
ordering intent to the compiler.  ;-)

> > Yes, I know, we for sure have conflicting constraints on "reasonable"
> > on copy on this email.  What else is new?  ;-)
> >
> > I could imagine a tag of some sort on the load and store, linking the
> > operations that needed to be ordered.  You would also want that same
> > tag on any conditional operators along the way?  Or would the presence
> > of the tags on the load and store suffice?
> 
> If the load is assigned to a local variable whose address is not taken
> and which is only assigned this once, it could be used to label the
> store.  Then the compiler checks if all paths from the load to the
> store feature a condition that depends on the local variable (where
> qualifying conditions probably depend on the architecture).  If it
> can't prove that is the case, it emits a fake no-op condition that
> triggers the hardware barrier.  This formulation has the advantage
> that it does not add side effects to operators like <.  It even
> generalizes to different barrier-implying instructions besides
> conditional branches.

So something like this?

	tagvar = READ_ONCE(a);
	if (tagvar)
		WRITE_ONCE_COND(b, 1, tagvar);

(This seems to me to be an eminently reasonable syntax.)

Or did I miss a turn in there somewhere?

> But I'm not sure if all this complexity will be a tangible improvement
> over just using that no-op condition all the time (whether implied by
> READ_ONCE, or in a separate ctrl_dep macro).

That is an excellent question.  I have no idea what the answer is.  ;-)

						Thanx, Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-14 16:23                 ` Paul E. McKenney
@ 2021-10-14 18:19                   ` Florian Weimer
  2021-10-14 21:09                     ` Paul E. McKenney
  0 siblings, 1 reply; 35+ messages in thread
From: Florian Weimer @ 2021-10-14 18:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Mathieu Desnoyers, Segher Boessenkool,
	Will Deacon, Peter Zijlstra, linux-kernel, Alan Stern,
	Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
	j alglave, luc maranget, akiyks, linux-toolchains, linux-arch

* Paul E. McKenney:

>> > Yes, I know, we for sure have conflicting constraints on "reasonable"
>> > on copy on this email.  What else is new?  ;-)
>> >
>> > I could imagine a tag of some sort on the load and store, linking the
>> > operations that needed to be ordered.  You would also want that same
>> > tag on any conditional operators along the way?  Or would the presence
>> > of the tags on the load and store suffice?
>> 
>> If the load is assigned to a local variable whose address is not taken
>> and which is only assigned this once, it could be used to label the
>> store.  Then the compiler checks if all paths from the load to the
>> store feature a condition that depends on the local variable (where
>> qualifying conditions probably depend on the architecture).  If it
>> can't prove that is the case, it emits a fake no-op condition that
>> triggers the hardware barrier.  This formulation has the advantage
>> that it does not add side effects to operators like <.  It even
>> generalizes to different barrier-implying instructions besides
>> conditional branches.
>
> So something like this?
>
> 	tagvar = READ_ONCE(a);
> 	if (tagvar)
> 		WRITE_ONCE_COND(b, 1, tagvar);

Yes, something like that.  The syntax only makes sense if tagvar is
assigned only once (statically).

> (This seems to me to be an eminently reasonable syntax.)
>
> Or did I miss a turn in there somewhere?

The important bit is that the compiler emits the extra condition when
necessary, and the information in the snippet above seems to provide
enough information to optimize it away in principle, when it's safe.
This assumes that we can actually come up with a concrete model what
triggers the hardware barrier, of course.  For example, if tagvar is
spilled to the stack, is it still possible to apply an effective
condition to it after it is loaded from the stack?  If not, then the
compiler would have to put in a barrier before spilling tagvar if it
is used in any WRITE_ONCE_COND statement.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-14 18:19                   ` Florian Weimer
@ 2021-10-14 21:09                     ` Paul E. McKenney
  2021-10-14 22:36                       ` Linus Torvalds
  0 siblings, 1 reply; 35+ messages in thread
From: Paul E. McKenney @ 2021-10-14 21:09 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Linus Torvalds, Mathieu Desnoyers, Segher Boessenkool,
	Will Deacon, Peter Zijlstra, linux-kernel, Alan Stern,
	Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
	j alglave, luc maranget, akiyks, linux-toolchains, linux-arch

On Thu, Oct 14, 2021 at 08:19:54PM +0200, Florian Weimer wrote:
> * Paul E. McKenney:
> 
> >> > Yes, I know, we for sure have conflicting constraints on "reasonable"
> >> > on copy on this email.  What else is new?  ;-)
> >> >
> >> > I could imagine a tag of some sort on the load and store, linking the
> >> > operations that needed to be ordered.  You would also want that same
> >> > tag on any conditional operators along the way?  Or would the presence
> >> > of the tags on the load and store suffice?
> >> 
> >> If the load is assigned to a local variable whose address is not taken
> >> and which is only assigned this once, it could be used to label the
> >> store.  Then the compiler checks if all paths from the load to the
> >> store feature a condition that depends on the local variable (where
> >> qualifying conditions probably depend on the architecture).  If it
> >> can't prove that is the case, it emits a fake no-op condition that
> >> triggers the hardware barrier.  This formulation has the advantage
> >> that it does not add side effects to operators like <.  It even
> >> generalizes to different barrier-implying instructions besides
> >> conditional branches.
> >
> > So something like this?
> >
> > 	tagvar = READ_ONCE(a);
> > 	if (tagvar)
> > 		WRITE_ONCE_COND(b, 1, tagvar);
> 
> Yes, something like that.  The syntax only makes sense if tagvar is
> assigned only once (statically).
> 
> > (This seems to me to be an eminently reasonable syntax.)
> >
> > Or did I miss a turn in there somewhere?
> 
> The important bit is that the compiler emits the extra condition when
> necessary, and the information in the snippet above seems to provide
> enough information to optimize it away in principle, when it's safe.
> This assumes that we can actually come up with a concrete model what
> triggers the hardware barrier, of course.  For example, if tagvar is
> spilled to the stack, is it still possible to apply an effective
> condition to it after it is loaded from the stack?  If not, then the
> compiler would have to put in a barrier before spilling tagvar if it
> is used in any WRITE_ONCE_COND statement.

In all the weakly ordered architectures I am aware of, spilling to
the stack and reloading preserves the ordering.  The ordering from
the initial load to the spill is an assembly-language data dependency,
the ordering from the spill to the reload is single-variable SC, and
the ordering beyond that is the original control dependency.

						Thanx, Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency
  2021-10-14 21:09                     ` Paul E. McKenney
@ 2021-10-14 22:36                       ` Linus Torvalds
  0 siblings, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2021-10-14 22:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Florian Weimer, Mathieu Desnoyers, Segher Boessenkool,
	Will Deacon, Peter Zijlstra, linux-kernel, Alan Stern,
	Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
	j alglave, luc maranget, akiyks, linux-toolchains, linux-arch

On Thu, Oct 14, 2021 at 5:10 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> In all the weakly ordered architectures I am aware of, spilling to
> the stack and reloading preserves the ordering.  The ordering from
> the initial load to the spill is an assembly-language data dependency,
> the ordering from the spill to the reload is single-variable SC, and
> the ordering beyond that is the original control dependency.

I think the thing about a control dependency is that any way to
optimize it differently only strengthens it.

That was very different from the problems we had with describing the
RCU dependencies - they were data dependencies, and if they could ever
be turned into control dependencies, they would have been weakened.

But the only way to really weaken a control dependency and the write
behind it is to get rid of it entirely.

So turning it into a data dependency (by turning the conditional into
a 'select' instruction, for example) only makes it stronger. And no
amount of register spilling or data movement any other way makes any
difference.

That's why all the examples of what could go wrong were about same
code on both sides of the conditional, which allowed removing the
conditional entirely (or at least moving parts of the "protected" code
to before it.

(The other way to remove the conditional is to just optimize away the
conditional itself, but that's defeated by "READ_ONCE()" being part of
the source of the conditional, and any data or control dependency from
that fundamental "the compiler cannot remove this logic" is always
sufficient).

So I really don't think this is even about "any weakly ordered
architecture". I think this is fundamentally about causality. You
simply cannot make a conditional write visible before the condition
has been resolved, and resolving the condition requires the read to
have happened.

This is not open to "speculation". Not by hardware, not by compilers.

There are only two ways you can break this fundamental construct:

 - outright bugs

 - a perfect oracle

And honestly, if you have a perfect oracle, you're better off making
money playing the lotto than you would ever be doing hardware or
software development, so that second option isn't really even
interesting.

                 Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2021-10-14 22:36 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-28 21:15 [RFC PATCH] LKMM: Add ctrl_dep() macro for control dependency Mathieu Desnoyers
2021-09-29 12:06 ` Marco Elver
2021-10-01 15:45   ` Mathieu Desnoyers
2021-10-01 16:20     ` Linus Torvalds
2021-10-01 17:28       ` Mathieu Desnoyers
2021-10-01 18:18         ` Linus Torvalds
2021-09-29 12:28 ` Florian Weimer
2021-09-29 17:41   ` Segher Boessenkool
2021-09-29 19:46     ` Florian Weimer
2021-10-01 16:13     ` Mathieu Desnoyers
2021-10-01 16:26       ` Florian Weimer
2021-10-01 16:35         ` Linus Torvalds
2021-10-10 14:02           ` Florian Weimer
2021-10-14  0:01             ` Paul E. McKenney
2021-10-14  2:14               ` Alan Stern
2021-10-14 16:14                 ` Paul E. McKenney
2021-10-14 15:58               ` Florian Weimer
2021-10-14 16:23                 ` Paul E. McKenney
2021-10-14 18:19                   ` Florian Weimer
2021-10-14 21:09                     ` Paul E. McKenney
2021-10-14 22:36                       ` Linus Torvalds
2021-09-30 13:28   ` Mathieu Desnoyers
2021-09-29 14:47 ` Linus Torvalds
2021-09-29 14:54   ` Linus Torvalds
2021-09-29 19:50     ` Mathieu Desnoyers
2021-09-29 20:13       ` Mathieu Desnoyers
2021-09-29 19:27   ` Mathieu Desnoyers
2021-09-29 22:14     ` Linus Torvalds
2021-09-29 21:47 ` Segher Boessenkool
2021-09-29 23:57   ` Paul E. McKenney
2021-10-01 15:28     ` Mathieu Desnoyers
2021-10-01 22:53       ` Paul E. McKenney
2021-10-01 19:10     ` Segher Boessenkool
2021-10-01 22:50       ` Paul E. McKenney
2021-10-02 14:29       ` Alan Stern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).