linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] rcu: Remove several redundant memory barriers
@ 2024-05-15 12:53 Frederic Weisbecker
  2024-05-15 12:53 ` [PATCH 1/6] rcu: Remove full ordering on second EQS snapshot Frederic Weisbecker
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Frederic Weisbecker @ 2024-05-15 12:53 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Valentin Schneider, Paul E . McKenney,
	Boqun Feng, Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki,
	Zqiang, rcu

Reviewing Valentin's patchset made me stare at some memory barriers
on the way. Here are some removal proposals. Some may be beneficial at
runtime (the fqs snapshot potentially executes as many smp_mb() calls as
there are online CPUs for each GP). Others are on rarer paths. In any
case they make code reviews clearer, so we don't stumble upon mysterious
barriers.

Thanks.

Frederic Weisbecker (6):
  rcu: Remove full ordering on second EQS snapshot
  rcu: Remove superfluous full memory barrier upon first EQS snapshot
  rcu/exp: Remove superfluous full memory barrier upon first EQS
    snapshot
  rcu: Remove full memory barrier on boot time eqs sanity check
  rcu: Remove full memory barrier on RCU stall printout
  rcu/exp: Remove redundant full memory barrier at the end of GP

 .../Tree-RCU-Memory-Ordering.rst              |  6 +++---
 kernel/rcu/tree.c                             | 21 +++++++------------
 kernel/rcu/tree_exp.h                         | 16 +++++++++++---
 kernel/rcu/tree_stall.h                       |  4 ++--
 4 files changed, 26 insertions(+), 21 deletions(-)

-- 
2.44.0



* [PATCH 1/6] rcu: Remove full ordering on second EQS snapshot
  2024-05-15 12:53 [PATCH 0/6] rcu: Remove several redundant memory barriers Frederic Weisbecker
@ 2024-05-15 12:53 ` Frederic Weisbecker
  2024-05-15 17:32   ` Valentin Schneider
  2024-05-15 12:53 ` [PATCH 2/6] rcu: Remove superfluous full memory barrier upon first " Frederic Weisbecker
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Frederic Weisbecker @ 2024-05-15 12:53 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Valentin Schneider, Paul E . McKenney,
	Boqun Feng, Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki,
	Zqiang, rcu

When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:

* If the GP kthread observes the remote target in an extended quiescent
  state, then that target must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it exits that extended quiescent state. Also the GP kthread must
  observe all accesses performed by the target prior to its entering
  EQS.

or:

* If the GP kthread observes the remote target NOT in an extended
  quiescent state, then the target further entering an extended
  quiescent state must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it enters that extended quiescent state. Also the GP kthread later
  observing that EQS must also observe all accesses performed by the
  target prior to its entering EQS.

This ordering is explicitly performed both on the first EQS snapshot
and on the second one as well through the combination of a preceding
full barrier followed by an acquire read. However the second snapshot's
full memory barrier is redundant and not needed to enforce the above
guarantees:

    GP kthread                  Remote target
    ----                        -----
    // Access prior GP
    WRITE_ONCE(A, 1)
    // first snapshot
    smp_mb()
    x = smp_load_acquire(EQS)
                               // Access prior GP
                               WRITE_ONCE(B, 1)
                               // EQS enter
                               // implied full barrier by atomic_add_return()
                               atomic_add_return(RCU_DYNTICKS_IDX, EQS)
                               // implied full barrier by atomic_add_return()
                               READ_ONCE(A)
    // second snapshot
    y = smp_load_acquire(EQS)
    z = READ_ONCE(B)

If the GP kthread above fails to observe the remote target in EQS
(x not in EQS), the remote target will observe A == 1 after further
entering EQS. Then the second snapshot taken by the GP kthread only
needs to be an acquire read in order to observe z == 1.
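
As a cross-check, here is a litmus-test sketch of this scenario (illustrative
only, not taken from the kernel sources: a/b/eqs stand for A/B/EQS above,
0/1 stand for "not in EQS"/"in EQS", and xchg() stands in for the fully
ordered atomic_add_return()). The LKMM should report the "exists" state
below as never reached:

C rcu-gp-second-snapshot

{ }

P0(int *a, int *b, int *eqs)
{
	int r0;
	int r1;
	int r2;

	WRITE_ONCE(*a, 1);		/* access prior to current GP */
	smp_mb();			/* ordering for the first snapshot */
	r0 = smp_load_acquire(eqs);	/* first snapshot (x) */
	r1 = smp_load_acquire(eqs);	/* second snapshot (y), acquire only */
	r2 = READ_ONCE(*b);		/* (z) */
}

P1(int *a, int *b, int *eqs)
{
	int r0;
	int r1;

	WRITE_ONCE(*b, 1);		/* access prior to EQS entry */
	r0 = xchg(eqs, 1);		/* EQS enter, fully ordered RMW */
	r1 = READ_ONCE(*a);		/* per the text above, sees A == 1 if 0:r0=0 */
}

exists (0:r0=0 /\ 0:r1=1 /\ 0:r2=0)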

Therefore remove the needless full memory barrier on the second snapshot.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5e6828132007..58415cdc54f8 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -325,7 +325,7 @@ static bool rcu_dynticks_in_eqs(int snap)
  */
 static bool rcu_dynticks_in_eqs_since(struct rcu_data *rdp, int snap)
 {
-	return snap != rcu_dynticks_snap(rdp->cpu);
+	return snap != ct_dynticks_cpu_acquire(rdp->cpu);
 }
 
 /*
-- 
2.44.0



* [PATCH 2/6] rcu: Remove superfluous full memory barrier upon first EQS snapshot
  2024-05-15 12:53 [PATCH 0/6] rcu: Remove several redundant memory barriers Frederic Weisbecker
  2024-05-15 12:53 ` [PATCH 1/6] rcu: Remove full ordering on second EQS snapshot Frederic Weisbecker
@ 2024-05-15 12:53 ` Frederic Weisbecker
  2024-05-16 15:31   ` Valentin Schneider
  2024-05-15 12:53 ` [PATCH 3/6] rcu/exp: " Frederic Weisbecker
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Frederic Weisbecker @ 2024-05-15 12:53 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Valentin Schneider, Paul E . McKenney,
	Boqun Feng, Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki,
	Zqiang, rcu

When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:

* If the GP kthread observes the remote target in an extended quiescent
  state, then that target must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it exits that extended quiescent state.

or:

* If the GP kthread observes the remote target NOT in an extended
  quiescent state, then the target further entering an extended
  quiescent state must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it enters that extended quiescent state.

This ordering is enforced through a full memory barrier placed right
before taking the first EQS snapshot. However this is superfluous
because the snapshot is taken while holding the target's rnp lock which
provides the necessary ordering through its chain of
smp_mb__after_unlock_lock().

Remove the needless explicit barrier before the snapshot and put a
comment about the implicit barrier newly relied upon here.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 .../Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst    | 6 +++---
 kernel/rcu/tree.c                                          | 7 ++++++-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
index 5750f125361b..728b1e690c64 100644
--- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
@@ -149,9 +149,9 @@ This case is handled by calls to the strongly ordered
 ``atomic_add_return()`` read-modify-write atomic operation that
 is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry
 time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time.
-The grace-period kthread invokes ``rcu_dynticks_snap()`` and
-``rcu_dynticks_in_eqs_since()`` (both of which invoke
-an ``atomic_add_return()`` of zero) to detect idle CPUs.
+The grace-period kthread invokes first ``ct_dynticks_cpu_acquire()``
+(preceded by a full memory barrier) and ``rcu_dynticks_in_eqs_since()``
+(both of which rely on acquire semantics) to detect idle CPUs.
 
 +-----------------------------------------------------------------------+
 | **Quick Quiz**:                                                       |
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 58415cdc54f8..f5354de5644b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -773,7 +773,12 @@ static void rcu_gpnum_ovf(struct rcu_node *rnp, struct rcu_data *rdp)
  */
 static int dyntick_save_progress_counter(struct rcu_data *rdp)
 {
-	rdp->dynticks_snap = rcu_dynticks_snap(rdp->cpu);
+	/*
+	 * Full ordering against accesses prior current GP and also against
+	 * current GP sequence number is enforced by current rnp locking
+	 * with chained smp_mb__after_unlock_lock().
+	 */
+	rdp->dynticks_snap = ct_dynticks_cpu_acquire(rdp->cpu);
 	if (rcu_dynticks_in_eqs(rdp->dynticks_snap)) {
 		trace_rcu_fqs(rcu_state.name, rdp->gp_seq, rdp->cpu, TPS("dti"));
 		rcu_gpnum_ovf(rdp->mynode, rdp);
-- 
2.44.0



* [PATCH 3/6] rcu/exp: Remove superfluous full memory barrier upon first EQS snapshot
  2024-05-15 12:53 [PATCH 0/6] rcu: Remove several redundant memory barriers Frederic Weisbecker
  2024-05-15 12:53 ` [PATCH 1/6] rcu: Remove full ordering on second EQS snapshot Frederic Weisbecker
  2024-05-15 12:53 ` [PATCH 2/6] rcu: Remove superfluous full memory barrier upon first " Frederic Weisbecker
@ 2024-05-15 12:53 ` Frederic Weisbecker
  2024-05-15 12:53 ` [PATCH 4/6] rcu: Remove full memory barrier on boot time eqs sanity check Frederic Weisbecker
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Frederic Weisbecker @ 2024-05-15 12:53 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Valentin Schneider, Paul E . McKenney,
	Boqun Feng, Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki,
	Zqiang, rcu

When the grace period kthread checks the extended quiescent state
counter of a CPU, full ordering is necessary to ensure that either:

* If the GP kthread observes the remote target in an extended quiescent
  state, then that target must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it exits that extended quiescent state.

or:

* If the GP kthread observes the remote target NOT in an extended
  quiescent state, then the target further entering an extended
  quiescent state must observe all accesses prior to the current
  grace period, including the current grace period sequence number, once
  it enters that extended quiescent state.

This ordering is enforced through a full memory barrier placed right
before taking the first EQS snapshot. However this is superfluous
because the snapshot is taken while holding the target's rnp lock which
provides the necessary ordering through its chain of
smp_mb__after_unlock_lock().

Remove the needless explicit barrier before the snapshot and put a
comment about the implicit barrier newly relied upon here.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree_exp.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 8a1d9c8bd9f7..bec24ea6777e 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -357,7 +357,13 @@ static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
 		    !(rnp->qsmaskinitnext & mask)) {
 			mask_ofl_test |= mask;
 		} else {
-			snap = rcu_dynticks_snap(cpu);
+			/*
+			 * Full ordering against accesses prior current GP and
+			 * also against current GP sequence number is enforced
+			 * by current rnp locking with chained
+			 * smp_mb__after_unlock_lock().
+			 */
+			snap = ct_dynticks_cpu_acquire(cpu);
 			if (rcu_dynticks_in_eqs(snap))
 				mask_ofl_test |= mask;
 			else
-- 
2.44.0



* [PATCH 4/6] rcu: Remove full memory barrier on boot time eqs sanity check
  2024-05-15 12:53 [PATCH 0/6] rcu: Remove several redundant memory barriers Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2024-05-15 12:53 ` [PATCH 3/6] rcu/exp: " Frederic Weisbecker
@ 2024-05-15 12:53 ` Frederic Weisbecker
  2024-05-16 17:09   ` Valentin Schneider
  2024-05-15 12:53 ` [PATCH 5/6] rcu: Remove full memory barrier on RCU stall printout Frederic Weisbecker
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Frederic Weisbecker @ 2024-05-15 12:53 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Valentin Schneider, Paul E . McKenney,
	Boqun Feng, Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki,
	Zqiang, rcu

When the boot CPU initializes the per-CPU data on behalf of all possible
CPUs, a sanity check is performed on each of them to make sure none is
initialized in an extended quiescent state.

This check involves a full memory barrier which is useless at this early
boot stage.

Do a plain access instead.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f5354de5644b..02f6f3483482 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4812,7 +4812,7 @@ rcu_boot_init_percpu_data(int cpu)
 	rdp->grpmask = leaf_node_cpu_bit(rdp->mynode, cpu);
 	INIT_WORK(&rdp->strict_work, strict_work_handler);
 	WARN_ON_ONCE(ct->dynticks_nesting != 1);
-	WARN_ON_ONCE(rcu_dynticks_in_eqs(rcu_dynticks_snap(cpu)));
+	WARN_ON_ONCE(rcu_dynticks_in_eqs(ct_dynticks_cpu(cpu)));
 	rdp->barrier_seq_snap = rcu_state.barrier_sequence;
 	rdp->rcu_ofl_gp_seq = rcu_state.gp_seq;
 	rdp->rcu_ofl_gp_state = RCU_GP_CLEANED;
-- 
2.44.0



* [PATCH 5/6] rcu: Remove full memory barrier on RCU stall printout
  2024-05-15 12:53 [PATCH 0/6] rcu: Remove several redundant memory barriers Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2024-05-15 12:53 ` [PATCH 4/6] rcu: Remove full memory barrier on boot time eqs sanity check Frederic Weisbecker
@ 2024-05-15 12:53 ` Frederic Weisbecker
  2024-05-16 17:09   ` Valentin Schneider
  2024-05-15 12:53 ` [PATCH 6/6] rcu/exp: Remove redundant full memory barrier at the end of GP Frederic Weisbecker
  2024-05-15 17:32 ` [PATCH 0/6] rcu: Remove several redundant memory barriers Valentin Schneider
  6 siblings, 1 reply; 18+ messages in thread
From: Frederic Weisbecker @ 2024-05-15 12:53 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Valentin Schneider, Paul E . McKenney,
	Boqun Feng, Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki,
	Zqiang, rcu

RCU stall printout fetches the EQS state of a CPU with a preceding full
memory barrier. However there is nothing to order this read against at
this debugging stage. It is inherently racy when performed remotely.

Do a plain read instead.

This was the last user of rcu_dynticks_snap().

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree.c       | 10 ----------
 kernel/rcu/tree_stall.h |  4 ++--
 2 files changed, 2 insertions(+), 12 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 02f6f3483482..04dde7473613 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -299,16 +299,6 @@ static void rcu_dynticks_eqs_online(void)
 	ct_state_inc(RCU_DYNTICKS_IDX);
 }
 
-/*
- * Snapshot the ->dynticks counter with full ordering so as to allow
- * stable comparison of this counter with past and future snapshots.
- */
-static int rcu_dynticks_snap(int cpu)
-{
-	smp_mb();  // Fundamental RCU ordering guarantee.
-	return ct_dynticks_cpu_acquire(cpu);
-}
-
 /*
  * Return true if the snapshot returned from rcu_dynticks_snap()
  * indicates that RCU is in an extended quiescent state.
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 460efecd077b..4b0e9d7c4c68 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -501,7 +501,7 @@ static void print_cpu_stall_info(int cpu)
 	}
 	delta = rcu_seq_ctr(rdp->mynode->gp_seq - rdp->rcu_iw_gp_seq);
 	falsepositive = rcu_is_gp_kthread_starving(NULL) &&
-			rcu_dynticks_in_eqs(rcu_dynticks_snap(cpu));
+			rcu_dynticks_in_eqs(ct_dynticks_cpu(cpu));
 	rcuc_starved = rcu_is_rcuc_kthread_starving(rdp, &j);
 	if (rcuc_starved)
 		// Print signed value, as negative values indicate a probable bug.
@@ -515,7 +515,7 @@ static void print_cpu_stall_info(int cpu)
 			rdp->rcu_iw_pending ? (int)min(delta, 9UL) + '0' :
 				"!."[!delta],
 	       ticks_value, ticks_title,
-	       rcu_dynticks_snap(cpu) & 0xffff,
+	       ct_dynticks_cpu(cpu) & 0xffff,
 	       ct_dynticks_nesting_cpu(cpu), ct_dynticks_nmi_nesting_cpu(cpu),
 	       rdp->softirq_snap, kstat_softirqs_cpu(RCU_SOFTIRQ, cpu),
 	       data_race(rcu_state.n_force_qs) - rcu_state.n_force_qs_gpstart,
-- 
2.44.0



* [PATCH 6/6] rcu/exp: Remove redundant full memory barrier at the end of GP
  2024-05-15 12:53 [PATCH 0/6] rcu: Remove several redundant memory barriers Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2024-05-15 12:53 ` [PATCH 5/6] rcu: Remove full memory barrier on RCU stall printout Frederic Weisbecker
@ 2024-05-15 12:53 ` Frederic Weisbecker
  2024-05-15 17:32 ` [PATCH 0/6] rcu: Remove several redundant memory barriers Valentin Schneider
  6 siblings, 0 replies; 18+ messages in thread
From: Frederic Weisbecker @ 2024-05-15 12:53 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Valentin Schneider, Paul E . McKenney,
	Boqun Feng, Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki,
	Zqiang, rcu

A full memory barrier is necessary at the end of the expedited grace
period to order:

1) The grace period completion (as reflected by the GP sequence
   number) with all preceding accesses. This pairs with rcu_seq_end()
   performed by the concurrent kworker.

2) The grace period completion with subsequent post-GP update-side
   accesses. This again pairs with rcu_seq_end().

This full barrier is already provided by the final sync_exp_work_done()
test, making the subsequent explicit one redundant. Remove it and
improve comments.
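
For reference, the update-side half of that pairing is the smp_mb() in
rcu_seq_end(), roughly as sketched below (simplified from kernel/rcu/rcu.h,
not necessarily the exact current source):

static inline void rcu_seq_end(unsigned long *sp)
{
	smp_mb(); /* Ensure update-side operation before counter increment. */
	WRITE_ONCE(*sp, rcu_seq_endval(sp));
}

That smp_mb() orders the kworker's update-side accesses before the sequence
number write, while the smp_mb() kept in sync_exp_work_done() orders the
sequence number read before the caller's post-GP accesses.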

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree_exp.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index bec24ea6777e..721cb93b1fec 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -265,7 +265,12 @@ static bool sync_exp_work_done(unsigned long s)
 {
 	if (rcu_exp_gp_seq_done(s)) {
 		trace_rcu_exp_grace_period(rcu_state.name, s, TPS("done"));
-		smp_mb(); /* Ensure test happens before caller kfree(). */
+		/*
+		 * Order GP completion with preceding accesses. Order also GP
+		 * completion with post GP update side accesses. Pairs with
+		 * rcu_seq_end().
+		 */
+		smp_mb();
 		return true;
 	}
 	return false;
@@ -959,7 +964,6 @@ void synchronize_rcu_expedited(void)
 	rnp = rcu_get_root();
 	wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
 		   sync_exp_work_done(s));
-	smp_mb(); /* Work actions happen before return. */
 
 	/* Let the next expedited grace period start. */
 	mutex_unlock(&rcu_state.exp_mutex);
-- 
2.44.0



* Re: [PATCH 0/6] rcu: Remove several redundant memory barriers
  2024-05-15 12:53 [PATCH 0/6] rcu: Remove several redundant memory barriers Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2024-05-15 12:53 ` [PATCH 6/6] rcu/exp: Remove redundant full memory barrier at the end of GP Frederic Weisbecker
@ 2024-05-15 17:32 ` Valentin Schneider
  2024-05-15 23:13   ` Frederic Weisbecker
  6 siblings, 1 reply; 18+ messages in thread
From: Valentin Schneider @ 2024-05-15 17:32 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Paul E . McKenney, Boqun Feng,
	Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

On 15/05/24 14:53, Frederic Weisbecker wrote:
> Reviewing Valentin's patchset made me stare at some memory barriers
> on the way.

Sorry not sorry? :-)

> Here are some removal proposals. Some may be beneficial at
> runtime (the fqs snapshot potentially executes as many smp_mb() calls as
> there are online CPUs for each GP). Others are on rarer paths. In any
> case they make code reviews clearer, so we don't stumble upon mysterious
> barriers.
>
> Thanks.
>
> Frederic Weisbecker (6):
>   rcu: Remove full ordering on second EQS snapshot
>   rcu: Remove superfluous full memory barrier upon first EQS snapshot
>   rcu/exp: Remove superfluous full memory barrier upon first EQS
>     snapshot
>   rcu: Remove full memory barrier on boot time eqs sanity check
>   rcu: Remove full memory barrier on RCU stall printout
>   rcu/exp: Remove redundant full memory barrier at the end of GP
>
>  .../Tree-RCU-Memory-Ordering.rst              |  6 +++---
>  kernel/rcu/tree.c                             | 21 +++++++------------
>  kernel/rcu/tree_exp.h                         | 16 +++++++++++---
>  kernel/rcu/tree_stall.h                       |  4 ++--
>  4 files changed, 26 insertions(+), 21 deletions(-)
>
> --
> 2.44.0



* Re: [PATCH 1/6] rcu: Remove full ordering on second EQS snapshot
  2024-05-15 12:53 ` [PATCH 1/6] rcu: Remove full ordering on second EQS snapshot Frederic Weisbecker
@ 2024-05-15 17:32   ` Valentin Schneider
  0 siblings, 0 replies; 18+ messages in thread
From: Valentin Schneider @ 2024-05-15 17:32 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Paul E . McKenney, Boqun Feng,
	Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

On 15/05/24 14:53, Frederic Weisbecker wrote:
> When the grace period kthread checks the extended quiescent state
> counter of a CPU, full ordering is necessary to ensure that either:
>
> * If the GP kthread observes the remote target in an extended quiescent
>   state, then that target must observe all accesses prior to the current
>   grace period, including the current grace period sequence number, once
>   it exits that extended quiescent state. Also the GP kthread must
>   observe all accesses performed by the target prior to its entering
>   EQS.
>
> or:
>
> * If the GP kthread observes the remote target NOT in an extended
>   quiescent state, then the target further entering an extended
>   quiescent state must observe all accesses prior to the current
>   grace period, including the current grace period sequence number, once
>   it enters that extended quiescent state. Also the GP kthread later
>   observing that EQS must also observe all accesses performed by the
>   target prior to its entering EQS.
>
> This ordering is explicitly performed both on the first EQS snapshot
> and on the second one as well through the combination of a preceding
> full barrier followed by an acquire read. However the second snapshot's
> full memory barrier is redundant and not needed to enforce the above
> guarantees:
>
>     GP kthread                  Remote target
>     ----                        -----
>     // Access prior GP
>     WRITE_ONCE(A, 1)
>     // first snapshot
>     smp_mb()
>     x = smp_load_acquire(EQS)
>                                // Access prior GP
>                                WRITE_ONCE(B, 1)
>                                // EQS enter
>                                // implied full barrier by atomic_add_return()
>                                atomic_add_return(RCU_DYNTICKS_IDX, EQS)
>                                // implied full barrier by atomic_add_return()
>                                READ_ONCE(A)
>     // second snapshot
>     y = smp_load_acquire(EQS)
>     z = READ_ONCE(B)
>
> If the GP kthread above fails to observe the remote target in EQS
> (x not in EQS), the remote target will observe A == 1 after further
> entering EQS. Then the second snapshot taken by the GP kthread only
> needs to be an acquire read in order to observe z == 1.
>
> Therefore remove the needless full memory barrier on the second snapshot.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Still looking at the rest, but at least so far I'm convinced this one makes
sense.

Reviewed-by: Valentin Schneider <vschneid@redhat.com>

> ---
>  kernel/rcu/tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 5e6828132007..58415cdc54f8 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -325,7 +325,7 @@ static bool rcu_dynticks_in_eqs(int snap)
>   */
>  static bool rcu_dynticks_in_eqs_since(struct rcu_data *rdp, int snap)
>  {
> -	return snap != rcu_dynticks_snap(rdp->cpu);
> +	return snap != ct_dynticks_cpu_acquire(rdp->cpu);
>  }
>
>  /*
> --
> 2.44.0



* Re: [PATCH 0/6] rcu: Remove several redundant memory barriers
  2024-05-15 17:32 ` [PATCH 0/6] rcu: Remove several redundant memory barriers Valentin Schneider
@ 2024-05-15 23:13   ` Frederic Weisbecker
  0 siblings, 0 replies; 18+ messages in thread
From: Frederic Weisbecker @ 2024-05-15 23:13 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: LKML, Paul E . McKenney, Boqun Feng, Joel Fernandes,
	Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

On Wed, May 15, 2024 at 07:32:30PM +0200, Valentin Schneider wrote:
> On 15/05/24 14:53, Frederic Weisbecker wrote:
> > Reviewing Valentin's patchset made me stare at some memory barriers
> > on the way.
> 
> Sorry not sorry? :-)

;-))


* Re: [PATCH 2/6] rcu: Remove superfluous full memory barrier upon first EQS snapshot
  2024-05-15 12:53 ` [PATCH 2/6] rcu: Remove superfluous full memory barrier upon first " Frederic Weisbecker
@ 2024-05-16 15:31   ` Valentin Schneider
  2024-05-16 16:08     ` Frederic Weisbecker
  0 siblings, 1 reply; 18+ messages in thread
From: Valentin Schneider @ 2024-05-16 15:31 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Paul E . McKenney, Boqun Feng,
	Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

On 15/05/24 14:53, Frederic Weisbecker wrote:
> When the grace period kthread checks the extended quiescent state
> counter of a CPU, full ordering is necessary to ensure that either:
>
> * If the GP kthread observes the remote target in an extended quiescent
>   state, then that target must observe all accesses prior to the current
>   grace period, including the current grace period sequence number, once
>   it exits that extended quiescent state.
>
> or:
>
> * If the GP kthread observes the remote target NOT in an extended
>   quiescent state, then the target further entering an extended
>   quiescent state must observe all accesses prior to the current
>   grace period, including the current grace period sequence number, once
>   it enters that extended quiescent state.
>
> This ordering is enforced through a full memory barrier placed right
> before taking the first EQS snapshot. However this is superfluous
> because the snapshot is taken while holding the target's rnp lock which
> provides the necessary ordering through its chain of
> smp_mb__after_unlock_lock().
>
> Remove the needless explicit barrier before the snapshot and put a
> comment about the implicit barrier newly relied upon here.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  .../Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst    | 6 +++---
>  kernel/rcu/tree.c                                          | 7 ++++++-
>  2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
> index 5750f125361b..728b1e690c64 100644
> --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
> +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
> @@ -149,9 +149,9 @@ This case is handled by calls to the strongly ordered
>  ``atomic_add_return()`` read-modify-write atomic operation that
>  is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry
>  time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time.
> -The grace-period kthread invokes ``rcu_dynticks_snap()`` and
> -``rcu_dynticks_in_eqs_since()`` (both of which invoke
> -an ``atomic_add_return()`` of zero) to detect idle CPUs.
> +The grace-period kthread invokes first ``ct_dynticks_cpu_acquire()``
> +(preceded by a full memory barrier) and ``rcu_dynticks_in_eqs_since()``
> +(both of which rely on acquire semantics) to detect idle CPUs.
>
>  +-----------------------------------------------------------------------+
>  | **Quick Quiz**:                                                       |
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 58415cdc54f8..f5354de5644b 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -773,7 +773,12 @@ static void rcu_gpnum_ovf(struct rcu_node *rnp, struct rcu_data *rdp)
>   */
>  static int dyntick_save_progress_counter(struct rcu_data *rdp)
>  {
> -	rdp->dynticks_snap = rcu_dynticks_snap(rdp->cpu);

So for PPC, which gets the smp_mb() at the lock acquisition, this is an
"obvious" redundant smp_mb().

For the other archs, per the definition of smp_mb__after_unlock_lock() it
seems implied that UNLOCK+LOCK is a full memory barrier, but I wanted to
see it explicitly stated somewhere. From a bit of spelunking below I still
think it's the case, but is there a "better" source of truth?

  01352fb81658 ("locking: Add an smp_mb__after_unlock_lock() for UNLOCK+BLOCK barrier")
  """
  The Linux kernel has traditionally required that an UNLOCK+LOCK pair act as a
  full memory barrier when either (1) that UNLOCK+LOCK pair was executed by the
  same CPU or task, or (2) the same lock variable was used for the UNLOCK and
  LOCK.
  """

and

  https://lore.kernel.org/all/1436789704-10086-1-git-send-email-will.deacon@arm.com/
  """
  This ordering guarantee is already provided without the barrier on
  all architectures apart from PowerPC
  """

> +	/*
> +	 * Full ordering against accesses prior current GP and also against
                                          ^^^^^
                                          prior to

> +	 * current GP sequence number is enforced by current rnp locking
> +	 * with chained smp_mb__after_unlock_lock().
> +	 */
> +	rdp->dynticks_snap = ct_dynticks_cpu_acquire(rdp->cpu);
>       if (rcu_dynticks_in_eqs(rdp->dynticks_snap)) {
>               trace_rcu_fqs(rcu_state.name, rdp->gp_seq, rdp->cpu, TPS("dti"));
>               rcu_gpnum_ovf(rdp->mynode, rdp);
> --
> 2.44.0



* Re: [PATCH 2/6] rcu: Remove superfluous full memory barrier upon first EQS snapshot
  2024-05-16 15:31   ` Valentin Schneider
@ 2024-05-16 16:08     ` Frederic Weisbecker
  2024-05-16 17:08       ` Valentin Schneider
  0 siblings, 1 reply; 18+ messages in thread
From: Frederic Weisbecker @ 2024-05-16 16:08 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: LKML, Paul E . McKenney, Boqun Feng, Joel Fernandes,
	Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

On Thu, May 16, 2024 at 05:31:40PM +0200, Valentin Schneider wrote:
> On 15/05/24 14:53, Frederic Weisbecker wrote:
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 58415cdc54f8..f5354de5644b 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -773,7 +773,12 @@ static void rcu_gpnum_ovf(struct rcu_node *rnp, struct rcu_data *rdp)
> >   */
> >  static int dyntick_save_progress_counter(struct rcu_data *rdp)
> >  {
> > -	rdp->dynticks_snap = rcu_dynticks_snap(rdp->cpu);
> 
> So for PPC, which gets the smp_mb() at the lock acquisition, this is an
> "obvious" redundant smp_mb().
> 
> For the other archs, per the definition of smp_mb__after_unlock_lock() it
> seems implied that UNLOCK+LOCK is a full memory barrier, but I wanted to
> see it explicitly stated somewhere. From a bit of spelunking below I still
> think it's the case, but is there a "better" source of truth?
> 
>   01352fb81658 ("locking: Add an smp_mb__after_unlock_lock() for UNLOCK+BLOCK barrier")
>   """
>   The Linux kernel has traditionally required that an UNLOCK+LOCK pair act as a
>   full memory barrier when either (1) that UNLOCK+LOCK pair was executed by the
>   same CPU or task, or (2) the same lock variable was used for the UNLOCK and
>   LOCK.
>   """
> 
> and
> 
>   https://lore.kernel.org/all/1436789704-10086-1-git-send-email-will.deacon@arm.com/
>   """
>   This ordering guarantee is already provided without the barrier on
>   all architectures apart from PowerPC
>   """

You seem to have found the accurate information! But I must admit
it is hard to find and it would be welcome to document that properly, for example
in Documentation/memory-barriers.txt.

I think the reason is that it's not supposed to be used outside RCU, perhaps
because its semantics are too fragile for general-purpose use? Even that
could be stated in Documentation/memory-barriers.txt.

Another thing is that its semantics are similar to smp_mb__after_spinlock()
(itself badly documented), although slightly different. I'm not even completely
sure how. I assume that smp_mb__after_spinlock() can just be used once to
produce the required ordering, and subsequent locks on that spinlock don't need
to repeat the barrier to propagate the ordering against what comes before the
smp_mb__after_spinlock(). However IIUC smp_mb__after_unlock_lock() has to be
chained/repeated on every subsequent locking of the spinlock...

Thanks.


> 
> > +	/*
> > +	 * Full ordering against accesses prior current GP and also against
>                                           ^^^^^
>                                           prior to
> 
> > +	 * current GP sequence number is enforced by current rnp locking
> > +	 * with chained smp_mb__after_unlock_lock().
> > +	 */
> > +	rdp->dynticks_snap = ct_dynticks_cpu_acquire(rdp->cpu);
> >       if (rcu_dynticks_in_eqs(rdp->dynticks_snap)) {
> >               trace_rcu_fqs(rcu_state.name, rdp->gp_seq, rdp->cpu, TPS("dti"));
> >               rcu_gpnum_ovf(rdp->mynode, rdp);
> > --
> > 2.44.0
> 


* Re: [PATCH 2/6] rcu: Remove superfluous full memory barrier upon first EQS snapshot
  2024-05-16 16:08     ` Frederic Weisbecker
@ 2024-05-16 17:08       ` Valentin Schneider
  2024-05-17  7:29         ` Andrea Parri
  0 siblings, 1 reply; 18+ messages in thread
From: Valentin Schneider @ 2024-05-16 17:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Paul E . McKenney, Boqun Feng, Joel Fernandes,
	Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

On 16/05/24 18:08, Frederic Weisbecker wrote:
> On Thu, May 16, 2024 at 05:31:40PM +0200, Valentin Schneider wrote:
>> On 15/05/24 14:53, Frederic Weisbecker wrote:
>> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> > index 58415cdc54f8..f5354de5644b 100644
>> > --- a/kernel/rcu/tree.c
>> > +++ b/kernel/rcu/tree.c
>> > @@ -773,7 +773,12 @@ static void rcu_gpnum_ovf(struct rcu_node *rnp, struct rcu_data *rdp)
>> >   */
>> >  static int dyntick_save_progress_counter(struct rcu_data *rdp)
>> >  {
>> > -	rdp->dynticks_snap = rcu_dynticks_snap(rdp->cpu);
>>
>> So for PPC, which gets the smp_mb() at the lock acquisition, this is an
>> "obvious" redundant smp_mb().
>>
>> For the other archs, per the definition of smp_mb__after_unlock_lock() it
>> seems implied that UNLOCK+LOCK is a full memory barrier, but I wanted to
>> see it explicitly stated somewhere. From a bit of spelunking below I still
>> think it's the case, but is there a "better" source of truth?
>>
>>   01352fb81658 ("locking: Add an smp_mb__after_unlock_lock() for UNLOCK+BLOCK barrier")
>>   """
>>   The Linux kernel has traditionally required that an UNLOCK+LOCK pair act as a
>>   full memory barrier when either (1) that UNLOCK+LOCK pair was executed by the
>>   same CPU or task, or (2) the same lock variable was used for the UNLOCK and
>>   LOCK.
>>   """
>>
>> and
>>
>>   https://lore.kernel.org/all/1436789704-10086-1-git-send-email-will.deacon@arm.com/
>>   """
>>   This ordering guarantee is already provided without the barrier on
>>   all architectures apart from PowerPC
>>   """
>
> You seem to have found the accurate information! But I must admit
> it is hard to find and it would be welcome to document that properly, for example
> in Documentation/memory-barriers.txt.
>
> I think the reason is that it's not supposed to be used outside RCU, perhaps
> because its semantics are too fragile for general-purpose use? Even that
> could be stated in Documentation/memory-barriers.txt.
>

That's also what I suspected when I stumbled on

  12d560f4ea87 ("rcu,locking: Privatize smp_mb__after_unlock_lock()")

which removed the references to it from Documentation/memory-barriers.txt

> Another thing is that its semantics are similar to smp_mb__after_spinlock()
> (itself badly documented), although slightly different. I'm not even completely
> sure how. I assume that smp_mb__after_spinlock() can just be used once to
> produce the required ordering, and subsequent locks on that spinlock don't need
> to repeat the barrier to propagate the ordering against what comes before the
> smp_mb__after_spinlock(). However IIUC smp_mb__after_unlock_lock() has to be
> chained/repeated on every subsequent locking of the spinlock...

IIUC (big if) the chaining is a requirement of RCU itself, per:

  2a67e741bbbc ("rcu: Create transitive rnp->lock acquisition functions")

   * Because the rcu_nodes form a tree, the tree traversal locking will observe
   * different lock values, this in turn means that an UNLOCK of one level
   * followed by a LOCK of another level does not imply a full memory barrier;
   * and most importantly transitivity is lost.
   *
   * In order to restore full ordering between tree levels, augment the regular
   * lock acquire functions with smp_mb__after_unlock_lock().
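
The wrappers added by that commit look roughly like the sketch below (a
simplified rendition, not the verbatim kernel source; the real helpers also
cover the irqsave and trylock variants):

#define raw_spin_lock_rcu_node(p)					\
do {									\
	raw_spin_lock(&ACCESS_PRIVATE(p, lock));			\
	smp_mb__after_unlock_lock();					\
} while (0)

So every rnp lock acquisition repeats the barrier, which is what makes the
chaining across tree levels work.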



* Re: [PATCH 4/6] rcu: Remove full memory barrier on boot time eqs sanity check
  2024-05-15 12:53 ` [PATCH 4/6] rcu: Remove full memory barrier on boot time eqs sanity check Frederic Weisbecker
@ 2024-05-16 17:09   ` Valentin Schneider
  0 siblings, 0 replies; 18+ messages in thread
From: Valentin Schneider @ 2024-05-16 17:09 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Paul E . McKenney, Boqun Feng,
	Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

On 15/05/24 14:53, Frederic Weisbecker wrote:
> When the boot CPU initializes the per-CPU data on behalf of all possible
> CPUs, a sanity check is performed on each of them to make sure none is
> initialized in an extended quiescent state.
>
> This check involves a full memory barrier which is useless at this early
> boot stage.
>
> Do a plain access instead.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Reviewed-by: Valentin Schneider <vschneid@redhat.com>



* Re: [PATCH 5/6] rcu: Remove full memory barrier on RCU stall printout
  2024-05-15 12:53 ` [PATCH 5/6] rcu: Remove full memory barrier on RCU stall printout Frederic Weisbecker
@ 2024-05-16 17:09   ` Valentin Schneider
  0 siblings, 0 replies; 18+ messages in thread
From: Valentin Schneider @ 2024-05-16 17:09 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Frederic Weisbecker, Paul E . McKenney, Boqun Feng,
	Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

On 15/05/24 14:53, Frederic Weisbecker wrote:
> RCU stall printout fetches the EQS state of a CPU with a preceding full
> memory barrier. However there is nothing to order this read against at
> this debugging stage. It is inherently racy when performed remotely.
>
> Do a plain read instead.
>
> This was the last user of rcu_dynticks_snap().
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Reviewed-by: Valentin Schneider <vschneid@redhat.com>



* Re: [PATCH 2/6] rcu: Remove superfluous full memory barrier upon first EQS snapshot
  2024-05-16 17:08       ` Valentin Schneider
@ 2024-05-17  7:29         ` Andrea Parri
  2024-05-17 11:40           ` Frederic Weisbecker
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Parri @ 2024-05-17  7:29 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Frederic Weisbecker, LKML, Paul E . McKenney, Boqun Feng,
	Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

> >> > @@ -773,7 +773,12 @@ static void rcu_gpnum_ovf(struct rcu_node *rnp, struct rcu_data *rdp)
> >> >   */
> >> >  static int dyntick_save_progress_counter(struct rcu_data *rdp)
> >> >  {
> >> > -	rdp->dynticks_snap = rcu_dynticks_snap(rdp->cpu);
> >>
> >> So for PPC, which gets the smp_mb() at the lock acquisition, this is an
> >> "obvious" redundant smp_mb().
> >>
> >> For the other archs, per the definition of smp_mb__after_unlock_lock() it
> >> seems implied that UNLOCK+LOCK is a full memory barrier, but I wanted to
> >> see it explicitly stated somewhere. From a bit of spelunking below I still
> >> think it's the case, but is there a "better" source of truth?
> >>
> >>   01352fb81658 ("locking: Add an smp_mb__after_unlock_lock() for UNLOCK+BLOCK barrier")
> >>   """
> >>   The Linux kernel has traditionally required that an UNLOCK+LOCK pair act as a
> >>   full memory barrier when either (1) that UNLOCK+LOCK pair was executed by the
> >>   same CPU or task, or (2) the same lock variable was used for the UNLOCK and
> >>   LOCK.
> >>   """
> >>
> >> and
> >>
> >>   https://lore.kernel.org/all/1436789704-10086-1-git-send-email-will.deacon@arm.com/
> >>   """
> >>   This ordering guarantee is already provided without the barrier on
> >>   all architectures apart from PowerPC
> >>   """
> >
> > You seem to have found the accurate informations! But I must admit
> > they are hard to find and it would be welcome to document that properly, for example
> > in Documentation/memory-barriers.txt
> >
> > I think the reason is that it's not supposed to be used outside RCU, perhaps
> > because its semantics are too fragile to use for general purpose? Even that
> > could be stated along in Documentation/memory-barriers.txt
> >
> 
> That's also what I suspected when I stumbled on
> 
>   12d560f4ea87 ("rcu,locking: Privatize smp_mb__after_unlock_lock()")
> 
> which removed the references to it from Documentation/memory-barriers.txt
> 
> > Another thing is that its semantics are similar to smp_mb__after_spinlock()
> > (itself badly documented), although slightly different. I'm not even completely
> > sure how. I assume that smp_mb__after_spinlock() can be just used once to
> > produce the required ordering and subsequent lock on that spinlock don't need
> > to repeat the barrier to propagate the ordering against what is before the
> > smp_mb__after_spinlock. However IUUC smp_mb__after_unlock_lock() has to be
> > chained/repeated on all subsequent locking of the spinlock...
> 
> IIUC (big if) the chaining is a requirement of RCU itself, per:
> 
>   2a67e741bbbc ("rcu: Create transitive rnp->lock acquisition functions")
> 
>    * Because the rcu_nodes form a tree, the tree traversal locking will observe
>    * different lock values, this in turn means that an UNLOCK of one level
>    * followed by a LOCK of another level does not imply a full memory barrier;
>    * and most importantly transitivity is lost.
>    *
>    * In order to restore full ordering between tree levels, augment the regular
>    * lock acquire functions with smp_mb__after_unlock_lock().
> 

I know my remark may seem a little biased,  ;-) but the semantics of
smp_mb__after_unlock_lock() and smp_mb__after_spinlock() have been
somewhat formally documented in the LKMM.  This means, in particular,
that one can write "litmus tests" with the barriers at stake and then
"run"/check such tests against the _current model.

For example,  (based on inline comments in include/linux/spinlock.h)

$ cat after_spinlock.litmus
C after_spinlock

{ }

P0(int *x, spinlock_t *s)
{
	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
}

P1(int *x, int *y, spinlock_t *s)
{
	int r0;

	spin_lock(s);
	smp_mb__after_spinlock();
	r0 = READ_ONCE(*x);
	WRITE_ONCE(*y, 1);
	spin_unlock(s);
}

P2(int *x, int *y)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*y);
	smp_rmb();
	r2 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 2:r1=1 /\ 2:r2=0)

$ herd7 -conf linux-kernel.cfg after_spinlock.litmus
Test after_spinlock Allowed
States 7
1:r0=0; 2:r1=0; 2:r2=0;
1:r0=0; 2:r1=0; 2:r2=1;
1:r0=0; 2:r1=1; 2:r2=0;
1:r0=0; 2:r1=1; 2:r2=1;
1:r0=1; 2:r1=0; 2:r2=0;
1:r0=1; 2:r1=0; 2:r2=1;
1:r0=1; 2:r1=1; 2:r2=1;
No
Witnesses
Positive: 0 Negative: 7
Condition exists (1:r0=1 /\ 2:r1=1 /\ 2:r2=0)
Observation after_spinlock Never 0 7
Time after_spinlock 0.01
Hash=b377bde8fe3565fcdd0eb2bdfaf3351e

Notice that, according to the current model at least, the state in
the above "exists" clause remains forbidden _after removal of the
smp_mb__after_spinlock() barrier.  In this sense, if you want, the
inline comment (I contributed to) is misleading/incomplete.  :-/

  Andrea


* Re: [PATCH 2/6] rcu: Remove superfluous full memory barrier upon first EQS snapshot
  2024-05-17  7:29         ` Andrea Parri
@ 2024-05-17 11:40           ` Frederic Weisbecker
  2024-05-17 16:27             ` Andrea Parri
  0 siblings, 1 reply; 18+ messages in thread
From: Frederic Weisbecker @ 2024-05-17 11:40 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Valentin Schneider, LKML, Paul E . McKenney, Boqun Feng,
	Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

On Fri, May 17, 2024 at 09:29:14AM +0200, Andrea Parri wrote:
> I know my remark may seem a little biased,  ;-) but the semantics of
> smp_mb__after_unlock_lock() and smp_mb__after_spinlock() have been
> somewhat formally documented in the LKMM.  This means, in particular,
> that one can write "litmus tests" with the barriers at stake and then
> "run"/check such tests against the _current model.
> 
> For example,  (based on inline comments in include/linux/spinlock.h)
> 
> $ cat after_spinlock.litmus
> C after_spinlock
> 
> { }
> 
> P0(int *x, spinlock_t *s)
> {
> 	spin_lock(s);
> 	WRITE_ONCE(*x, 1);
> 	spin_unlock(s);
> }
> 
> P1(int *x, int *y, spinlock_t *s)
> {
> 	int r0;
> 
> 	spin_lock(s);
> 	smp_mb__after_spinlock();
> 	r0 = READ_ONCE(*x);
> 	WRITE_ONCE(*y, 1);
> 	spin_unlock(s);
> }
> 
> P2(int *x, int *y)
> {
> 	int r1;
> 	int r2;
> 
> 	r1 = READ_ONCE(*y);
> 	smp_rmb();
> 	r2 = READ_ONCE(*x);
> }
> 
> exists (1:r0=1 /\ 2:r1=1 /\ 2:r2=0)
> 
> $ herd7 -conf linux-kernel.cfg after_spinlock.litmus
> Test after_spinlock Allowed
> States 7
> 1:r0=0; 2:r1=0; 2:r2=0;
> 1:r0=0; 2:r1=0; 2:r2=1;
> 1:r0=0; 2:r1=1; 2:r2=0;
> 1:r0=0; 2:r1=1; 2:r2=1;
> 1:r0=1; 2:r1=0; 2:r2=0;
> 1:r0=1; 2:r1=0; 2:r2=1;
> 1:r0=1; 2:r1=1; 2:r2=1;
> No
> Witnesses
> Positive: 0 Negative: 7
> Condition exists (1:r0=1 /\ 2:r1=1 /\ 2:r2=0)
> Observation after_spinlock Never 0 7
> Time after_spinlock 0.01
> Hash=b377bde8fe3565fcdd0eb2bdfaf3351e
> 
> Notice that, according to the current model at least, the state in
> the above "exists" clause remains forbidden _after removal of the
> smp_mb__after_spinlock() barrier.  In this sense, if you want, the
> inline comment (I contributed to) is misleading/incomplete.  :-/

Z6.0+pooncelock+poonceLock+pombonce.litmus shows an example of
how full ordering is subtly incomplete without smp_mb__after_spinlock().

But still, smp_mb__after_unlock_lock() is supposed to be weaker than
smp_mb__after_spinlock() and yet I'm failing to produce a litmus test
that is successful with the latter and fails with the former.

For example, and assuming smp_mb__after_unlock_lock() is expected to be
chained across locking, here is a litmus test inspired by
Z6.0+pooncelock+poonceLock+pombonce.litmus that never observes the condition
even though I would expect it should, as opposed to using
smp_mb__after_spinlock():

C smp_mb__after_unlock_lock

{}

P0(int *w, int *x, spinlock_t *mylock)
{
	spin_lock(mylock);
	WRITE_ONCE(*w, 1);
	WRITE_ONCE(*x, 1);
	spin_unlock(mylock);
}

P1(int *x, int *y, spinlock_t *mylock)
{
	int r0;

	spin_lock(mylock);
	smp_mb__after_unlock_lock();
	r0 = READ_ONCE(*x);
	WRITE_ONCE(*y, 1);
	spin_unlock(mylock);
}

P2(int *y, int *z, spinlock_t *mylock)
{
	int r0;

	spin_lock(mylock);
	r0 = READ_ONCE(*y);
	WRITE_ONCE(*z, 1);
	spin_unlock(mylock);
}

P3(int *w, int *z)
{
	int r1;

	WRITE_ONCE(*z, 2);
	smp_mb();
	r1 = READ_ONCE(*w);
}

exists (1:r0=1 /\ 2:r0=1 /\ z=2 /\ 3:r1=0)




* Re: [PATCH 2/6] rcu: Remove superfluous full memory barrier upon first EQS snapshot
  2024-05-17 11:40           ` Frederic Weisbecker
@ 2024-05-17 16:27             ` Andrea Parri
  0 siblings, 0 replies; 18+ messages in thread
From: Andrea Parri @ 2024-05-17 16:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Valentin Schneider, LKML, Paul E . McKenney, Boqun Feng,
	Joel Fernandes, Neeraj Upadhyay, Uladzislau Rezki, Zqiang, rcu

> Z6.0+pooncelock+poonceLock+pombonce.litmus shows an example of
> how full ordering is subtly incomplete without smp_mb__after_spinlock().
> 
> But still, smp_mb__after_unlock_lock() is supposed to be weaker than
> smp_mb__after_spinlock() and yet I'm failing to produce a litmus test
> that is successful with the latter and fails with the former.

smp_mb__after_unlock_lock() is a nop without a matching unlock-lock;
smp_mb__after_spinlock() not quite...

C after_spinlock__vs__after_unlock_lock

{ }

P0(int *x, int *y, spinlock_t *s)
{
	int r0;

	WRITE_ONCE(*x, 1);
	spin_lock(s);
	smp_mb__after_spinlock();
	r0 = READ_ONCE(*y);
	spin_unlock(s);
}

P1(int *x, int *y)
{
	int r1;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r1 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r1=0)


> For example, and assuming smp_mb__after_unlock_lock() is expected to be
> chained across locking, here is a litmus test inspired by
> Z6.0+pooncelock+poonceLock+pombonce.litmus that never observes the condition
> even though I would expect it should, as opposed to using
> smp_mb__after_spinlock():
> 
> C smp_mb__after_unlock_lock
> 
> {}
> 
> P0(int *w, int *x, spinlock_t *mylock)
> {
> 	spin_lock(mylock);
> 	WRITE_ONCE(*w, 1);
> 	WRITE_ONCE(*x, 1);
> 	spin_unlock(mylock);
> }
> 
> P1(int *x, int *y, spinlock_t *mylock)
> {
> 	int r0;
> 
> 	spin_lock(mylock);
> 	smp_mb__after_unlock_lock();
> 	r0 = READ_ONCE(*x);
> 	WRITE_ONCE(*y, 1);
> 	spin_unlock(mylock);
> }
> 
> P2(int *y, int *z, spinlock_t *mylock)
> {
> 	int r0;
> 
> 	spin_lock(mylock);
> 	r0 = READ_ONCE(*y);
> 	WRITE_ONCE(*z, 1);
> 	spin_unlock(mylock);
> }
> 
> P3(int *w, int *z)
> {
> 	int r1;
> 
> 	WRITE_ONCE(*z, 2);
> 	smp_mb();
> 	r1 = READ_ONCE(*w);
> }
> 
> exists (1:r0=1 /\ 2:r0=1 /\ z=2 /\ 3:r1=0)

Here's an informal argument to explain this outcome.  It is not the only
one according to the LKMM, but the first that came to my mind.  And this is
longer than I wished.  TL;DR: Full barriers are strong, really strong.

Note that full memory barriers share the following "strong-fence property":

  A ->full-barrier B

implies

  (SFP) A propagates (aka, is visible) to _every CPU before B executes

(cf. tools/memory-model/Documentation/explanation.txt for details about
the concepts of "propagation" and "execution").

For example, in the snippet above,

  P0:WRITE_ONCE(*w, 1) ->full-barrier P1:spin_unlock(mylock)

since

  P0:spin_unlock(mylock) ->reads-from P1:spin_lock(mylock) ->program-order P1:smp_mb__after_unlock_lock()

acts as a full memory barrier.   (1:r0=1 and 2:r0=1 together determine
the so called critical-sections' order (CSO).)

By contradiction, suppose the "exists" clause above is satisfied.  Then:

  1) P0:WRITE_ONCE(*w, 1) propagates to P3 before P1:spin_unlock(mylock) executes   (SFP)

  2) P1:spin_unlock(mylock) executes before P2:spin_lock(mylock) executes   (CSO)

  3) P2:spin_lock(mylock) executes before P2:WRITE_ONCE(*z, 1) executes  (P2:spin_lock() is an ACQUIRE op)

  4) P2:WRITE_ONCE(*z, 1) executes before P2:WRITE_ONCE(*z, 1) propagates P3  (intuitively, a store is visible to the local CPU before being visible to a remote CPU)

  5) P2:WRITE_ONCE(*z, 1) propagates to P3 before P3:WRITE_ONCE(*z, 2) executes   (z=2)

  6) P3:WRITE_ONCE(*z, 2) executes before P3:WRITE_ONCE(*z, 2) propagates to P0    (a store is visible to the local CPU before being visible to a remote CPU)

  7) P3:WRITE_ONCE(*z, 2) propagates to P0 before P3:READ_ONCE(*w) executes   (SFP)

  8) P3:READ_ONCE(*w) executes before P0:WRITE_ONCE(*w, 1) propagates to P3   (3:r1=0)

Put together, (1-8) gives:

  P0:WRITE_ONCE(*w, 1) propagates to P3 before P0:WRITE_ONCE(*w, 1) propagates to P3

a contradiction.

  Andrea

