* [PATCH v2 0/2] Streamlining treatment of smp_mb__after_unlock_lock
@ 2023-01-26 13:46 Jonas Oberhauser
  2023-01-26 13:46 ` [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po Jonas Oberhauser
  2023-01-26 13:46 ` [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po Jonas Oberhauser
  0 siblings, 2 replies; 39+ messages in thread
From: Jonas Oberhauser @ 2023-01-26 13:46 UTC (permalink / raw)
  To: paulmck
  Cc: stern, parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
	j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel, Jonas Oberhauser

Hi all,

This patchset first makes smp_mb__after_unlock_lock use po-unlock-lock-po
to express its ordering property on UNLOCK+LOCK pairs in a way that is
more uniform with the rest of the model.  This does not affect the
guarantees the fence provides, but if the solution proposed here is
unsatisfactory, two alternatives spring to mind:

1)	one could think about rephrasing the guarantee of this fence to
	more obviously match the code

2)	one could redefine po-unlock-lock-po to more obviously match the
	definition of UNLOCK+LOCK pair in rcupdate.h (note: I haven't
	yet checked every use of po-unlock-lock-po, so that would need to
	be checked)

It then makes ppo a subset of po by moving the extended behavior of this
fence (which links events on two different threads) out of ppo.


I'm also not sure how to correctly credit the helpful comments of the
reviewers on the first iteration of the patch.

Changes since the last patch proposal:

1)	As suggested by Andrea Parri, the patches no longer introduce a new
	relation and instead rely purely on strong-fence.  Unlike his
	suggestion, these patches do not redefine strong-fence but instead
	use mb | gp directly in the case where the extended fence is to be
	excluded.

2)	Following another suggestion by Andrea Parri, the patches no longer
	update any documentation since there are no new relations.

3)	We split the patch logically into two steps as mentioned above.

4)	As suggested by Alan Stern, we explain in the model why the
	mismatch between the UNLOCK+LOCK definition in rcupdate.h and the
	definition of the semantics of the fence in the model is not
	causing any difference in behavior.


Jonas Oberhauser (2):
  tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po
  tools/memory-model: Make ppo a subrelation of po

 tools/memory-model/linux-kernel.cat | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

-- 
2.17.1



* [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po
  2023-01-26 13:46 [PATCH v2 0/2] Streamlining treatment of smp_mb__after_unlock_lock Jonas Oberhauser
@ 2023-01-26 13:46 ` Jonas Oberhauser
  2023-01-26 16:36   ` Alan Stern
  2023-01-26 13:46 ` [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po Jonas Oberhauser
  1 sibling, 1 reply; 39+ messages in thread
From: Jonas Oberhauser @ 2023-01-26 13:46 UTC (permalink / raw)
  To: paulmck
  Cc: stern, parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
	j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel, Jonas Oberhauser

LKMM uses two relations for talking about UNLOCK+LOCK pairings:

	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
	   on the same CPU or immediate lock handovers on the same
	   lock variable

	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
	   literally as described in rcupdate.h#L1002, i.e., even
	   after a sequence of handovers on the same lock variable.

The latter relation is used only once, to provide the guarantee
defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
makes any UNLOCK+LOCK pair followed by the fence behave like a full
barrier.
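
For reference, here is a rough sketch of how the two relations are
written in the model (the first as defined in lock.cat, the second as it
currently appears in the definition of mb in linux-kernel.cat); this is
quoted from memory, so the exact form may differ slightly:

	(* 1) same CPU, or an immediate handover of the lock via rf *)
	let po-unlock-lock-po = po ; [UL] ; (po | rf) ; [LKR] ; po

	(* 2) literal UNLOCK+LOCK pairs, i.e., any handover chain via co *)
	[M] ; po ; [UL] ; (co | po) ; [LKW] ;
		fencerel(After-unlock-lock) ; [M]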

This patch drops this use in favor of using po-unlock-lock-po
everywhere, which unifies the way the model talks about UNLOCK+LOCK
pairings.  At first glance this seems to weaken the guarantee given
by LKMM: When considering a long sequence of lock handovers
such as below, where P0 hands the lock to P1, which hands it to P2,
which finally executes such an after_unlock_lock fence, the mb
relation currently links any stores in the critical section of P0
to instructions P2 executes after its fence, but not so after the
patch.

P0(int *x, int *y, spinlock_t *mylock)
{
        spin_lock(mylock);
        WRITE_ONCE(*x, 2);
        spin_unlock(mylock);
        WRITE_ONCE(*y, 1);
}

P1(int *y, int *z, spinlock_t *mylock)
{
        int r0 = READ_ONCE(*y); // reads 1
        spin_lock(mylock);
        spin_unlock(mylock);
        WRITE_ONCE(*z,1);
}

P2(int *z, int *d, spinlock_t *mylock)
{
        int r1 = READ_ONCE(*z); // reads 1
        spin_lock(mylock);
        spin_unlock(mylock);
        smp_mb__after_unlock_lock();
        WRITE_ONCE(*d,1);
}

P3(int *x, int *d)
{
        WRITE_ONCE(*d,2);
        smp_mb();
        WRITE_ONCE(*x,1);
}

exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)

Nevertheless, the ordering guarantee given in rcupdate.h is actually
not weakened.  This is because the unlock operations along the
sequence of handovers are A-cumulative fences.  They ensure that any
stores that propagate to the CPU performing the first unlock
operation in the sequence must also propagate to every CPU that
performs a subsequent lock operation in the sequence.  Therefore any
such stores will also be ordered correctly by the fence even if only
the final handover is considered a full barrier.

Indeed this patch does not affect the behaviors allowed by LKMM at
all.  The mb relation is used to define ordering through:
1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
   lock-release, rfe, and unlock-acquire orderings each provide hb
2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
   lock-release orderings simply add more fine-grained cumul-fence
   edges to substitute a single strong-fence edge provided by a long
   lock handover sequence
3) mb/strong-fence/pb and various similar uses in the definition of
   data races, where as discussed above any long handover sequence
   can be turned into a sequence of cumul-fence edges that provide
   the same ordering.
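
For the reader following along in linux-kernel.cat, the relations used
above look roughly as follows (a simplified sketch from memory that
drops the [Marked] restrictions and some details, so the exact
definitions may differ):

	let cumul-fence = A-cumul(strong-fence | po-rel) | wmb |
		po-unlock-lock-po
	let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
	let hb = ppo | rfe | ((prop \ id) & int)
	let pb = prop ; strong-fence ; hb*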

Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
---
 tools/memory-model/linux-kernel.cat | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 07f884f9b2bf..6e531457bb73 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -37,8 +37,19 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
 	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
 	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
 	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
-	([M] ; po ; [UL] ; (co | po) ; [LKW] ;
-		fencerel(After-unlock-lock) ; [M])
+(*
+ * Note: The po-unlock-lock-po relation only passes the lock to the direct
+ * successor, perhaps giving the impression that the ordering of the
+ * smp_mb__after_unlock_lock() fence only affects a single lock handover.
+ * However, in a longer sequence of lock handovers, the implicit
+ * A-cumulative release fences of lock-release ensure that any stores that
+ * propagate to one of the involved CPUs before it hands over the lock to
+ * the next CPU will also propagate to the final CPU handing over the lock
+ * to the CPU that executes the fence.  Therefore, all those stores are
+ * also affected by the fence.
+ *)
+	([M] ; po-unlock-lock-po ;
+		[After-unlock-lock] ; po ; [M])
 let gp = po ; [Sync-rcu | Sync-srcu] ; po?
 let strong-fence = mb | gp
 
-- 
2.17.1



* [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-26 13:46 [PATCH v2 0/2] Streamlining treatment of smp_mb__after_unlock_lock Jonas Oberhauser
  2023-01-26 13:46 ` [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po Jonas Oberhauser
@ 2023-01-26 13:46 ` Jonas Oberhauser
  2023-01-26 16:36   ` Alan Stern
  1 sibling, 1 reply; 39+ messages in thread
From: Jonas Oberhauser @ 2023-01-26 13:46 UTC (permalink / raw)
  To: paulmck
  Cc: stern, parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
	j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel, Jonas Oberhauser

As stated in the documentation and implied by its name, the ppo
(preserved program order) relation is intended to link po-earlier
to po-later instructions under certain conditions.  However, a
corner case currently allows instructions to be linked by ppo that
are not executed by the same thread, i.e., instructions are being
linked that have no po relation.

This happens due to the mb/strong-fence relations, which (as one
case) provide order when locks are passed between threads followed
by an smp_mb__after_unlock_lock() fence.  This is illustrated in
the following litmus test (as can be seen when using herd7 with
`doshow ppo`):

P0(int *x, int *y)
{
    spin_lock(x);
    spin_unlock(x);
}

P1(int *x, int *y)
{
    spin_lock(x);
    smp_mb__after_unlock_lock();
    *y = 1;
}

The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
P0 passes a lock to P1 which then uses this fence.
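
For reference, this corner case enters ppo through the fence term of
ppo's definition, which looks roughly as follows (a sketch from memory;
the exact definitions are in linux-kernel.cat):

	let nonrw-fence = strong-fence | po-rel | acq-po
	let fence = nonrw-fence | wmb | rmb
	let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)

Because mb (and hence strong-fence, nonrw-fence, and fence) currently
contains the cross-CPU po-unlock-lock-po;[After-unlock-lock];po term,
ppo can link events on different CPUs.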

The patch makes ppo a subrelation of po by eliminating this possibility
from mb (but not strong-fence) and relying explicitly on mb|gp instead
of strong-fence when defining ppo.

Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
---
 tools/memory-model/linux-kernel.cat | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 6e531457bb73..815fdafacaef 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -36,7 +36,9 @@ let wmb = [W] ; fencerel(Wmb) ; [W]
 let mb = ([M] ; fencerel(Mb) ; [M]) |
 	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
 	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
-	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
+	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M])
+let gp = po ; [Sync-rcu | Sync-srcu] ; po?
+let strong-fence = mb | gp |
 (*
  * Note: The po-unlock-lock-po relation only passes the lock to the direct
  * successor, perhaps giving the impression that the ordering of the
@@ -50,10 +52,9 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
  *)
 	([M] ; po-unlock-lock-po ;
 		[After-unlock-lock] ; po ; [M])
-let gp = po ; [Sync-rcu | Sync-srcu] ; po?
-let strong-fence = mb | gp
 
-let nonrw-fence = strong-fence | po-rel | acq-po
+
+let nonrw-fence = mb | gp | po-rel | acq-po
 let fence = nonrw-fence | wmb | rmb
 let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
 		Before-atomic | After-atomic | Acquire | Release |
-- 
2.17.1



* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-26 13:46 ` [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po Jonas Oberhauser
@ 2023-01-26 16:36   ` Alan Stern
  2023-01-27 14:31     ` Jonas Oberhauser
  0 siblings, 1 reply; 39+ messages in thread
From: Alan Stern @ 2023-01-26 16:36 UTC (permalink / raw)
  To: Jonas Oberhauser
  Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Thu, Jan 26, 2023 at 02:46:04PM +0100, Jonas Oberhauser wrote:
> As stated in the documentation and implied by its name, the ppo
> (preserved program order) relation is intended to link po-earlier
> to po-later instructions under certain conditions.  However, a
> corner case currently allows instructions to be linked by ppo that
> are not executed by the same thread, i.e., instructions are being
> linked that have no po relation.
> 
> This happens due to the mb/strong-fence relations, which (as one
> case) provide order when locks are passed between threads followed
> by an smp_mb__after_unlock_lock() fence.  This is illustrated in
> the following litmus test (as can be seen when using herd7 with
> `doshow ppo`):
> 
> P0(int *x, int *y)
> {
>     spin_lock(x);
>     spin_unlock(x);
> }
> 
> P1(int *x, int *y)
> {
>     spin_lock(x);
>     smp_mb__after_unlock_lock();
>     *y = 1;
> }
> 
> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> P0 passes a lock to P1 which then uses this fence.
> 
> The patch makes ppo a subrelation of po by eliminating this possibility
> from mb (but not strong-fence) and relying explicitly on mb|gp instead
> of strong-fence when defining ppo.
> 
> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> ---

This changes the meaning of the fence relation, which is used in 
w-pre-bounded, w-post-bounded, ww-vis, wr-vis, and rw-xbstar.  Have you 
checked that they won't be affected by the change?

>  tools/memory-model/linux-kernel.cat | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 6e531457bb73..815fdafacaef 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -36,7 +36,9 @@ let wmb = [W] ; fencerel(Wmb) ; [W]
>  let mb = ([M] ; fencerel(Mb) ; [M]) |
>  	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
>  	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> -	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> +	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M])
> +let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> +let strong-fence = mb | gp |
>  (*
>   * Note: The po-unlock-lock-po relation only passes the lock to the direct
>   * successor, perhaps giving the impression that the ordering of the
> @@ -50,10 +52,9 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
>   *)
>  	([M] ; po-unlock-lock-po ;
>  		[After-unlock-lock] ; po ; [M])
> -let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> -let strong-fence = mb | gp
>  
> -let nonrw-fence = strong-fence | po-rel | acq-po
> +

Extra blank line.

> +let nonrw-fence = mb | gp | po-rel | acq-po
>  let fence = nonrw-fence | wmb | rmb
>  let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
>  		Before-atomic | After-atomic | Acquire | Release |
> -- 
> 2.17.1

Alan


* Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po
  2023-01-26 13:46 ` [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po Jonas Oberhauser
@ 2023-01-26 16:36   ` Alan Stern
  2023-01-26 20:08     ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Alan Stern @ 2023-01-26 16:36 UTC (permalink / raw)
  To: Jonas Oberhauser
  Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> 
> 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> 	   on the same CPU or immediate lock handovers on the same
> 	   lock variable
> 
> 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> 	   literally as described in rcupdate.h#L1002, i.e., even
> 	   after a sequence of handovers on the same lock variable.
> 
> The latter relation is used only once, to provide the guarantee
> defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> makes any UNLOCK+LOCK pair followed by the fence behave like a full
> barrier.
> 
> This patch drops this use in favor of using po-unlock-lock-po
> everywhere, which unifies the way the model talks about UNLOCK+LOCK
> pairings.  At first glance this seems to weaken the guarantee given
> by LKMM: When considering a long sequence of lock handovers
> such as below, where P0 hands the lock to P1, which hands it to P2,
> which finally executes such an after_unlock_lock fence, the mb
> relation currently links any stores in the critical section of P0
> to instructions P2 executes after its fence, but not so after the
> patch.
> 
> P0(int *x, int *y, spinlock_t *mylock)
> {
>         spin_lock(mylock);
>         WRITE_ONCE(*x, 2);
>         spin_unlock(mylock);
>         WRITE_ONCE(*y, 1);
> }
> 
> P1(int *y, int *z, spinlock_t *mylock)
> {
>         int r0 = READ_ONCE(*y); // reads 1
>         spin_lock(mylock);
>         spin_unlock(mylock);
>         WRITE_ONCE(*z,1);
> }
> 
> P2(int *z, int *d, spinlock_t *mylock)
> {
>         int r1 = READ_ONCE(*z); // reads 1
>         spin_lock(mylock);
>         spin_unlock(mylock);
>         smp_mb__after_unlock_lock();
>         WRITE_ONCE(*d,1);
> }
> 
> P3(int *x, int *d)
> {
>         WRITE_ONCE(*d,2);
>         smp_mb();
>         WRITE_ONCE(*x,1);
> }
> 
> exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> 
> Nevertheless, the ordering guarantee given in rcupdate.h is actually
> not weakened.  This is because the unlock operations along the
> sequence of handovers are A-cumulative fences.  They ensure that any
> stores that propagate to the CPU performing the first unlock
> operation in the sequence must also propagate to every CPU that
> performs a subsequent lock operation in the sequence.  Therefore any
> such stores will also be ordered correctly by the fence even if only
> the final handover is considered a full barrier.
> 
> Indeed this patch does not affect the behaviors allowed by LKMM at
> all.  The mb relation is used to define ordering through:
> 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
>    lock-release, rfe, and unlock-acquire orderings each provide hb
> 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
>    lock-release orderings simply add more fine-grained cumul-fence
>    edges to substitute a single strong-fence edge provided by a long
>    lock handover sequence
> 3) mb/strong-fence/pb and various similar uses in the definition of
>    data races, where as discussed above any long handover sequence
>    can be turned into a sequence of cumul-fence edges that provide
>    the same ordering.
> 
> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> ---

Reviewed-by: Alan Stern <stern@rowland.harvard.edu>

>  tools/memory-model/linux-kernel.cat | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 07f884f9b2bf..6e531457bb73 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -37,8 +37,19 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
>  	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
>  	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
>  	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> -	([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> -		fencerel(After-unlock-lock) ; [M])
> +(*
> + * Note: The po-unlock-lock-po relation only passes the lock to the direct
> + * successor, perhaps giving the impression that the ordering of the
> + * smp_mb__after_unlock_lock() fence only affects a single lock handover.
> + * However, in a longer sequence of lock handovers, the implicit
> + * A-cumulative release fences of lock-release ensure that any stores that
> + * propagate to one of the involved CPUs before it hands over the lock to
> + * the next CPU will also propagate to the final CPU handing over the lock
> + * to the CPU that executes the fence.  Therefore, all those stores are
> + * also affected by the fence.
> + *)
> +	([M] ; po-unlock-lock-po ;
> +		[After-unlock-lock] ; po ; [M])
>  let gp = po ; [Sync-rcu | Sync-srcu] ; po?
>  let strong-fence = mb | gp
>  
> -- 
> 2.17.1
> 


* Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po
  2023-01-26 16:36   ` Alan Stern
@ 2023-01-26 20:08     ` Paul E. McKenney
  2023-01-26 23:21       ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-26 20:08 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jonas Oberhauser, parri.andrea, will, peterz, boqun.feng,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > 
> > 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > 	   on the same CPU or immediate lock handovers on the same
> > 	   lock variable
> > 
> > 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > 	   literally as described in rcupdate.h#L1002, i.e., even
> > 	   after a sequence of handovers on the same lock variable.
> > 
> > The latter relation is used only once, to provide the guarantee
> > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > barrier.
> > 
> > This patch drops this use in favor of using po-unlock-lock-po
> > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > pairings.  At first glance this seems to weaken the guarantee given
> > by LKMM: When considering a long sequence of lock handovers
> > such as below, where P0 hands the lock to P1, which hands it to P2,
> > which finally executes such an after_unlock_lock fence, the mb
> > relation currently links any stores in the critical section of P0
> > to instructions P2 executes after its fence, but not so after the
> > patch.
> > 
> > P0(int *x, int *y, spinlock_t *mylock)
> > {
> >         spin_lock(mylock);
> >         WRITE_ONCE(*x, 2);
> >         spin_unlock(mylock);
> >         WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *y, int *z, spinlock_t *mylock)
> > {
> >         int r0 = READ_ONCE(*y); // reads 1
> >         spin_lock(mylock);
> >         spin_unlock(mylock);
> >         WRITE_ONCE(*z,1);
> > }
> > 
> > P2(int *z, int *d, spinlock_t *mylock)
> > {
> >         int r1 = READ_ONCE(*z); // reads 1
> >         spin_lock(mylock);
> >         spin_unlock(mylock);
> >         smp_mb__after_unlock_lock();
> >         WRITE_ONCE(*d,1);
> > }
> > 
> > P3(int *x, int *d)
> > {
> >         WRITE_ONCE(*d,2);
> >         smp_mb();
> >         WRITE_ONCE(*x,1);
> > }
> > 
> > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > 
> > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > not weakened.  This is because the unlock operations along the
> > sequence of handovers are A-cumulative fences.  They ensure that any
> > stores that propagate to the CPU performing the first unlock
> > operation in the sequence must also propagate to every CPU that
> > performs a subsequent lock operation in the sequence.  Therefore any
> > such stores will also be ordered correctly by the fence even if only
> > the final handover is considered a full barrier.
> > 
> > Indeed this patch does not affect the behaviors allowed by LKMM at
> > all.  The mb relation is used to define ordering through:
> > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> >    lock-release, rfe, and unlock-acquire orderings each provide hb
> > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> >    lock-release orderings simply add more fine-grained cumul-fence
> >    edges to substitute a single strong-fence edge provided by a long
> >    lock handover sequence
> > 3) mb/strong-fence/pb and various similar uses in the definition of
> >    data races, where as discussed above any long handover sequence
> >    can be turned into a sequence of cumul-fence edges that provide
> >    the same ordering.
> > 
> > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > ---
> 
> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>

A quick spot check showed no change in performance, so thank you both!

Queued for review and further testing.

							Thanx, Paul

> >  tools/memory-model/linux-kernel.cat | 15 +++++++++++++--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index 07f884f9b2bf..6e531457bb73 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -37,8 +37,19 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
> >  	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> >  	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> >  	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> > -	([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> > -		fencerel(After-unlock-lock) ; [M])
> > +(*
> > + * Note: The po-unlock-lock-po relation only passes the lock to the direct
> > + * successor, perhaps giving the impression that the ordering of the
> > + * smp_mb__after_unlock_lock() fence only affects a single lock handover.
> > + * However, in a longer sequence of lock handovers, the implicit
> > + * A-cumulative release fences of lock-release ensure that any stores that
> > + * propagate to one of the involved CPUs before it hands over the lock to
> > + * the next CPU will also propagate to the final CPU handing over the lock
> > + * to the CPU that executes the fence.  Therefore, all those stores are
> > + * also affected by the fence.
> > + *)
> > +	([M] ; po-unlock-lock-po ;
> > +		[After-unlock-lock] ; po ; [M])
> >  let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> >  let strong-fence = mb | gp
> >  
> > -- 
> > 2.17.1
> > 


* Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po
  2023-01-26 20:08     ` Paul E. McKenney
@ 2023-01-26 23:21       ` Paul E. McKenney
  2023-01-27 13:18         ` Jonas Oberhauser
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-26 23:21 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jonas Oberhauser, parri.andrea, will, peterz, boqun.feng,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
> On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> > On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > > 
> > > 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > > 	   on the same CPU or immediate lock handovers on the same
> > > 	   lock variable
> > > 
> > > 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > > 	   literally as described in rcupdate.h#L1002, i.e., even
> > > 	   after a sequence of handovers on the same lock variable.
> > > 
> > > The latter relation is used only once, to provide the guarantee
> > > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > > barrier.
> > > 
> > > This patch drops this use in favor of using po-unlock-lock-po
> > > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > > pairings.  At first glance this seems to weaken the guarantee given
> > > by LKMM: When considering a long sequence of lock handovers
> > > such as below, where P0 hands the lock to P1, which hands it to P2,
> > > which finally executes such an after_unlock_lock fence, the mb
> > > relation currently links any stores in the critical section of P0
> > > to instructions P2 executes after its fence, but not so after the
> > > patch.
> > > 
> > > P0(int *x, int *y, spinlock_t *mylock)
> > > {
> > >         spin_lock(mylock);
> > >         WRITE_ONCE(*x, 2);
> > >         spin_unlock(mylock);
> > >         WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *y, int *z, spinlock_t *mylock)
> > > {
> > >         int r0 = READ_ONCE(*y); // reads 1
> > >         spin_lock(mylock);
> > >         spin_unlock(mylock);
> > >         WRITE_ONCE(*z,1);
> > > }
> > > 
> > > P2(int *z, int *d, spinlock_t *mylock)
> > > {
> > >         int r1 = READ_ONCE(*z); // reads 1
> > >         spin_lock(mylock);
> > >         spin_unlock(mylock);
> > >         smp_mb__after_unlock_lock();
> > >         WRITE_ONCE(*d,1);
> > > }
> > > 
> > > P3(int *x, int *d)
> > > {
> > >         WRITE_ONCE(*d,2);
> > >         smp_mb();
> > >         WRITE_ONCE(*x,1);
> > > }
> > > 
> > > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > > 
> > > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > > not weakened.  This is because the unlock operations along the
> > > sequence of handovers are A-cumulative fences.  They ensure that any
> > > stores that propagate to the CPU performing the first unlock
> > > operation in the sequence must also propagate to every CPU that
> > > performs a subsequent lock operation in the sequence.  Therefore any
> > > such stores will also be ordered correctly by the fence even if only
> > > the final handover is considered a full barrier.
> > > 
> > > Indeed this patch does not affect the behaviors allowed by LKMM at
> > > all.  The mb relation is used to define ordering through:
> > > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> > >    lock-release, rfe, and unlock-acquire orderings each provide hb
> > > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> > >    lock-release orderings simply add more fine-grained cumul-fence
> > >    edges to substitute a single strong-fence edge provided by a long
> > >    lock handover sequence
> > > 3) mb/strong-fence/pb and various similar uses in the definition of
> > >    data races, where as discussed above any long handover sequence
> > >    can be turned into a sequence of cumul-fence edges that provide
> > >    the same ordering.
> > > 
> > > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > > ---
> > 
> > Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
> 
> A quick spot check showed no change in performance, so thank you both!
> 
> Queued for review and further testing.

And testing on https://github.com/paulmckrcu/litmus for litmus tests up
to ten processes and allowing 10 minutes per litmus test got this:

Exact output matches: 5208
!!! Timed out: 38
!!! Unknown primitive: 7

This test compared output with and without your patch.

For the tests with a Results clause, these failed:

	manual/kernel/C-srcu-nest-7.litmus
	manual/kernel/C-srcu-nest-5.litmus
	manual/kernel/C-srcu-nest-6.litmus
	manual/kernel/C-srcu-nest-8.litmus

But all of these will continue to fail until we get Alan's new-age SRCU
patch applied.

Therefore, this constitutes success, so good show thus far on testing!  ;-)

Also, I am going to be pushing the scripts I use to mainline.  They might
not be perfect, but they will be quite useful for this sort of change
to the memory model.

						Thanx, Paul


* Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po
  2023-01-26 23:21       ` Paul E. McKenney
@ 2023-01-27 13:18         ` Jonas Oberhauser
  2023-01-27 15:13           ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Jonas Oberhauser @ 2023-01-27 13:18 UTC (permalink / raw)
  To: paulmck, Alan Stern
  Cc: parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
	j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel



On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
> On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
>> On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
>>> On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
>>>> LKMM uses two relations for talking about UNLOCK+LOCK pairings:
>>>>
>>>> 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
>>>> 	   on the same CPU or immediate lock handovers on the same
>>>> 	   lock variable
>>>>
>>>> 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
>>>> 	   literally as described in rcupdate.h#L1002, i.e., even
>>>> 	   after a sequence of handovers on the same lock variable.
>>>>
>>>> The latter relation is used only once, to provide the guarantee
>>>> defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
>>>> makes any UNLOCK+LOCK pair followed by the fence behave like a full
>>>> barrier.
>>>>
>>>> This patch drops this use in favor of using po-unlock-lock-po
>>>> everywhere, which unifies the way the model talks about UNLOCK+LOCK
>>>> pairings.  At first glance this seems to weaken the guarantee given
>>>> by LKMM: When considering a long sequence of lock handovers
>>>> such as below, where P0 hands the lock to P1, which hands it to P2,
>>>> which finally executes such an after_unlock_lock fence, the mb
>>>> relation currently links any stores in the critical section of P0
>>>> to instructions P2 executes after its fence, but not so after the
>>>> patch.
>>>>
>>>> P0(int *x, int *y, spinlock_t *mylock)
>>>> {
>>>>          spin_lock(mylock);
>>>>          WRITE_ONCE(*x, 2);
>>>>          spin_unlock(mylock);
>>>>          WRITE_ONCE(*y, 1);
>>>> }
>>>>
>>>> P1(int *y, int *z, spinlock_t *mylock)
>>>> {
>>>>          int r0 = READ_ONCE(*y); // reads 1
>>>>          spin_lock(mylock);
>>>>          spin_unlock(mylock);
>>>>          WRITE_ONCE(*z,1);
>>>> }
>>>>
>>>> P2(int *z, int *d, spinlock_t *mylock)
>>>> {
>>>>          int r1 = READ_ONCE(*z); // reads 1
>>>>          spin_lock(mylock);
>>>>          spin_unlock(mylock);
>>>>          smp_mb__after_unlock_lock();
>>>>          WRITE_ONCE(*d,1);
>>>> }
>>>>
>>>> P3(int *x, int *d)
>>>> {
>>>>          WRITE_ONCE(*d,2);
>>>>          smp_mb();
>>>>          WRITE_ONCE(*x,1);
>>>> }
>>>>
>>>> exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
>>>>
>>>> Nevertheless, the ordering guarantee given in rcupdate.h is actually
>>>> not weakened.  This is because the unlock operations along the
>>>> sequence of handovers are A-cumulative fences.  They ensure that any
>>>> stores that propagate to the CPU performing the first unlock
>>>> operation in the sequence must also propagate to every CPU that
>>>> performs a subsequent lock operation in the sequence.  Therefore any
>>>> such stores will also be ordered correctly by the fence even if only
>>>> the final handover is considered a full barrier.
>>>>
>>>> Indeed this patch does not affect the behaviors allowed by LKMM at
>>>> all.  The mb relation is used to define ordering through:
>>>> 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
>>>>     lock-release, rfe, and unlock-acquire orderings each provide hb
>>>> 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
>>>>     lock-release orderings simply add more fine-grained cumul-fence
>>>>     edges to substitute a single strong-fence edge provided by a long
>>>>     lock handover sequence
>>>> 3) mb/strong-fence/pb and various similar uses in the definition of
>>>>     data races, where as discussed above any long handover sequence
>>>>     can be turned into a sequence of cumul-fence edges that provide
>>>>     the same ordering.
>>>>
>>>> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
>>>> ---
>>> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
>> A quick spot check showed no change in performance, so thank you both!
>>
>> Queued for review and further testing.
> And testing on https://github.com/paulmckrcu/litmus for litmus tests up
> to ten processes and allowing 10 minutes per litmus test got this:
>
> Exact output matches: 5208
> !!! Timed out: 38
> !!! Unknown primitive: 7
>
> This test compared output with and without your patch.
>
> For the tests with a Results clause, these failed:

Gave me a heart attack there for a second!

> Also, I am going to be pushing the scripts I use to mainline.  They might
> not be perfect, but they will be quite useful for this sort of change
> to the memory model.

I could also provide Coq proofs, although those are ignoring the 
srcu/data race parts at the moment.

Have fun, jonas



* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-26 16:36   ` Alan Stern
@ 2023-01-27 14:31     ` Jonas Oberhauser
  2023-01-28 19:56       ` Alan Stern
  0 siblings, 1 reply; 39+ messages in thread
From: Jonas Oberhauser @ 2023-01-27 14:31 UTC (permalink / raw)
  To: Alan Stern
  Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel



On 1/26/2023 5:36 PM, Alan Stern wrote:
> On Thu, Jan 26, 2023 at 02:46:04PM +0100, Jonas Oberhauser wrote:
>> As stated in the documentation and implied by its name, the ppo
>> (preserved program order) relation is intended to link po-earlier
>> to po-later instructions under certain conditions.  However, a
>> corner case currently allows instructions to be linked by ppo that
>> are not executed by the same thread, i.e., instructions are being
>> linked that have no po relation.
>>
>> This happens due to the mb/strong-fence relations, which (as one
>> case) provide order when locks are passed between threads followed
>> by an smp_mb__after_unlock_lock() fence.  This is illustrated in
>> the following litmus test (as can be seen when using herd7 with
>> `doshow ppo`):
>>
>> P0(int *x, int *y)
>> {
>>      spin_lock(x);
>>      spin_unlock(x);
>> }
>>
>> P1(int *x, int *y)
>> {
>>      spin_lock(x);
>>      smp_mb__after_unlock_lock();
>>      *y = 1;
>> }
>>
>> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
>> P0 passes a lock to P1 which then uses this fence.
>>
>> The patch makes ppo a subrelation of po by eliminating this possibility
>> from mb (but not strong-fence) and relying explicitly on mb|gp instead
>> of strong-fence when defining ppo.
>>
>> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
>> ---
> This changes the meaning of the fence relation, which is used in
> w-pre-bounded, w-post-bounded, ww-vis, wr-vis, and rw-xbstar.  Have you
> checked that they won't be affected by the change?

Good point; in fact, it changes the nonrw-fence as well, which is used in
the r-* relations.
I had missed this completely. That's what I get for not looking at the 
data race relations!
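
For reference, a rough sketch of the relevant data-race relations
(quoted from memory, so the exact forms in linux-kernel.cat may differ):

	let w-post-bounded = fence? ; [Marked]
	let ww-vis = fence | (strong-fence ; xbstar ; w-pre-bounded) |
		(w-post-bounded ; vis ; w-pre-bounded)
	let wr-vis = fence | (strong-fence ; xbstar ; r-pre-bounded) |
		(w-post-bounded ; vis ; r-pre-bounded)
	let rw-xbstar = fence | (r-post-bounded ; xbstar ; w-pre-bounded)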


Let's go through the individual cases.
If we had e.g. *-post-bounded because of the 
po-unlock-lock-po;[After-unlock-lock];po edge, this may be either due to 
the fence rule or due to (...fence ; vis ; *-pre-bounded).
In the first case we have po-rel ; rfe ; acq-po and get 
fence;rfe;(xbstar&int);fence which gives us *-post-bounded.
In the second case we now have strong-fence, with vis <= xbstar we see 
that **-vis is preserved here by switching from the fence rule to the 
strong-fence;xbstar;... case.

For *-pre-bounded, the situation is tricky because of the xbstar&int
that can be at the end of vis, when *-pre-bounded is used to define 
w*-vis through (w-post-bounded;vis;*-pre-bounded). In this case, the 
xbstar&int can point in the opposite direction of po, which means that 
the unlock that creates the strong fence might not be po-after the 
instruction that starts the link.

Here's a litmus test illustrating the difference, where P1 has a 
backwards-pointing xbstar&int. Currently there's no data race, but with 
the proposed patch there is.

P0(int *x, int *y)
{
     *x = 1;
     smp_store_release(y, 1);
}

P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
{
     spin_lock(l);
     int r1 = READ_ONCE(*dy);
     if (r1==1)
         spin_unlock(l);

     int r0 = smp_load_acquire(y);
     if (r0 == 1) {
         WRITE_ONCE(*dx,1);
     }
}

P2(int *dx, int *dy)
{
     WRITE_ONCE(*dy,READ_ONCE(*dx));
}


P3(int *x, spinlock_t *l)
{
     spin_lock(l);
     smp_mb__after_unlock_lock();
     *x = 2;
}


This actually makes me wonder. I thought the reason for the xbstar & int 
is that it ensures that the overall relation, after shuffling around a 
little bit, becomes prop&int ; hb*.

Like in case the *x=2 is co-before the *x=1, we get
   Wx2 ->overwrite Wx1 ->cumul-fence*;rfe (some event on the same CPU as Wx2) ->fence Wx2
which is
   Wx2 -> prop&int some other event ->hb Wx2
which must be irreflexive.

However, that's not the case at all, because the fence relation 
currently doesn't actually have to relate events of the same CPU.
So we get
   Wx2 ->overwrite Wx1 ->cumul-fence*;rfe (some event on some other CPU than Wx2's) ->(hb*&int);fence Wx2
i.e.,
   Wx2 ->prop&ext;hb*;strong-fence Wx2

which shouldn't provide any ordering in general.

In fact, replacing the *x=1 and *x=2 with WRITE_ONCEs (pilot errors
notwithstanding), both Wx1 ->co Wx2 and Wx2 ->co Wx1 become allowed in
the current LKMM (in graphs where all other edges are equal).

Shouldn't this actually *be* a data race? And potentially the same with 
rcu-fence?

The other cases of *-pre-bounded seem to work out: they all link stuff 
via xbstar to the instruction that is linked via po-unlock-lock-po ; 
[After-unlock-lock] ; po to the potentially racy access, and 
po-unlock-lock-po;po   is xbstar ; acq-po, which allows closing the gap.

Best wishes, jonas



* Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po
  2023-01-27 13:18         ` Jonas Oberhauser
@ 2023-01-27 15:13           ` Paul E. McKenney
  2023-01-27 15:57             ` Jonas Oberhauser
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-27 15:13 UTC (permalink / raw)
  To: Jonas Oberhauser
  Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Fri, Jan 27, 2023 at 02:18:41PM +0100, Jonas Oberhauser wrote:
> On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
> > On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
> > > On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> > > > On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > > > > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > > > > 
> > > > > 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > > > > 	   on the same CPU or immediate lock handovers on the same
> > > > > 	   lock variable
> > > > > 
> > > > > 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > > > > 	   literally as described in rcupdate.h#L1002, i.e., even
> > > > > 	   after a sequence of handovers on the same lock variable.
> > > > > 
> > > > > The latter relation is used only once, to provide the guarantee
> > > > > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > > > > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > > > > barrier.
> > > > > 
> > > > > This patch drops this use in favor of using po-unlock-lock-po
> > > > > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > > > > pairings.  At first glance this seems to weaken the guarantee given
> > > > > by LKMM: When considering a long sequence of lock handovers
> > > > > such as below, where P0 hands the lock to P1, which hands it to P2,
> > > > > which finally executes such an after_unlock_lock fence, the mb
> > > > > relation currently links any stores in the critical section of P0
> > > > > to instructions P2 executes after its fence, but not so after the
> > > > > patch.
> > > > > 
> > > > > P0(int *x, int *y, spinlock_t *mylock)
> > > > > {
> > > > >          spin_lock(mylock);
> > > > >          WRITE_ONCE(*x, 2);
> > > > >          spin_unlock(mylock);
> > > > >          WRITE_ONCE(*y, 1);
> > > > > }
> > > > > 
> > > > > P1(int *y, int *z, spinlock_t *mylock)
> > > > > {
> > > > >          int r0 = READ_ONCE(*y); // reads 1
> > > > >          spin_lock(mylock);
> > > > >          spin_unlock(mylock);
> > > > >          WRITE_ONCE(*z,1);
> > > > > }
> > > > > 
> > > > > P2(int *z, int *d, spinlock_t *mylock)
> > > > > {
> > > > >          int r1 = READ_ONCE(*z); // reads 1
> > > > >          spin_lock(mylock);
> > > > >          spin_unlock(mylock);
> > > > >          smp_mb__after_unlock_lock();
> > > > >          WRITE_ONCE(*d,1);
> > > > > }
> > > > > 
> > > > > P3(int *x, int *d)
> > > > > {
> > > > >          WRITE_ONCE(*d,2);
> > > > >          smp_mb();
> > > > >          WRITE_ONCE(*x,1);
> > > > > }
> > > > > 
> > > > > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > > > > 
> > > > > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > > > > not weakened.  This is because the unlock operations along the
> > > > > sequence of handovers are A-cumulative fences.  They ensure that any
> > > > > stores that propagate to the CPU performing the first unlock
> > > > > operation in the sequence must also propagate to every CPU that
> > > > > performs a subsequent lock operation in the sequence.  Therefore any
> > > > > such stores will also be ordered correctly by the fence even if only
> > > > > the final handover is considered a full barrier.
> > > > > 
> > > > > Indeed this patch does not affect the behaviors allowed by LKMM at
> > > > > all.  The mb relation is used to define ordering through:
> > > > > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> > > > >     lock-release, rfe, and unlock-acquire orderings each provide hb
> > > > > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> > > > >     lock-release orderings simply add more fine-grained cumul-fence
> > > > >     edges to substitute a single strong-fence edge provided by a long
> > > > >     lock handover sequence
> > > > > 3) mb/strong-fence/pb and various similar uses in the definition of
> > > > >     data races, where as discussed above any long handover sequence
> > > > >     can be turned into a sequence of cumul-fence edges that provide
> > > > >     the same ordering.
> > > > > 
> > > > > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > > > > ---
> > > > Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
> > > A quick spot check showed no change in performance, so thank you both!
> > > 
> > > Queued for review and further testing.
> > And testing on https://github.com/paulmckrcu/litmus for litmus tests up
> > to ten processes and allowing 10 minutes per litmus test got this:
> > 
> > Exact output matches: 5208
> > !!! Timed out: 38
> > !!! Unknown primitive: 7
> > 
> > This test compared output with and without your patch.
> > 
> > For the tests with a Results clause, these failed:
> 
> Gave me a heart attack there for a second!

Sorry for the scare!!!

> > Also, I am going to be pushing the scripts I use to mainline.  They might
> > not be perfect, but they will be quite useful for this sort of change
> > to the memory model.
> 
> I could also provide Coq proofs, although those are ignoring the srcu/data
> race parts at the moment.

Can such proofs serve as regression tests for future changes?

							Thanx, Paul


* Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po
  2023-01-27 15:13           ` Paul E. McKenney
@ 2023-01-27 15:57             ` Jonas Oberhauser
  2023-01-27 16:48               ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Jonas Oberhauser @ 2023-01-27 15:57 UTC (permalink / raw)
  To: paulmck
  Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel



On 1/27/2023 4:13 PM, Paul E. McKenney wrote:
> On Fri, Jan 27, 2023 at 02:18:41PM +0100, Jonas Oberhauser wrote:
>> On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
>>> On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
>>>> On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
>>>>> On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
>>>>>> LKMM uses two relations for talking about UNLOCK+LOCK pairings:
>>>>>>
>>>>>> 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
>>>>>> 	   on the same CPU or immediate lock handovers on the same
>>>>>> 	   lock variable
>>>>>>
>>>>>> 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
>>>>>> 	   literally as described in rcupdate.h#L1002, i.e., even
>>>>>> 	   after a sequence of handovers on the same lock variable.
>>>>>>
>>>>>> The latter relation is used only once, to provide the guarantee
>>>>>> defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
>>>>>> makes any UNLOCK+LOCK pair followed by the fence behave like a full
>>>>>> barrier.
>>>>>>
>>>>>> This patch drops this use in favor of using po-unlock-lock-po
>>>>>> everywhere, which unifies the way the model talks about UNLOCK+LOCK
>>>>>> pairings.  At first glance this seems to weaken the guarantee given
>>>>>> by LKMM: When considering a long sequence of lock handovers
>>>>>> such as below, where P0 hands the lock to P1, which hands it to P2,
>>>>>> which finally executes such an after_unlock_lock fence, the mb
>>>>>> relation currently links any stores in the critical section of P0
>>>>>> to instructions P2 executes after its fence, but not so after the
>>>>>> patch.
>>>>>>
>>>>>> P0(int *x, int *y, spinlock_t *mylock)
>>>>>> {
>>>>>>           spin_lock(mylock);
>>>>>>           WRITE_ONCE(*x, 2);
>>>>>>           spin_unlock(mylock);
>>>>>>           WRITE_ONCE(*y, 1);
>>>>>> }
>>>>>>
>>>>>> P1(int *y, int *z, spinlock_t *mylock)
>>>>>> {
>>>>>>           int r0 = READ_ONCE(*y); // reads 1
>>>>>>           spin_lock(mylock);
>>>>>>           spin_unlock(mylock);
>>>>>>           WRITE_ONCE(*z,1);
>>>>>> }
>>>>>>
>>>>>> P2(int *z, int *d, spinlock_t *mylock)
>>>>>> {
>>>>>>           int r1 = READ_ONCE(*z); // reads 1
>>>>>>           spin_lock(mylock);
>>>>>>           spin_unlock(mylock);
>>>>>>           smp_mb__after_unlock_lock();
>>>>>>           WRITE_ONCE(*d,1);
>>>>>> }
>>>>>>
>>>>>> P3(int *x, int *d)
>>>>>> {
>>>>>>           WRITE_ONCE(*d,2);
>>>>>>           smp_mb();
>>>>>>           WRITE_ONCE(*x,1);
>>>>>> }
>>>>>>
>>>>>> exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
>>>>>>
>>>>>> Nevertheless, the ordering guarantee given in rcupdate.h is actually
>>>>>> not weakened.  This is because the unlock operations along the
>>>>>> sequence of handovers are A-cumulative fences.  They ensure that any
>>>>>> stores that propagate to the CPU performing the first unlock
>>>>>> operation in the sequence must also propagate to every CPU that
>>>>>> performs a subsequent lock operation in the sequence.  Therefore any
>>>>>> such stores will also be ordered correctly by the fence even if only
>>>>>> the final handover is considered a full barrier.
>>>>>>
>>>>>> Indeed this patch does not affect the behaviors allowed by LKMM at
>>>>>> all.  The mb relation is used to define ordering through:
>>>>>> 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
>>>>>>      lock-release, rfe, and unlock-acquire orderings each provide hb
>>>>>> 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
>>>>>>      lock-release orderings simply add more fine-grained cumul-fence
>>>>>>      edges to substitute a single strong-fence edge provided by a long
>>>>>>      lock handover sequence
>>>>>> 3) mb/strong-fence/pb and various similar uses in the definition of
>>>>>>      data races, where as discussed above any long handover sequence
>>>>>>      can be turned into a sequence of cumul-fence edges that provide
>>>>>>      the same ordering.
>>>>>>
>>>>>> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
>>>>>> ---
>>>>> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
>>>> A quick spot check showed no change in performance, so thank you both!
>>>>
>>>> Queued for review and further testing.
>>> And testing on https://github.com/paulmckrcu/litmus for litmus tests up
>>> to ten processes and allowing 10 minutes per litmus test got this:
>>>
>>> Exact output matches: 5208
>>> !!! Timed out: 38
>>> !!! Unknown primitive: 7
>>>
>>> This test compared output with and without your patch.
>>>
>>> For the tests with a Results clause, these failed:
>> Gave me a heart attack there for a second!
> Sorry for the scare!!!
>
>>> Also, I am going to be pushing the scripts I use to mainline.  They might
>>> not be perfect, but they will be quite useful for this sort of change
>>> to the memory model.
>> I could also provide Coq proofs, although those are ignoring the srcu/data
>> race parts at the moment.
> Can such proofs serve as regression tests for future changes?
>
> 							Thanx, Paul

So-so. On the upside, it would be easy to make them raise an alarm if 
the future change breaks stuff.
On the downside, they will often need maintenance together with any 
change. Sometimes a lot, sometimes very little.
I think for the proofs that show the equivalence between two models, the 
maintenance is quite a bit higher because every change needs to be 
reflected in both versions. So if you do 10 equivalent transformations 
and want to show that they remain equivalent with any future changes you 
do, you need to keep at least 10 additional models around ("current LKMM 
where ppo isn't in po, current LKMM where the unlock fence still relies 
on co, ...").

Right now, each equivalence proof I did (e.g., for using 
po-unlock-lock-po here, or the ppo<=po patch I originally proposed) is 
on average in the ballpark of 500 lines of proof script. And as
evidenced by my discussion with Alan, these proofs only cover the "core 
model".

So for this kind of thing, I think it's better to look at them to have 
more confidence in the patch, and do the patch more based on which model 
is more reasonable (as Alan enforces). Then consider the simplified 
version as the more natural one, and not worry about future changes that 
break the equivalence (that would usually indicate a problem with the 
old model, rather than a problem with the patch).

For regressions, I would rather consider some desirable properties of 
LKMM, like "DRF-SC" or "monotonicity of barriers" or "ppo <= po" and try 
to prove those. This has the upside of not requiring to carry additional 
models around, so much less than half the maintenance effort, and if the 
property should be broken this usually would indicate a problem with the 
patch. So I think the bang for the buck is much higher there.

Those are my current thoughts anyways : )

have fun, jonas



* Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po
  2023-01-27 15:57             ` Jonas Oberhauser
@ 2023-01-27 16:48               ` Paul E. McKenney
  0 siblings, 0 replies; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-27 16:48 UTC (permalink / raw)
  To: Jonas Oberhauser
  Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Fri, Jan 27, 2023 at 04:57:43PM +0100, Jonas Oberhauser wrote:
> 
> 
> On 1/27/2023 4:13 PM, Paul E. McKenney wrote:
> > On Fri, Jan 27, 2023 at 02:18:41PM +0100, Jonas Oberhauser wrote:
> > > On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
> > > > On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
> > > > > On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> > > > > > On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > > > > > > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > > > > > > 
> > > > > > > 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > > > > > > 	   on the same CPU or immediate lock handovers on the same
> > > > > > > 	   lock variable
> > > > > > > 
> > > > > > > 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > > > > > > 	   literally as described in rcupdate.h#L1002, i.e., even
> > > > > > > 	   after a sequence of handovers on the same lock variable.
> > > > > > > 
> > > > > > > The latter relation is used only once, to provide the guarantee
> > > > > > > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > > > > > > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > > > > > > barrier.
> > > > > > > 
> > > > > > > This patch drops this use in favor of using po-unlock-lock-po
> > > > > > > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > > > > > > pairings.  At first glance this seems to weaken the guarantee given
> > > > > > > by LKMM: When considering a long sequence of lock handovers
> > > > > > > such as below, where P0 hands the lock to P1, which hands it to P2,
> > > > > > > which finally executes such an after_unlock_lock fence, the mb
> > > > > > > relation currently links any stores in the critical section of P0
> > > > > > > to instructions P2 executes after its fence, but not so after the
> > > > > > > patch.
> > > > > > > 
> > > > > > > P0(int *x, int *y, spinlock_t *mylock)
> > > > > > > {
> > > > > > >           spin_lock(mylock);
> > > > > > >           WRITE_ONCE(*x, 2);
> > > > > > >           spin_unlock(mylock);
> > > > > > >           WRITE_ONCE(*y, 1);
> > > > > > > }
> > > > > > > 
> > > > > > > P1(int *y, int *z, spinlock_t *mylock)
> > > > > > > {
> > > > > > >           int r0 = READ_ONCE(*y); // reads 1
> > > > > > >           spin_lock(mylock);
> > > > > > >           spin_unlock(mylock);
> > > > > > >           WRITE_ONCE(*z,1);
> > > > > > > }
> > > > > > > 
> > > > > > > P2(int *z, int *d, spinlock_t *mylock)
> > > > > > > {
> > > > > > >           int r1 = READ_ONCE(*z); // reads 1
> > > > > > >           spin_lock(mylock);
> > > > > > >           spin_unlock(mylock);
> > > > > > >           smp_mb__after_unlock_lock();
> > > > > > >           WRITE_ONCE(*d,1);
> > > > > > > }
> > > > > > > 
> > > > > > > P3(int *x, int *d)
> > > > > > > {
> > > > > > >           WRITE_ONCE(*d,2);
> > > > > > >           smp_mb();
> > > > > > >           WRITE_ONCE(*x,1);
> > > > > > > }
> > > > > > > 
> > > > > > > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > > > > > > 
> > > > > > > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > > > > > > not weakened.  This is because the unlock operations along the
> > > > > > > sequence of handovers are A-cumulative fences.  They ensure that any
> > > > > > > stores that propagate to the CPU performing the first unlock
> > > > > > > operation in the sequence must also propagate to every CPU that
> > > > > > > performs a subsequent lock operation in the sequence.  Therefore any
> > > > > > > such stores will also be ordered correctly by the fence even if only
> > > > > > > the final handover is considered a full barrier.
> > > > > > > 
> > > > > > > Indeed this patch does not affect the behaviors allowed by LKMM at
> > > > > > > all.  The mb relation is used to define ordering through:
> > > > > > > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> > > > > > >      lock-release, rfe, and unlock-acquire orderings each provide hb
> > > > > > > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> > > > > > >      lock-release orderings simply add more fine-grained cumul-fence
> > > > > > >      edges to substitute a single strong-fence edge provided by a long
> > > > > > >      lock handover sequence
> > > > > > > 3) mb/strong-fence/pb and various similar uses in the definition of
> > > > > > >      data races, where as discussed above any long handover sequence
> > > > > > >      can be turned into a sequence of cumul-fence edges that provide
> > > > > > >      the same ordering.
> > > > > > > 
> > > > > > > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > > > > > > ---
> > > > > > Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
> > > > > A quick spot check showed no change in performance, so thank you both!
> > > > > 
> > > > > Queued for review and further testing.
> > > > And testing on https://github.com/paulmckrcu/litmus for litmus tests up
> > > > to ten processes and allowing 10 minutes per litmus test got this:
> > > > 
> > > > Exact output matches: 5208
> > > > !!! Timed out: 38
> > > > !!! Unknown primitive: 7
> > > > 
> > > > This test compared output with and without your patch.
> > > > 
> > > > For the tests with a Results clause, these failed:
> > > Gave me a heart attack there for a second!
> > Sorry for the scare!!!
> > 
> > > > Also, I am going to be pushing the scripts I use to mainline.  They might
> > > > not be perfect, but they will be quite useful for this sort of change
> > > > to the memory model.
> > > I could also provide Coq proofs, although those are ignoring the srcu/data
> > > race parts at the moment.
> > Can such proofs serve as regression tests for future changes?
> > 
> > 							Thanx, Paul
> 
> So-so. On the upside, it would be easy to make them raise an alarm if the
> future change breaks stuff.
> On the downside, they will often need maintenance together with any change.
> Sometimes a lot, sometimes very little.
> I think for the proofs that show the equivalence between two models, the
> maintenance is quite a bit higher because every change needs to be reflected
> in both versions. So if you do 10 equivalent transformations and want to
> show that they remain equivalent with any future changes you do, you need to
> keep at least 10 additional models around ("current LKMM where ppo isn't in
> po, current LKMM where the unlock fence still relies on co, ...").
> 
> Right now, each equivalence proof I did (e.g., for using po-unlock-lock-po
> here, or the ppo<=po patch I originally proposed) is on average in the
> ballpark of 500 lines of proof script. And as evidenced by my discussion
> with Alan, these proofs only cover the "core model".
> 
> So for this kind of thing, I think it's better to look at them to have more
> confidence in the patch, and do the patch more based on which model is more
> reasonable (as Alan enforces). Then consider the simplified version as the
> more natural one, and not worry about future changes that break the
> equivalence (that would usually indicate a problem with the old model,
> rather than a problem with the patch).
> 
> For regressions, I would rather consider some desirable properties of LKMM,
> like "DRF-SC" or "monotonicity of barriers" or "ppo <= po" and try to prove
> those. This has the upside of not requiring us to carry additional models
> around, so much less than half the maintenance effort, and if the property
> should be broken this usually would indicate a problem with the patch. So I
> think the bang for the buck is much higher there.
> 
> Those are my current thoughts anyways : )

That matches my experience, for whatever that is worth.  (I first
used Promela/spin in the early 1990s, which proved to be an excellent
cautionary tale.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-27 14:31     ` Jonas Oberhauser
@ 2023-01-28 19:56       ` Alan Stern
  2023-01-28 22:14         ` Andrea Parri
  0 siblings, 1 reply; 39+ messages in thread
From: Alan Stern @ 2023-01-28 19:56 UTC (permalink / raw)
  To: Jonas Oberhauser
  Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Fri, Jan 27, 2023 at 03:31:25PM +0100, Jonas Oberhauser wrote:
> Here's a litmus test illustrating the difference, where P1 has a
> backwards-pointing xbstar&int. Currently there's no data race, but with the
> proposed patch there is.
> 
> P0(int *x, int *y)
> {
>     *x = 1;
>     smp_store_release(y, 1);
> }
> 
> P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> {
>     spin_lock(l);
>     int r1 = READ_ONCE(*dy);
>     if (r1==1)
>         spin_unlock(l);
> 
>     int r0 = smp_load_acquire(y);
>     if (r0 == 1) {
>         WRITE_ONCE(*dx,1);
>     }
> }
> 
> P2(int *dx, int *dy)
> {
>     WRITE_ONCE(*dy,READ_ONCE(*dx));
> }
> 
> 
> P3(int *x, spinlock_t *l)
> {
>     spin_lock(l);
>     smp_mb__after_unlock_lock();
>     *x = 2;
> }

I don't understand why the current LKMM doesn't say there is a data 
race.  In fact, I don't understand what herd7 is doing with this litmus 
test at all.  Evidently the plain-coherence check rules out x=1 at the 
end, because when I relax that check, x=1 becomes a possible result.  
Furthermore, the graphical output confirms that this execution has a 
ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
to Wx=2!  How can this be possible?  It seems like a bug in herd7.

Furthermore, the execution with x=2 at the end doesn't have either a 
ww-vis or a ww-nonrace edge between Wx=1 and Wx=2.  So why isn't there a 
ww-race edge?

> This actually makes me wonder. I thought the reason for the xbstar & int is
> that it ensures that the overall relation, after shuffling around a little
> bit, becomes prop&int ; hb*.

No, that is not the reason for it.  See below.

> Like in case the *x=2 is co-before the *x=1, we get
>   Wx2 ->overwrite Wx1 ->cumul-fence*;rfe  (some event on the same CPU as
> Wx2)  ->fence Wx2
> which is
>   Wx2 -> prop&int some other event ->hb Wx2
> which must be irreflexive.
> 
> However, that's not the case at all, because the fence relation currently
> doesn't actually have to relate events of the same CPU.
> So we get
>   Wx2 ->overwrite Wx1 ->cumul-fence*;rfe  (some event on some other CPU than
> Wx2's) ->(hb*&int);fence Wx2
> i.e.,
>   Wx2 ->prop&ext;hb*;strong-fence Wx2
> 
> which shouldn't provide any ordering in general.
> 
> In fact, replacing the *x=1 and *x=2 with WRITE_ONCEs, (pilot errors
> notwithstanding) both Wx1 ->co Wx2 and Wx2 ->co Wx1 become allowed in the
> current LKMM (in graphs where all other edges are equal).
> 
> Shouldn't this actually *be* a data race? And potentially the same with
> rcu-fence?

I think that herd7 _should_ say there is a data race.

On the other hand, I also think that the operational model says there 
isn't.  This is a case where the formal model fails to match the 
operational model.

The operational model says that if W is a release-store on CPU C and W' 
is another store which propagates to C before W executes, then W' 
propagates to every CPU before W does.  (In other words, releases are 
A-cumulative).  But the formal model enforces this rule only in cases 
where the event reading W' on C is po-before W.

In your litmus test, the event reading y=1 on P1 is po-after the 
spin_unlock (which is a release).  Nevertheless, any feasible execution 
requires that P1 must execute Ry=1 before the unlock.  So the 
operational model says that y=1 must propagate to P3 before the unlock 
does, i.e., before P3 executes the spin_lock().  But the formal model 
doesn't require this.

Ideally we would fix this by changing the definition of po-rel to:

	[M] ; (xbstar & int) ; [Release]

(This is closely related to the use of (xbstar & int) in the definition 
of vis that you asked about.)  Unfortunately we can't do this, because 
po-rel has to be defined long before xbstar.

> The other cases of *-pre-bounded seem to work out: they all link stuff via
> xbstar to the instruction that is linked via po-unlock-lock-po ;
> [After-unlock-lock] ; po to the potentially racy access, and
> po-unlock-lock-po;po   is xbstar ; acq-po, which allows closing the gap.

I could not follow your arguments at all; the writing was too confusing.

Alan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-28 19:56       ` Alan Stern
@ 2023-01-28 22:14         ` Andrea Parri
  2023-01-28 22:21           ` Andrea Parri
  2023-01-28 22:59           ` Alan Stern
  0 siblings, 2 replies; 39+ messages in thread
From: Andrea Parri @ 2023-01-28 22:14 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jonas Oberhauser, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

> Evidently the plain-coherence check rules out x=1 at the 
> end, because when I relax that check, x=1 becomes a possible result.  
> Furthermore, the graphical output confirms that this execution has a 
> ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> to Wx=2!  How can this be possible?  It seems like a bug in herd7.

By default, herd7 performs some edges removal when generating the
graphical outputs.  The option -showraw can be useful to increase
the "verbosity", for example,

  [with "exists (x=2)", output in /tmp/T.dot]
  $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis


> Furthermore, the execution with x=2 at the end doesn't have either a 
> ww-vis or a ww-nonrace edge between Wx=1 and Wx=2.  So why isn't there a 
> ww-race edge?

And similarly

  [with "exists (x=2)"]
  $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -doshow ww-vis,ww-nonrace -showraw ww-vis,ww-nonrace

  Andrea

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-28 22:14         ` Andrea Parri
@ 2023-01-28 22:21           ` Andrea Parri
  2023-01-28 22:59           ` Alan Stern
  1 sibling, 0 replies; 39+ messages in thread
From: Andrea Parri @ 2023-01-28 22:21 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jonas Oberhauser, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > Evidently the plain-coherence check rules out x=1 at the 
> > end, because when I relax that check, x=1 becomes a possible result.  
> > Furthermore, the graphical output confirms that this execution has a 
> > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> 
> By default, herd7 performs some edges removal when generating the
> graphical outputs.  The option -showraw can be useful to increase
> the "verbosity", for example,
> 
>   [with "exists (x=2)", output in /tmp/T.dot]

This was meant to be "exists (x=1)".

  Andrea


>   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> 
> 
> > Furthermore, the execution with x=2 at the end doesn't have either a 
> > ww-vis or a ww-nonrace edge between Wx=1 and Wx=2.  So why isn't there a 
> > ww-race edge?
> 
> And similarly
> 
>   [with "exists (x=2)"]
>   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -doshow ww-vis,ww-nonrace -showraw ww-vis,ww-nonrace
> 
>   Andrea

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-28 22:14         ` Andrea Parri
  2023-01-28 22:21           ` Andrea Parri
@ 2023-01-28 22:59           ` Alan Stern
  2023-01-29  5:17             ` Paul E. McKenney
                               ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Alan Stern @ 2023-01-28 22:59 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Jonas Oberhauser, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > Evidently the plain-coherence check rules out x=1 at the 
> > end, because when I relax that check, x=1 becomes a possible result.  
> > Furthermore, the graphical output confirms that this execution has a 
> > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> 
> By default, herd7 performs some edges removal when generating the
> graphical outputs.  The option -showraw can be useful to increase
> the "verbosity", for example,
> 
>   [with "exists (x=2)", output in /tmp/T.dot]
>   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis

Okay, thanks, that helps a lot.

So here's what we've got.  The litmus test:


C hb-and-int
{}

P0(int *x, int *y)
{
    *x = 1;
    smp_store_release(y, 1);
}

P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
{
    spin_lock(l);
    int r1 = READ_ONCE(*dy);
    if (r1==1)
        spin_unlock(l);

    int r0 = smp_load_acquire(y);
    if (r0 == 1) {
        WRITE_ONCE(*dx,1);
    }
}

P2(int *dx, int *dy)
{
    WRITE_ONCE(*dy,READ_ONCE(*dx));
}


P3(int *x, spinlock_t *l)
{
    spin_lock(l);
    smp_mb__after_unlock_lock();
    *x = 2;
}

exists (x=2)


The reason why Wx=1 ->ww-vis Wx=2:

	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.

	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.

	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.

Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.

This explains why the memory model says there isn't a data race.  This 
doesn't use the smp_mb__after_unlock_lock at all.

Alan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-28 22:59           ` Alan Stern
@ 2023-01-29  5:17             ` Paul E. McKenney
  2023-01-29 16:03               ` Alan Stern
  2023-01-29 17:11             ` Andrea Parri
  2023-01-29 22:19             ` Jonas Oberhauser
  2 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-29  5:17 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andrea Parri, Jonas Oberhauser, will, peterz, boqun.feng,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > Evidently the plain-coherence check rules out x=1 at the 
> > > end, because when I relax that check, x=1 becomes a possible result.  
> > > Furthermore, the graphical output confirms that this execution has a 
> > > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> > 
> > By default, herd7 performs some edges removal when generating the
> > graphical outputs.  The option -showraw can be useful to increase
> > the "verbosity", for example,
> > 
> >   [with "exists (x=2)", output in /tmp/T.dot]
> >   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> 
> Okay, thanks, that helps a lot.
> 
> So here's what we've got.  The litmus test:
> 
> 
> C hb-and-int
> {}
> 
> P0(int *x, int *y)
> {
>     *x = 1;
>     smp_store_release(y, 1);
> }
> 
> P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> {
>     spin_lock(l);
>     int r1 = READ_ONCE(*dy);
>     if (r1==1)
>         spin_unlock(l);
> 
>     int r0 = smp_load_acquire(y);
>     if (r0 == 1) {
>         WRITE_ONCE(*dx,1);
>     }

The lack of a spin_unlock() when r1!=1 is intentional?

It is admittedly a cute way to prevent P3 from doing anything
when r1!=1.  And P1 won't do anything if P3 runs first.

> }
> 
> P2(int *dx, int *dy)
> {
>     WRITE_ONCE(*dy,READ_ONCE(*dx));
> }
> 
> 
> P3(int *x, spinlock_t *l)
> {
>     spin_lock(l);
>     smp_mb__after_unlock_lock();
>     *x = 2;
> }
> 
> exists (x=2)
> 
> 
> The reason why Wx=1 ->ww-vis Wx=2:
> 
> 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> 
> 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> 
> 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> 
> Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> 
> This explains why the memory model says there isn't a data race.  This 
> doesn't use the smp_mb__after_unlock_lock at all.

You lost me on this one.

Suppose that P3 starts first, then P0.  P1 is then stuck at the
spin_lock() because P3 does not release that lock.  P2 goes out for a
pizza.

Why can't the two stores to x by P0 and P3 conflict, resulting in a
data race?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29  5:17             ` Paul E. McKenney
@ 2023-01-29 16:03               ` Alan Stern
  2023-01-29 16:21                 ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Alan Stern @ 2023-01-29 16:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrea Parri, Jonas Oberhauser, will, peterz, boqun.feng,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote:
> On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > > Evidently the plain-coherence check rules out x=1 at the 
> > > > end, because when I relax that check, x=1 becomes a possible result.  
> > > > Furthermore, the graphical output confirms that this execution has a 
> > > > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > > > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> > > 
> > > By default, herd7 performs some edges removal when generating the
> > > graphical outputs.  The option -showraw can be useful to increase
> > > the "verbosity", for example,
> > > 
> > >   [with "exists (x=2)", output in /tmp/T.dot]
> > >   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> > 
> > Okay, thanks, that helps a lot.
> > 
> > So here's what we've got.  The litmus test:
> > 
> > 
> > C hb-and-int
> > {}
> > 
> > P0(int *x, int *y)
> > {
> >     *x = 1;
> >     smp_store_release(y, 1);
> > }
> > 
> > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> > {
> >     spin_lock(l);
> >     int r1 = READ_ONCE(*dy);
> >     if (r1==1)
> >         spin_unlock(l);
> > 
> >     int r0 = smp_load_acquire(y);
> >     if (r0 == 1) {
> >         WRITE_ONCE(*dx,1);
> >     }
> 
> The lack of a spin_unlock() when r1!=1 is intentional?

I assume so.

> It is admittedly a cute way to prevent P3 from doing anything
> when r1!=1.  And P1 won't do anything if P3 runs first.

Right.

> > }
> > 
> > P2(int *dx, int *dy)
> > {
> >     WRITE_ONCE(*dy,READ_ONCE(*dx));
> > }
> > 
> > 
> > P3(int *x, spinlock_t *l)
> > {
> >     spin_lock(l);
> >     smp_mb__after_unlock_lock();
> >     *x = 2;
> > }
> > 
> > exists (x=2)
> > 
> > 
> > The reason why Wx=1 ->ww-vis Wx=2:
> > 
> > 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> > 
> > 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> > 
> > 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> > 
> > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> > 
> > This explains why the memory model says there isn't a data race.  This 
> > doesn't use the smp_mb__after_unlock_lock at all.
> 
> You lost me on this one.
> 
> Suppose that P3 starts first, then P0.  P1 is then stuck at the
> spin_lock() because P3 does not release that lock.  P2 goes out for a
> pizza.

That wouldn't be a valid execution.  One of the rules in lock.cat says 
that a spin_lock() call must read from a spin_unlock() or from an 
initial write, which rules out executions in which P3 acquires the lock 
first.

> Why can't the two stores to x by P0 and P3 conflict, resulting in a
> data race?

That can't happen in executions where P1 acquires the lock first for the 
reason outlined above (P0's store to x propagates to P3 before P3 writes 
to x).  And there are no other executions -- basically, herd7 ignores 
deadlock scenarios.

Alan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29 16:03               ` Alan Stern
@ 2023-01-29 16:21                 ` Paul E. McKenney
  2023-01-29 17:28                   ` Andrea Parri
  2023-01-29 19:17                   ` Paul E. McKenney
  0 siblings, 2 replies; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-29 16:21 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andrea Parri, Jonas Oberhauser, will, peterz, boqun.feng,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 11:03:46AM -0500, Alan Stern wrote:
> On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote:
> > On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> > > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > > > Evidently the plain-coherence check rules out x=1 at the 
> > > > > end, because when I relax that check, x=1 becomes a possible result.  
> > > > > Furthermore, the graphical output confirms that this execution has a 
> > > > > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > > > > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> > > > 
> > > > By default, herd7 performs some edges removal when generating the
> > > > graphical outputs.  The option -showraw can be useful to increase
> > > > the "verbosity", for example,
> > > > 
> > > >   [with "exists (x=2)", output in /tmp/T.dot]
> > > >   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> > > 
> > > Okay, thanks, that helps a lot.
> > > 
> > > So here's what we've got.  The litmus test:
> > > 
> > > 
> > > C hb-and-int
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > >     *x = 1;
> > >     smp_store_release(y, 1);
> > > }
> > > 
> > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> > > {
> > >     spin_lock(l);
> > >     int r1 = READ_ONCE(*dy);
> > >     if (r1==1)
> > >         spin_unlock(l);
> > > 
> > >     int r0 = smp_load_acquire(y);
> > >     if (r0 == 1) {
> > >         WRITE_ONCE(*dx,1);
> > >     }
> > 
> > The lack of a spin_unlock() when r1!=1 is intentional?
> 
> I assume so.
> 
> > It is admittedly a cute way to prevent P3 from doing anything
> > when r1!=1.  And P1 won't do anything if P3 runs first.
> 
> Right.
> 
> > > }
> > > 
> > > P2(int *dx, int *dy)
> > > {
> > >     WRITE_ONCE(*dy,READ_ONCE(*dx));
> > > }
> > > 
> > > 
> > > P3(int *x, spinlock_t *l)
> > > {
> > >     spin_lock(l);
> > >     smp_mb__after_unlock_lock();
> > >     *x = 2;
> > > }
> > > 
> > > exists (x=2)
> > > 
> > > 
> > > The reason why Wx=1 ->ww-vis Wx=2:
> > > 
> > > 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> > > 
> > > 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > > 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> > > 
> > > 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > > 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > > 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> > > 
> > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> > > 
> > > This explains why the memory model says there isn't a data race.  This 
> > > doesn't use the smp_mb__after_unlock_lock at all.
> > 
> > You lost me on this one.
> > 
> > Suppose that P3 starts first, then P0.  P1 is then stuck at the
> > spin_lock() because P3 does not release that lock.  P2 goes out for a
> > pizza.
> 
> That wouldn't be a valid execution.  One of the rules in lock.cat says 
> that a spin_lock() call must read from a spin_unlock() or from an 
> initial write, which rules out executions in which P3 acquires the lock 
> first.

OK, I will bite...

Why can't P3's spin_lock() read from that initial write?

> > Why can't the two stores to x by P0 and P3 conflict, resulting in a
> > data race?
> 
> That can't happen in executions where P1 acquires the lock first for the 
> reason outlined above (P0's store to x propagates to P3 before P3 writes 
> to x).  And there are no other executions -- basically, herd7 ignores 
> deadlock scenarios.

True enough, if P1 gets there first, then P3 never stores to x.

What I don't understand is why P1 must always get there first.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-28 22:59           ` Alan Stern
  2023-01-29  5:17             ` Paul E. McKenney
@ 2023-01-29 17:11             ` Andrea Parri
  2023-01-29 22:10               ` Alan Stern
  2023-01-29 22:19             ` Jonas Oberhauser
  2 siblings, 1 reply; 39+ messages in thread
From: Andrea Parri @ 2023-01-29 17:11 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jonas Oberhauser, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

> The reason why Wx=1 ->ww-vis Wx=2:
> 
> 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> 
> 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> 
> 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> 
> Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.

To clarify, po-unlock-lock-po is not a subrelation of mb; see what
happens without the smp_mb__after_unlock_lock().

  Andrea

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29 16:21                 ` Paul E. McKenney
@ 2023-01-29 17:28                   ` Andrea Parri
  2023-01-29 18:44                     ` Paul E. McKenney
  2023-01-29 19:17                   ` Paul E. McKenney
  1 sibling, 1 reply; 39+ messages in thread
From: Andrea Parri @ 2023-01-29 17:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alan Stern, Jonas Oberhauser, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

> Why can't P3's spin_lock() read from that initial write?

Mmh, sounds like you want to play with something like below?

  Andrea

diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
index 6b52f365d73ac..20c3af4511255 100644
--- a/tools/memory-model/lock.cat
+++ b/tools/memory-model/lock.cat
@@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
 
 (* Allow up to one unmatched LKW per location; more must deadlock *)
 let UNMATCHED-LKW = LKW \ domain(critical)
-empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
 
 (* rfi for LF events: link each LKW to the LF events in its critical section *)
 let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
@@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
 let rf = rf | rf-lf | rf-ru
 
 (* Generate all co relations, including LKW events but not UL *)
-let co0 = co0 | ([IW] ; loc ; [LKW]) |
-	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
+let co0 = co0 | ([IW] ; loc ; [LKW])
 include "cos-opt.cat"
 let W = W | UL
 let M = R | W


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29 17:28                   ` Andrea Parri
@ 2023-01-29 18:44                     ` Paul E. McKenney
  2023-01-29 21:43                       ` Boqun Feng
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-29 18:44 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Alan Stern, Jonas Oberhauser, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote:
> > Why can't P3's spin_lock() read from that initial write?
> 
> Mmh, sounds like you want to play with something like below?
> 
>   Andrea
> 
> diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
> index 6b52f365d73ac..20c3af4511255 100644
> --- a/tools/memory-model/lock.cat
> +++ b/tools/memory-model/lock.cat
> @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
>  
>  (* Allow up to one unmatched LKW per location; more must deadlock *)
>  let UNMATCHED-LKW = LKW \ domain(critical)
> -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
>  
>  (* rfi for LF events: link each LKW to the LF events in its critical section *)
>  let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
> @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
>  let rf = rf | rf-lf | rf-ru
>  
>  (* Generate all co relations, including LKW events but not UL *)
> -let co0 = co0 | ([IW] ; loc ; [LKW]) |
> -	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> +let co0 = co0 | ([IW] ; loc ; [LKW])
>  include "cos-opt.cat"
>  let W = W | UL
>  let M = R | W

No idea.  But the following litmus test gets no executions whatsoever,
so point taken about my missing at least one corner case.  ;-)

Adding a spin_unlock() to the end of either process allows both to
run.

One could argue that this is a bug, but one could equally well argue
that if you have a deadlock, you have a deadlock.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

C lock

{
}


P0(int *a, int *b, spinlock_t *l)
{
	spin_lock(l);
	WRITE_ONCE(*a, 1);
}

P1(int *a, int *b, spinlock_t *l)
{
	spin_lock(l);
	WRITE_ONCE(*b, 1);
}

exists (a=1 /\ b=1)
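
For reference, here is one instance of the variant mentioned above, with
a spin_unlock() added at the end of P0 (the test name and the choice of
P0 rather than P1 are arbitrary, just for illustration).  With that
change herd7 produces executions again:

C lock-unlock

{
}

P0(int *a, int *b, spinlock_t *l)
{
	spin_lock(l);
	WRITE_ONCE(*a, 1);
	spin_unlock(l);
}

P1(int *a, int *b, spinlock_t *l)
{
	spin_lock(l);
	WRITE_ONCE(*b, 1);
}

exists (a=1 /\ b=1)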

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29 16:21                 ` Paul E. McKenney
  2023-01-29 17:28                   ` Andrea Parri
@ 2023-01-29 19:17                   ` Paul E. McKenney
  1 sibling, 0 replies; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-29 19:17 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andrea Parri, Jonas Oberhauser, will, peterz, boqun.feng,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 08:21:56AM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 11:03:46AM -0500, Alan Stern wrote:
> > On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote:
> > > On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> > > > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > > > > Evidently the plain-coherence check rules out x=1 at the 
> > > > > > end, because when I relax that check, x=1 becomes a possible result.  
> > > > > > Furthermore, the graphical output confirms that this execution has a 
> > > > > > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > > > > > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> > > > > 
> > > > > By default, herd7 performs some edges removal when generating the
> > > > > graphical outputs.  The option -showraw can be useful to increase
> > > > > the "verbosity", for example,
> > > > > 
> > > > >   [with "exists (x=2)", output in /tmp/T.dot]
> > > > >   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> > > > 
> > > > Okay, thanks, that helps a lot.
> > > > 
> > > > So here's what we've got.  The litmus test:
> > > > 
> > > > 
> > > > C hb-and-int
> > > > {}
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > >     *x = 1;
> > > >     smp_store_release(y, 1);
> > > > }
> > > > 
> > > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> > > > {
> > > >     spin_lock(l);
> > > >     int r1 = READ_ONCE(*dy);
> > > >     if (r1==1)
> > > >         spin_unlock(l);
> > > > 
> > > >     int r0 = smp_load_acquire(y);
> > > >     if (r0 == 1) {
> > > >         WRITE_ONCE(*dx,1);
> > > >     }
> > > 
> > > The lack of a spin_unlock() when r1!=1 is intentional?
> > 
> > I assume so.
> > 
> > > It is admittedly a cute way to prevent P3 from doing anything
> > > when r1!=1.  And P1 won't do anything if P3 runs first.
> > 
> > Right.
> > 
> > > > }
> > > > 
> > > > P2(int *dx, int *dy)
> > > > {
> > > >     WRITE_ONCE(*dy,READ_ONCE(*dx));
> > > > }
> > > > 
> > > > 
> > > > P3(int *x, spinlock_t *l)
> > > > {
> > > >     spin_lock(l);
> > > >     smp_mb__after_unlock_lock();
> > > >     *x = 2;
> > > > }
> > > > 
> > > > exists (x=2)
> > > > 
> > > > 
> > > > The reason why Wx=1 ->ww-vis Wx=2:
> > > > 
> > > > 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> > > > 
> > > > 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > > > 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> > > > 
> > > > 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > > > 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > > > 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> > > > 
> > > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> > > > 
> > > > This explains why the memory model says there isn't a data race.  This 
> > > > doesn't use the smp_mb__after_unlock_lock at all.
> > > 
> > > You lost me on this one.
> > > 
> > > Suppose that P3 starts first, then P0.  P1 is then stuck at the
> > > spin_lock() because P3 does not release that lock.  P2 goes out for a
> > > pizza.
> > 
> > That wouldn't be a valid execution.  One of the rules in lock.cat says 
> > that a spin_lock() call must read from a spin_unlock() or from an 
> > initial write, which rules out executions in which P3 acquires the lock 
> > first.
> 
> OK, I will bite...
> 
> Why can't P3's spin_lock() read from that initial write?
> 
> > > Why can't the two stores to x by P0 and P3 conflict, resulting in a
> > > data race?
> > 
> > That can't happen in executions where P1 acquires the lock first for the 
> > reason outlined above (P0's store to x propagates to P3 before P3 writes 
> > to x).  And there are no other executions -- basically, herd7 ignores 
> > deadlock scenarios.
> 
> True enough, if P1 gets there first, then P3 never stores to x.
> 
> What I don't understand is why P1 must always get there first.

Ah, is the rule that all processes must complete?

If so, then the only way all processes can complete is if P1 loads 1
from dy, thus releasing the lock.

But that dy=1 load can only happen after P2 has copied dx to dy, and has
stored a 1.  Which in turn only happens if P1's write to dx is ordered
before the lock release.  Which only executes if P1's load from y returned
1, which only happens after P0 stored 1 to y.

Which means that P3 gets the lock only after P0 completes its plain
write to x, so there is no data race.

The reason that P3 cannot go first is that this will prevent P1 from
completing, in turn preventing herd7 from counting that execution.

Or am I still missing a turn in there somewhere?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29 18:44                     ` Paul E. McKenney
@ 2023-01-29 21:43                       ` Boqun Feng
  2023-01-29 23:09                         ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Boqun Feng @ 2023-01-29 21:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrea Parri, Alan Stern, Jonas Oberhauser, will, peterz,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 10:44:03AM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote:
> > > Why can't P3's spin_lock() read from that initial write?
> > 
> > Mmh, sounds like you want to play with something like below?
> > 
> >   Andrea
> > 
> > diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
> > index 6b52f365d73ac..20c3af4511255 100644
> > --- a/tools/memory-model/lock.cat
> > +++ b/tools/memory-model/lock.cat
> > @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
> >  
> >  (* Allow up to one unmatched LKW per location; more must deadlock *)
> >  let UNMATCHED-LKW = LKW \ domain(critical)
> > -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> >  
> >  (* rfi for LF events: link each LKW to the LF events in its critical section *)
> >  let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
> > @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
> >  let rf = rf | rf-lf | rf-ru
> >  
> >  (* Generate all co relations, including LKW events but not UL *)
> > -let co0 = co0 | ([IW] ; loc ; [LKW]) |
> > -	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> > +let co0 = co0 | ([IW] ; loc ; [LKW])
> >  include "cos-opt.cat"
> >  let W = W | UL
> >  let M = R | W
> 
> No idea.  But the following litmus test gets no executions whatsoever,
> so point taken about my missing at least one corner case.  ;-)
> 
> Adding a spin_unlock() to the end of either process allows both to
> run.
> 
> One could argue that this is a bug, but one could equally well argue
> that if you have a deadlock, you have a deadlock.
> 

in lock.cat: 

	(* Allow up to one unmatched LKW per location; more must deadlock *)
	let UNMATCHED-LKW = LKW \ domain(critical)
	empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks

we rule out deadlocks from the execution candidates we care about.

Regards,
Boqun

> Thoughts?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> C lock
> 
> {
> }
> 
> 
> P0(int *a, int *b, spinlock_t *l)
> {
> 	spin_lock(l);
> 	WRITE_ONCE(*a, 1);
> }
> 
> P1(int *a, int *b, spinlock_t *l)
> {
> 	spin_lock(l);
> 	WRITE_ONCE(*b, 1);
> }
> 
> exists (a=1 /\ b=1)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29 17:11             ` Andrea Parri
@ 2023-01-29 22:10               ` Alan Stern
  0 siblings, 0 replies; 39+ messages in thread
From: Alan Stern @ 2023-01-29 22:10 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Jonas Oberhauser, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 06:11:12PM +0100, Andrea Parri wrote:
> > The reason why Wx=1 ->ww-vis Wx=2:
> > 
> > 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> > 
> > 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> > 
> > 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> > 
> > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> 
> To clarify, po-unlock-lock-po is not a subrelation of mb; see what
> happens without the smp_mb__after_unlock_lock().

Ah, thank you again.  That was what I got wrong, and it explains why the 
data race appears with Jonas's patch.

This also points out an important unstated fact: All of the inter-CPU 
extended fences in the memory model are A-cumulative.  In this case the 
fence links Rdy=1 on P1 to Wx=3 on P3.  We know that 0:Wx=1 is visible 
to P1 before the Rdy executes, but what we need is that it is visible to 
P3 before Wx=3 executes.  The fact that the extended fence is strong 
(and therefore A-cumulative) provides this extra guarantee.

Alan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-28 22:59           ` Alan Stern
  2023-01-29  5:17             ` Paul E. McKenney
  2023-01-29 17:11             ` Andrea Parri
@ 2023-01-29 22:19             ` Jonas Oberhauser
  2023-01-30  2:39               ` Alan Stern
  2023-01-30  4:46               ` Paul E. McKenney
  2 siblings, 2 replies; 39+ messages in thread
From: Jonas Oberhauser @ 2023-01-29 22:19 UTC (permalink / raw)
  To: Alan Stern, Andrea Parri
  Cc: paulmck, will, peterz, boqun.feng, npiggin, dhowells, j.alglave,
	luc.maranget, akiyks, dlustig, joel, urezki, quic_neeraju,
	frederic, linux-kernel


Hi all, apologies for the confusion about the litmus test.
I should have explained it better but it seems you mostly figured it out.
As Alan said I'm tricking a little bit by not unlocking in certain 
places to filter out all executions that aren't what I'm looking for.
I didn't have much time when I sent it (hence also the lack of 
explanation and why I haven't responded earlier), so I didn't have time 
to play around with the filter feature to do this the "proper"/non-cute way.
As such it really isn't about deadlocks.

I think one question is whether the distinction between the models could 
be reproduced without using any kind of filtering at all.
I have a feeling it should be possible but I haven't had time to think 
up a litmus test that does that.


On 1/28/2023 11:59 PM, Alan Stern wrote:
> On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
>>> Evidently the plain-coherence check rules out x=1 at the
>>> end, because when I relax that check, x=1 becomes a possible result.
>>> Furthermore, the graphical output confirms that this execution has a
>>> ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1
>>> to Wx=2!  How can this be possible?  It seems like a bug in herd7.
>> By default, herd7 performs some edges removal when generating the
>> graphical outputs.  The option -showraw can be useful to increase
>> the "verbosity", for example,
>>
>>    [with "exists (x=2)", output in /tmp/T.dot]
>>    $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> Okay, thanks, that helps a lot.
>
> So here's what we've got.  The litmus test:
>
>
> C hb-and-int
> {}
>
> P0(int *x, int *y)
> {
>      *x = 1;
>      smp_store_release(y, 1);
> }
>
> P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> {
>      spin_lock(l);
>      int r1 = READ_ONCE(*dy);
>      if (r1==1)
>          spin_unlock(l);
>
>      int r0 = smp_load_acquire(y);
>      if (r0 == 1) {
>          WRITE_ONCE(*dx,1);
>      }
> }
>
> P2(int *dx, int *dy)
> {
>      WRITE_ONCE(*dy,READ_ONCE(*dx));
> }
>
>
> P3(int *x, spinlock_t *l)
> {
>      spin_lock(l);
>      smp_mb__after_unlock_lock();
>      *x = 2;
> }
>
> exists (x=2)
>
>
> The reason why Wx=1 ->ww-vis Wx=2:
>
> 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
>
> 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
>
> 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
>
> [...]
>
> This explains why the memory model says there isn't a data race.  This
> doesn't use the smp_mb__after_unlock_lock at all.

Note as Andrea said that po-unlock-lock-po is generally not in mb (and 
also not otherwise in fence).
Only po-unlock-lock-po;[After-unlock-lock];po is in mb (resp. ...&int in 
fence).
While the example uses smp_mb__after_unlock_lock, the anomaly should 
generally cover any extended fences.


[Snipping in a part of an earlier e-mail you sent]



> I think that herd7_should_  say there is a data race.
>
> On the other hand, I also think that the operational model says there
> isn't.  This is a case where the formal model fails to match the
> operational model.
>
> The operational model says that if W is a release-store on CPU C and W'
> is another store which propagates to C before W executes, then W'
> propagates to every CPU before W does.  (In other words, releases are
> A-cumulative).  But the formal model enforces this rule only in cases
> where the event reading W' on C is po-before W.
>
> In your litmus test, the event reading y=1 on P1 is po-after the
> spin_unlock (which is a release).  Nevertheless, any feasible execution
> requires that P1 must execute Ry=1 before the unlock.  So the
> operational model says that y=1 must propagate to P3 before the unlock
> does, i.e., before P3 executes the spin_lock().  But the formal model
> doesn't require this.


I see now. Somehow I thought stores must execute in program order, but I 
guess it doesn't make sense.
In that sense, W ->xbstar&int X always means W propagates to X's CPU 
before X executes.


> Ideally we would fix this by changing the definition of po-rel to:
>
> 	[M] ; (xbstar & int) ; [Release]
>
> (This is closely related to the use of (xbstar & int) in the definition
> of vis that you asked about.)

This misses the property of release stores that any po-earlier store 
must also execute before the release store.
Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ; 
[Release] but then one could instead move this into the definition of 
cumul-fence.
In fact you'd probably want this for all the propagation fences, so 
cumul-fence and pb should be the right place.
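
Spelled out (purely as a sketch of the shape, not something I have 
checked, and it could only be written at a point where xbstar is already 
available), that would be roughly

	(* hypothetical: extend the cumulative fences by the
	   execution-order induced release ordering *)
	let cumul-fence-ext = cumul-fence | ([M] ; (xbstar & int) ; [Release])

which is exactly where the layering/recursion issue below comes from.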

> Unfortunately we can't do this, because
> po-rel has to be defined long before xbstar.

You could do it, by turning the relation into one massive recursive 
definition.

Thinking about what the options are:
1) accept the difference and run with it by making it consistent inside 
the axiomatic model
2) fix it through the recursive definition, which seems to be quite ugly 
but also consistent with the power operational model as far as I can tell
3) weaken the operational model... somehow
4) just ignore the anomaly
5) ???

Currently my least favorite option is 4) since it seems a bit off that 
the reasoning applies in one specific case of LKMM, more specifically 
the data race definition which should be equivalent to "the order of the 
two races isn't fixed", but here the order isn't fixed but it's a data race.
I think the patch happens to almost do 1) because the xbstar&int at the 
end should already imply ordering through the prop&int <= hb rule.
What would remain is to also exclude rcu-fence somehow.

2) seems the most correct choice but would also complicate LKMM a lot.

Seems like being between a rock and a hard place.
jonas

PS:
>> The other cases of *-pre-bounded seem to work out: they all link stuff via
>> xbstar to the instruction that is linked via po-unlock-lock-po ;
>> [After-unlock-lock] ; po to the potentially racy access, and
>> po-unlock-lock-po;po   is xbstar ; acq-po, which allows closing the gap.
> I could not follow your arguments at all; the writing was too confusing.

Sorry, I collapsed some cases. I'll show an example. I think all the 
other cases are the same.
Let's pick an edge that links two events with ww-vis through
   w-post-bounded ; vis ; w-pre-bounded
where the vis comes from
   cumul-fence* ; rfe? ; [Marked] ;
     (strong-fence ; [Marked] ; xbstar)  <= vis
and the w-pre-bounded came from po-unlock-lock-po;[After-unlock-lock];po 
but not po-unlock-lock-po & int.

Note that such po-unlock-lock-po;[After-unlock-lock];po must come from
   po-rel ; rfe ; acq-po

So we have
   w-post-bounded ;  cumul-fence* ; rfe? ; [Marked] ;
      strong-fence ; [Marked] ; xbstar ; po-rel ; rfe ; acq-po
<=
   w-post-bounded ;  cumul-fence* ; rfe? ; [Marked] ;
      strong-fence ; [Marked] ; xbstar ; hb ; hb ;  acq-po
<=
   w-post-bounded ;  cumul-fence* ; rfe? ; [Marked] ;
      strong-fence ; [Marked] ; xbstar ;                fence
<= ww-vis

All the other cases where w-pre-bounded are used have the shape
     ... ; xbstar ; w-pre-bounded
which can all be handled in the same manner.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29 21:43                       ` Boqun Feng
@ 2023-01-29 23:09                         ` Paul E. McKenney
  2023-01-30  2:18                           ` Alan Stern
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-29 23:09 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Andrea Parri, Alan Stern, Jonas Oberhauser, will, peterz,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote:
> On Sun, Jan 29, 2023 at 10:44:03AM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote:
> > > > Why can't P3's spin_lock() read from that initial write?
> > > 
> > > Mmh, sounds like you want to play with something like below?
> > > 
> > >   Andrea
> > > 
> > > diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
> > > index 6b52f365d73ac..20c3af4511255 100644
> > > --- a/tools/memory-model/lock.cat
> > > +++ b/tools/memory-model/lock.cat
> > > @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
> > >  
> > >  (* Allow up to one unmatched LKW per location; more must deadlock *)
> > >  let UNMATCHED-LKW = LKW \ domain(critical)
> > > -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> > >  
> > >  (* rfi for LF events: link each LKW to the LF events in its critical section *)
> > >  let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
> > > @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
> > >  let rf = rf | rf-lf | rf-ru
> > >  
> > >  (* Generate all co relations, including LKW events but not UL *)
> > > -let co0 = co0 | ([IW] ; loc ; [LKW]) |
> > > -	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> > > +let co0 = co0 | ([IW] ; loc ; [LKW])
> > >  include "cos-opt.cat"
> > >  let W = W | UL
> > >  let M = R | W
> > 
> > No idea.  But the following litmus test gets no executions whatsoever,
> > so point taken about my missing at least one corner case.  ;-)
> > 
> > Adding a spin_unlock() to the end of either process allows both to
> > run.
> > 
> > One could argue that this is a bug, but one could equally well argue
> > that if you have a deadlock, you have a deadlock.
> > 
> 
> in lock.cat: 
> 
> 	(* Allow up to one unmatched LKW per location; more must deadlock *)
> 	let UNMATCHED-LKW = LKW \ domain(critical)
> 	empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> 
> we rule out deadlocks from the execution candidates we care about.

Thank you, Boqun!

							Thanx, Paul

> Regards,
> Boqun
> 
> > Thoughts?
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > C lock
> > 
> > {
> > }
> > 
> > 
> > P0(int *a, int *b, spinlock_t *l)
> > {
> > 	spin_lock(l);
> > 	WRITE_ONCE(*a, 1);
> > }
> > 
> > P1(int *a, int *b, spinlock_t *l)
> > {
> > 	spin_lock(l);
> > 	WRITE_ONCE(*b, 1);
> > }
> > 
> > exists (a=1 /\ b=1)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29 23:09                         ` Paul E. McKenney
@ 2023-01-30  2:18                           ` Alan Stern
  2023-01-30  4:43                             ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Alan Stern @ 2023-01-30  2:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, Andrea Parri, Jonas Oberhauser, will, peterz,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 03:09:00PM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote:
> > in lock.cat: 
> > 
> > 	(* Allow up to one unmatched LKW per location; more must deadlock *)
> > 	let UNMATCHED-LKW = LKW \ domain(critical)
> > 	empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> > 
> > we rule out deadlocks from the execution candidates we care about.
> 
> Thank you, Boqun!

Actually that's only part of it.  The other part is rather obscure:

(* Generate all co relations, including LKW events but not UL *)
let co0 = co0 | ([IW] ; loc ; [LKW]) |
	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])

Implicitly this says that any lock with no corresponding unlock must 
come last in the coherence order, which implies the unmatched-locks rule 
(since only one lock event can be last).  By itself, the unmatched-locks 
rule would not prevent P3 from executing before P1, provided P1 executes 
both its lock and unlock.

Alan


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29 22:19             ` Jonas Oberhauser
@ 2023-01-30  2:39               ` Alan Stern
  2023-01-30  4:36                 ` Paul E. McKenney
  2023-01-31 13:56                 ` Jonas Oberhauser
  2023-01-30  4:46               ` Paul E. McKenney
  1 sibling, 2 replies; 39+ messages in thread
From: Alan Stern @ 2023-01-30  2:39 UTC (permalink / raw)
  To: Jonas Oberhauser
  Cc: Andrea Parri, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> I see now. Somehow I thought stores must execute in program order, but I
> guess it doesn't make sense.
> In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> X executes.

It also means any write that propagates to W's CPU before W executes 
also propagates to X's CPU before X executes (because it's the same CPU 
and W executes before X).

> > Ideally we would fix this by changing the definition of po-rel to:
> > 
> > 	[M] ; (xbstar & int) ; [Release]
> > 
> > (This is closely related to the use of (xbstar & int) in the definition
> > of vis that you asked about.)
> 
> This misses the property of release stores that any po-earlier store must
> also execute before the release store.

I should have written:

	[M] ; (po | (xbstar & int)) ; [Release]

> Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ;
> [Release] but then one could instead move this into the definition of
> cumul-fence.
> In fact you'd probably want this for all the propagation fences, so
> cumul-fence and pb should be the right place.
> 
> > Unfortunately we can't do this, because
> > po-rel has to be defined long before xbstar.
> 
> You could do it, by turning the relation into one massive recursive
> definition.

Which would make pretty much the entire memory model one big recursion.  
I do not want to do that.

> Thinking about what the options are:
> 1) accept the difference and run with it by making it consistent inside the
> axiomatic model
> 2) fix it through the recursive definition, which seems to be quite ugly but
> also consistent with the power operational model as far as I can tell
> 3) weaken the operational model... somehow
> 4) just ignore the anomaly
> 5) ???
> 
> Currently my least favorite option is 4) since it seems a bit off that the
> reasoning applies in one specific case of LKMM, more specifically the data
> race definition which should be equivalent to "the order of the two races
> isn't fixed", but here the order isn't fixed but it's a data race.
> I think the patch happens to almost do 1) because the xbstar&int at the end
> should already imply ordering through the prop&int <= hb rule.
> What would remain is to also exclude rcu-fence somehow.

IMO 1) is the best choice.

Alan

PS: For the record, here's a simpler litmus test that illustrates the 
failure.  The idea is that Wz=1 is reordered before the store-release, 
so it ought to propagate before Wy=1.  The LKMM does not require this.


C before-release

{}

P0(int *x, int *y, int *z)
{
	int r1;

	r1 = READ_ONCE(*x);
	smp_store_release(y, 1);
	WRITE_ONCE(*z, 1);
}

P1(int *x, int *y, int *z)
{
	int r2;

	r2 = READ_ONCE(*z);
	WRITE_ONCE(*x, r2);
}

P2(int *x, int *y, int *z)
{
	int r3;
	int r4;

	r3 = READ_ONCE(*y);
	smp_rmb();
	r4 = READ_ONCE(*z);
}

exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
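
To check this against the model, the test can be run the same way as the
other examples in this thread (the file name here is only a guess):

$ herd7 -conf linux-kernel.cfg before-release.litmus

Whether the "exists" clause is reachable then shows up in the Observation
line of the output.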

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-30  2:39               ` Alan Stern
@ 2023-01-30  4:36                 ` Paul E. McKenney
  2023-01-30 16:47                   ` Alan Stern
  2023-01-31 13:56                 ` Jonas Oberhauser
  1 sibling, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-30  4:36 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jonas Oberhauser, Andrea Parri, will, peterz, boqun.feng,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > I see now. Somehow I thought stores must execute in program order, but I
> > guess it doesn't make sense.
> > In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> > X executes.
> 
> It also means any write that propagates to W's CPU before W executes 
> also propagates to X's CPU before X executes (because it's the same CPU 
> and W executes before X).
> 
> > > Ideally we would fix this by changing the definition of po-rel to:
> > > 
> > > 	[M] ; (xbstar & int) ; [Release]
> > > 
> > > (This is closely related to the use of (xbstar & int) in the definition
> > > of vis that you asked about.)
> > 
> > This misses the property of release stores that any po-earlier store must
> > also execute before the release store.
> 
> I should have written:
> 
> 	[M] ; (po | (xbstar & int)) ; [Release]
> 
> > Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ;
> > [Release] but then one could instead move this into the definition of
> > cumul-fence.
> > In fact you'd probably want this for all the propagation fences, so
> > cumul-fence and pb should be the right place.
> > 
> > > Unfortunately we can't do this, because
> > > po-rel has to be defined long before xbstar.
> > 
> > You could do it, by turning the relation into one massive recursive
> > definition.
> 
> Which would make pretty much the entire memory model one big recursion.  
> I do not want to do that.
> 
> > Thinking about what the options are:
> > 1) accept the difference and run with it by making it consistent inside the
> > axiomatic model
> > 2) fix it through the recursive definition, which seems to be quite ugly but
> > also consistent with the power operational model as far as I can tell
> > 3) weaken the operational model... somehow
> > 4) just ignore the anomaly
> > 5) ???
> > 
> > Currently my least favorite option is 4) since it seems a bit off that the
> > reasoning applies in one specific case of LKMM, more specifically the data
> > race definition which should be equivalent to "the order of the two races
> > isn't fixed", but here the order isn't fixed but it's a data race.
> > I think the patch happens to almost do 1) because the xbstar&int at the end
> > should already imply ordering through the prop&int <= hb rule.
> > What would remain is to also exclude rcu-fence somehow.
> 
> IMO 1) is the best choice.
> 
> Alan
> 
> PS: For the record, here's a simpler litmus test that illustrates the 
> failure.  The idea is that Wz=1 is reordered before the store-release, 
> so it ought to propagate before Wy=1.  The LKMM does not require this.

In PowerPC terms, would this be like having the Wz=1 being reordered
before the Wy=1, but not before the lwsync instruction preceding the
Wy=1 that makes it a release store?

If so, we might have to keep this quirk.

							Thanx, Paul

> C before-release
> 
> {}
> 
> P0(int *x, int *y, int *z)
> {
> 	int r1;
> 
> 	r1 = READ_ONCE(*x);
> 	smp_store_release(y, 1);
> 	WRITE_ONCE(*z, 1);
> }
> 
> P1(int *x, int *y, int *z)
> {
> 	int r2;
> 
> 	r2 = READ_ONCE(*z);
> 	WRITE_ONCE(*x, r2);
> }
> 
> P2(int *x, int *y, int *z)
> {
> 	int r3;
> 	int r4;
> 
> 	r3 = READ_ONCE(*y);
> 	smp_rmb();
> 	r4 = READ_ONCE(*z);
> }
> 
> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-30  2:18                           ` Alan Stern
@ 2023-01-30  4:43                             ` Paul E. McKenney
  0 siblings, 0 replies; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-30  4:43 UTC (permalink / raw)
  To: Alan Stern
  Cc: Boqun Feng, Andrea Parri, Jonas Oberhauser, will, peterz,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 09:18:24PM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 03:09:00PM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote:
> > > in lock.cat: 
> > > 
> > > 	(* Allow up to one unmatched LKW per location; more must deadlock *)
> > > 	let UNMATCHED-LKW = LKW \ domain(critical)
> > > 	empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> > > 
> > > we rule out deadlocks from the execution candidates we care about.
> > 
> > Thank you, Boqun!
> 
> Actually that's only part of it.  The other part is rather obscure:
> 
> (* Generate all co relations, including LKW events but not UL *)
> let co0 = co0 | ([IW] ; loc ; [LKW]) |
> 	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> 
> Implicitly this says that any lock with no corresponding unlock must 
> come last in the coherence order, which implies the unmatched-locks rule 
> (since only one lock event can be last).  By itself, the unmatched-locks 
> rule would not prevent P3 from executing before P1, provided P1 executes 
> both its lock and unlock.

And thank you, Alan, as well!

And RCU looks to operate in a similar manner:

------------------------------------------------------------------------

C rcudeadlock

{
}


P0(int *a, int *b)
{
	rcu_read_lock();
	WRITE_ONCE(*a, 1);
	synchronize_rcu();
	WRITE_ONCE(*b, 1);
	rcu_read_unlock();
}

P1(int *a, int *b)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*b);
	smp_mb();
	r2 = READ_ONCE(*a);
}

exists (1:r1=1 /\ 1:r2=0)

------------------------------------------------------------------------

$ herd7 -conf linux-kernel.cfg rcudeadlock.litmus 
Test rcudeadlock Allowed
States 0
No
Witnesses
Positive: 0 Negative: 0
Condition exists (1:r1=1 /\ 1:r2=0)
Observation rcudeadlock Never 0 0
Time rcudeadlock 0.00
Hash=4f7f336ad39d724d93b089133b00d1e2

------------------------------------------------------------------------

So good enough!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-29 22:19             ` Jonas Oberhauser
  2023-01-30  2:39               ` Alan Stern
@ 2023-01-30  4:46               ` Paul E. McKenney
  1 sibling, 0 replies; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-30  4:46 UTC (permalink / raw)
  To: Jonas Oberhauser
  Cc: Alan Stern, Andrea Parri, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> 
> Hi all, apologies for the confusion about the litmus test.
> I should have explained it better but it seems you mostly figured it out.
> As Alan said I'm tricking a little bit by not unlocking in certain places to
> filter out all executions that aren't what I'm looking for.
> I didn't have much time when I sent it (hence also the lack of explanation
> and why I haven't responded earlier), so I didn't have time to play around
> with the filter feature to do this the "proper"/non-cute way.
> As such it really isn't about deadlocks.

Not a problem!

> I think one question is whether the distinction between the models could be
> reproduced without using any kind of filtering at all.
> I have a feeling it should be possible but I haven't had time to think up a
> litmus test that does that.

Here is an example litmus test using filter, if this helps.

You put it right before the "exists" clause, and the condition is the
same as in the "exists" clause.  If an execution does not satisfy the
condition in the filter clause, it is tossed.

							Thanx, Paul

------------------------------------------------------------------------

C C-srcu-nest-6

(*
 * Result: Never
 *
 * This would be valid for srcu_down_read() and srcu_up_read().
 *)

{}

P0(int *x, int *y, struct srcu_struct *s1, int *idx, int *f)
{
	int r2;
	int r3;

	r3 = srcu_down_read(s1);
	WRITE_ONCE(*idx, r3);
	r2 = READ_ONCE(*y);
	smp_store_release(f, 1);
}

P1(int *x, int *y, struct srcu_struct *s1, int *idx, int *f)
{
	int r1;
	int r3;
	int r4;

	r4 = smp_load_acquire(f);
	r1 = READ_ONCE(*x);
	r3 = READ_ONCE(*idx);
	srcu_up_read(s1, r3);
}

P2(int *x, int *y, struct srcu_struct *s1)
{
	WRITE_ONCE(*y, 1);
	synchronize_srcu(s1);
	WRITE_ONCE(*x, 1);
}

locations [0:r1]
filter (1:r4=1)
exists (1:r1=1 /\ 0:r2=0)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-30  4:36                 ` Paul E. McKenney
@ 2023-01-30 16:47                   ` Alan Stern
  2023-01-30 16:50                     ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Alan Stern @ 2023-01-30 16:47 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Jonas Oberhauser, Andrea Parri, will, peterz, boqun.feng,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Sun, Jan 29, 2023 at 08:36:45PM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> > On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > > I see now. Somehow I thought stores must execute in program order, but I
> > > guess it doesn't make sense.
> > > In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> > > X executes.
> > 
> > It also means any write that propagates to W's CPU before W executes 
> > also propagates to X's CPU before X executes (because it's the same CPU 
> > and W executes before X).
> > 
> > > > Ideally we would fix this by changing the definition of po-rel to:
> > > > 
> > > > 	[M] ; (xbstar & int) ; [Release]
> > > > 
> > > > (This is closely related to the use of (xbstar & int) in the definition
> > > > of vis that you asked about.)
> > > 
> > > This misses the property of release stores that any po-earlier store must
> > > also execute before the release store.
> > 
> > I should have written:
> > 
> > 	[M] ; (po | (xbstar & int)) ; [Release]
> > 
> > > Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ;
> > > [Release] but then one could instead move this into the definition of
> > > cumul-fence.
> > > In fact you'd probably want this for all the propagation fences, so
> > > cumul-fence and pb should be the right place.
> > > 
> > > > Unfortunately we can't do this, because
> > > > po-rel has to be defined long before xbstar.
> > > 
> > > You could do it, by turning the relation into one massive recursive
> > > definition.
> > 
> > Which would make pretty much the entire memory model one big recursion.  
> > I do not want to do that.
> > 
> > > Thinking about what the options are:
> > > 1) accept the difference and run with it by making it consistent inside the
> > > axiomatic model
> > > 2) fix it through the recursive definition, which seems to be quite ugly but
> > > also consistent with the power operational model as far as I can tell
> > > 3) weaken the operational model... somehow
> > > 4) just ignore the anomaly
> > > 5) ???
> > > 
> > > Currently my least favorite option is 4) since it seems a bit off that the
> > > reasoning applies in one specific case of LKMM, more specifically the data
> > > race definition which should be equivalent to "the order of the two races
> > > isn't fixed", but here the order isn't fixed but it's a data race.
> > > I think the patch happens to almost do 1) because the xbstar&int at the end
> > > should already imply ordering through the prop&int <= hb rule.
> > > What would remain is to also exclude rcu-fence somehow.
> > 
> > IMO 1) is the best choice.
> > 
> > Alan
> > 
> > PS: For the record, here's a simpler litmus test that illustrates the 
> > failure.  The idea is that Wz=1 is reordered before the store-release, 
> > so it ought to propagate before Wy=1.  The LKMM does not require this.
> 
> > In PowerPC terms, would this be like having the Wz=1 being reordered
> > before the Wy=1, but not before the lwsync instruction preceding the
> > Wy=1 that makes it a release store?

No, it would be like having the Wz=1 reordered before the Rx=1, 
therefore before the lwsync.  Obviously this can't ever happen on 
PowerPC.

Alan

> If so, we might have to keep this quirk.
> 
> 							Thanx, Paul
> 
> > C before-release
> > 
> > {}
> > 
> > P0(int *x, int *y, int *z)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*x);
> > 	smp_store_release(y, 1);
> > 	WRITE_ONCE(*z, 1);
> > }
> > 
> > P1(int *x, int *y, int *z)
> > {
> > 	int r2;
> > 
> > 	r2 = READ_ONCE(*z);
> > 	WRITE_ONCE(*x, r2);
> > }
> > 
> > P2(int *x, int *y, int *z)
> > {
> > 	int r3;
> > 	int r4;
> > 
> > 	r3 = READ_ONCE(*y);
> > 	smp_rmb();
> > 	r4 = READ_ONCE(*z);
> > }
> > 
> > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-30 16:47                   ` Alan Stern
@ 2023-01-30 16:50                     ` Paul E. McKenney
  0 siblings, 0 replies; 39+ messages in thread
From: Paul E. McKenney @ 2023-01-30 16:50 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jonas Oberhauser, Andrea Parri, will, peterz, boqun.feng,
	npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig,
	joel, urezki, quic_neeraju, frederic, linux-kernel

On Mon, Jan 30, 2023 at 11:47:50AM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 08:36:45PM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> > > On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > > > I see now. Somehow I thought stores must execute in program order, but I
> > > > guess it doesn't make sense.
> > > > In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> > > > X executes.
> > > 
> > > It also means any write that propagates to W's CPU before W executes 
> > > also propagates to X's CPU before X executes (because it's the same CPU 
> > > and W executes before X).
> > > 
> > > > > Ideally we would fix this by changing the definition of po-rel to:
> > > > > 
> > > > > 	[M] ; (xbstar & int) ; [Release]
> > > > > 
> > > > > (This is closely related to the use of (xbstar & int) in the definition
> > > > > of vis that you asked about.)
> > > > 
> > > > This misses the property of release stores that any po-earlier store must
> > > > also execute before the release store.
> > > 
> > > I should have written:
> > > 
> > > 	[M] ; (po | (xbstar & int)) ; [Release]
> > > 
> > > > Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ;
> > > > [Release] but then one could instead move this into the definition of
> > > > cumul-fence.
> > > > In fact you'd probably want this for all the propagation fences, so
> > > > cumul-fence and pb should be the right place.
> > > > 
> > > > > Unfortunately we can't do this, because
> > > > > po-rel has to be defined long before xbstar.
> > > > 
> > > > You could do it, by turning the relation into one massive recursive
> > > > definition.
> > > 
> > > Which would make pretty much the entire memory model one big recursion.  
> > > I do not want to do that.
> > > 
> > > > Thinking about what the options are:
> > > > 1) accept the difference and run with it by making it consistent inside the
> > > > axiomatic model
> > > > 2) fix it through the recursive definition, which seems to be quite ugly but
> > > > also consistent with the power operational model as far as I can tell
> > > > 3) weaken the operational model... somehow
> > > > 4) just ignore the anomaly
> > > > 5) ???
> > > > 
> > > > Currently my least favorite option is 4) since it seems a bit off that the
> > > > reasoning applies in one specific case of LKMM, more specifically the data
> > > > race definition which should be equivalent to "the order of the two races
> > > > isn't fixed", but here the order isn't fixed but it's a data race.
> > > > I think the patch happens to almost do 1) because the xbstar&int at the end
> > > > should already imply ordering through the prop&int <= hb rule.
> > > > What would remain is to also exclude rcu-fence somehow.
> > > 
> > > IMO 1) is the best choice.
> > > 
> > > Alan
> > > 
> > > PS: For the record, here's a simpler litmus test that illustrates the 
> > > failure.  The idea is that Wz=1 is reordered before the store-release, 
> > > so it ought to propagate before Wy=1.  The LKMM does not require this.
> > 
> > In PowerPC terms, would this be like having the Wz=1 being reordered
> > before the Wy=1, but not before the lwsync instruction preceding the
> > Wy=1 that makes it a release store?
> 
> No, it would be like having the Wz=1 reordered before the Rx=1, 
> therefore before the lwsync.  Obviously this can't ever happen on 
> PowerPC.

Whew!!!  ;-)

							Thanx, Paul

> Alan
> 
> > If so, we might have to keep this quirk.
> > 
> > 							Thanx, Paul
> > 
> > > C before-release
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y, int *z)
> > > {
> > > 	int r1;
> > > 
> > > 	r1 = READ_ONCE(*x);
> > > 	smp_store_release(y, 1);
> > > 	WRITE_ONCE(*z, 1);
> > > }
> > > 
> > > P1(int *x, int *y, int *z)
> > > {
> > > 	int r2;
> > > 
> > > 	r2 = READ_ONCE(*z);
> > > 	WRITE_ONCE(*x, r2);
> > > }
> > > 
> > > P2(int *x, int *y, int *z)
> > > {
> > > 	int r3;
> > > 	int r4;
> > > 
> > > 	r3 = READ_ONCE(*y);
> > > 	smp_rmb();
> > > 	r4 = READ_ONCE(*z);
> > > }
> > > 
> > > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-30  2:39               ` Alan Stern
  2023-01-30  4:36                 ` Paul E. McKenney
@ 2023-01-31 13:56                 ` Jonas Oberhauser
  2023-01-31 15:06                   ` Alan Stern
  1 sibling, 1 reply; 39+ messages in thread
From: Jonas Oberhauser @ 2023-01-31 13:56 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andrea Parri, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel



On 1/30/2023 3:39 AM, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
>> You could do it, by turning the relation into one massive recursive
>> definition.
> Which would make pretty much the entire memory model one big recursion.
> I do not want to do that.

Neither do I :D

>
>> Thinking about what the options are:
>> 1) accept the difference and run with it by making it consistent inside the
>> axiomatic model
>> 2) fix it through the recursive definition, which seems to be quite ugly but
>> also consistent with the power operational model as far as I can tell
>> 3) weaken the operational model... somehow
>> 4) just ignore the anomaly
>> 5) ???
>>
>> Currently my least favorite option is 4) since it seems a bit off that the
>> reasoning applies in one specific case of LKMM, more specifically the data
>> race definition which should be equivalent to "the order of the two races
>> isn't fixed", but here the order isn't fixed but it's a data race.
>> I think the patch happens to almost do 1) because the xbstar&int at the end
>> should already imply ordering through the prop&int <= hb rule.
>> What would remain is to also exclude rcu-fence somehow.
> IMO 1) is the best choice.

I have some additional thoughts now. It seems that you could weaken the 
operational model by stating that an A-cumulative fence orders 
propagation of all *external* stores (in addition to all po-earlier 
stores) that propagated to you before the fence is executed.

It seems that on power, from an operational model perspective, there's 
currently no difference between propagation fences ordering all stores 
vs only external stores that propagated to the CPU before the fence is 
executed, because they only have bidirectional (*->W) fences (sync, 
lwsync) and not uni-directional (acquire, release), and so it is not 
possible for a store that is po-later than the barrier to be executed 
before the barrier; i.e., on power, every internal store that propagates 
to a CPU before the fence executes is also po-earlier than the fence.

If power did introduce release stores, I think you could potentially 
create implementations that allow the behavior in the example you have 
given, but I don't think they are the most natural ones:

> {}
>
> P0(int *x, int *y, int *z)
> {
> 	int r1;
>
> 	r1 = READ_ONCE(*x);
> 	smp_store_release(y, 1);
> 	WRITE_ONCE(*z, 1);
> }
>
> P1(int *x, int *y, int *z)
> {
> 	int r2;
>
> 	r2 = READ_ONCE(*z);
> 	WRITE_ONCE(*x, r2);
> }
>
> P2(int *x, int *y, int *z)
> {
> 	int r3;
> 	int r4;
>
> 	r3 = READ_ONCE(*y);
> 	smp_rmb();
> 	r4 = READ_ONCE(*z);
> }
>
> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

I could imagine that P0 posts both of its stores in a shared store 
buffer before reading *x, but marks the release store as "not ready".
Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 
reads, and subsequently marks its release store as "ready".
Then the release store is sent to the cache, where P2 reads *y=1 and 
then *z=0.
Finally P0 sends its *z=1 store to the cache.

However, a perhaps more natural implementation would not post the 
release store to the store buffer until it is "ready", in which case the 
order in the store buffer would be *z=1 before *y=1, and in this case 
the release ordering would presumably work like your current operational 
model.

Nevertheless, perhaps this slightly weaker operational model isn't as 
absurd as it sounds. And I think many people wouldn't be shocked if the 
release store didn't provide ordering with *z=1.

Best wishes, jonas


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-31 13:56                 ` Jonas Oberhauser
@ 2023-01-31 15:06                   ` Alan Stern
  2023-01-31 15:33                     ` Jonas Oberhauser
  0 siblings, 1 reply; 39+ messages in thread
From: Alan Stern @ 2023-01-31 15:06 UTC (permalink / raw)
  To: Jonas Oberhauser
  Cc: Andrea Parri, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
> I have some additional thoughts now. It seems that you could weaken the
> operational model by stating that an A-cumulative fence orders propagation
> of all *external* stores (in addition to all po-earlier stores) that
> propagated to you before the fence is executed.

How is that a weakening of the operational model?  It's what the 
operational model says right now.  From explanation.txt:

	Release fences, such as smp_store_release(), force the CPU to
	execute all po-earlier instructions before the store
	associated with the fence (e.g., the store part of an
	smp_store_release()).

...  When a fence instruction is executed on CPU C:

	...

	For each other CPU C', any store which propagates to C before
	a release fence is executed (including all po-earlier
	stores executed on C) is forced to propagate to C' before the
	store associated with the release fence does.

In theory, we could weaken the operational model by saying that pfences 
order propagation of stores from other CPUs only when those stores are 
read-from by instructions po-before the fence.  But I suspect that's not 
such a good idea.
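
As a concrete illustration of the release-fence guarantee quoted above from
explanation.txt, here is a sketch of the classic WRC-style pattern (the test
and variable names are chosen here just for the sketch): if P1 reads x=1
before its release store executes, the quoted rule forces Wx=1 to propagate
to P2 before Wy=1 does, so the "exists" clause below cannot be satisfied.

C WRC-release-sketch

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*x);
	smp_store_release(y, 1);
}

P2(int *x, int *y)
{
	int r2;
	int r3;

	r2 = READ_ONCE(*y);
	smp_rmb();
	r3 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)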

> It seems that on power, from an operational model perspective, there's
> currently no difference between propagation fences ordering all stores vs
> only external stores that propagated to the CPU before the fence is
> executed, because they only have bidirectional (*->W) fences (sync, lwsync)
> and not uni-directional (acquire, release), and so it is not possible for a
> store that is po-later than the barrier to be executed before the barrier;
> i.e., on power, every internal store that propagates to a CPU before the
> fence executes is also po-earlier than the fence.
> 
> If power did introduce release stores, I think you could potentially create
> implementations that allow the behavior in the example you have given, but I
> don't think they are the most natural ones:

Maybe so.  In any case, it's a moot point.  In fact, I don't know if any 
architecture supporting Linux allows a write that is po-after a release 
store to be reordered before the release store.

> > P0(int *x, int *y, int *z)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*x);
> > 	smp_store_release(y, 1);
> > 	WRITE_ONCE(*z, 1);
> > }
> > 
> > P1(int *x, int *y, int *z)
> > {
> > 	int r2;
> > 
> > 	r2 = READ_ONCE(*z);
> > 	WRITE_ONCE(*x, r2);
> > }
> > 
> > P2(int *x, int *y, int *z)
> > {
> > 	int r3;
> > 	int r4;
> > 
> > 	r3 = READ_ONCE(*y);
> > 	smp_rmb();
> > 	r4 = READ_ONCE(*z);
> > }
> > 
> > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
> 
> I could imagine that P0 posts both of its stores in a shared store buffer
> before reading *x, but marks the release store as "not ready".
> Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
> and subsequently marks its release store as "ready".

That isn't how release stores are meant to work.  The read of x is 
supposed to be complete before the release store becomes visible to any 
other CPU.  This is true even in C11.

> Then the release store is sent to the cache, where P2 reads *y=1 and then
> *z=0.
> Finally P0 sends its *z=1 store to the cache.
> 
> However, a perhaps more natural implementation would not post the release
> store to the store buffer until it is "ready", in which case the order in
> the store buffer would be *z=1 before *y=1, and in this case the release
> ordering would presumably work like your current operational model.
> 
> Nevertheless, perhaps this slightly weaker operational model isn't as absurd
> as it sounds. And I think many people wouldn't be shocked if the release
> store didn't provide ordering with *z=1.

This issue is one we should discuss with all the other people involved 
in maintaining the LKMM.

Alan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-31 15:06                   ` Alan Stern
@ 2023-01-31 15:33                     ` Jonas Oberhauser
  2023-01-31 16:55                       ` Alan Stern
  0 siblings, 1 reply; 39+ messages in thread
From: Jonas Oberhauser @ 2023-01-31 15:33 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andrea Parri, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel



On 1/31/2023 4:06 PM, Alan Stern wrote:
> On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
>> I have some additional thoughts now. It seems that you could weaken the
>> operational model by stating that an A-cumulative fence orders propagation
>> of all *external* stores (in addition to all po-earlier stores) that
>> propagated to you before the fence is executed.
> How is that a weakening of the operational model?  It's what the
> operational model says right now.

No, as in the part that you have quoted, it is stated that an 
A-cumulative fence orders propagation of *all* stores that propagated 
to you before the fence is executed.
I'm saying you could weaken this to only cover all *external* stores.

More precisely, I would change

> 	For each other CPU C', any store which propagates to C before
> 	a release fence is executed (including all po-earlier
> 	stores executed on C) is forced to propagate to C' before the
> 	store associated with the release fence does.

Into something like


      For each other CPU C', any *external* store which propagates to C
      before a release fence is executed as well as any po-earlier
      store executed on C is forced to propagate to C' before the
      store associated with the release fence does.

The difference is that po-later stores that happen to propagate to C 
before the release fence is executed would no longer be ordered.
That should be consistent with the axiomatic model.


>
> In theory, we could weaken the operational model by saying that pfences
> order propagation of stores from other CPUs only when those stores are
> read-from by instructions po-before the fence.  But I suspect that's not
> such a good idea.

That indeed looks too confusing.


>> It seems that on power, from an operational model perspective, there's
>> currently no difference between propagation fences ordering all stores vs
>> only external stores that propagated to the CPU before the fence is
>> executed, because they only have bidirectional (*->W) fences (sync, lwsync)
>> and not uni-directional (acquire, release), and so it is not possible for a
>> store that is po-later than the barrier to be executed before the barrier;
>> i.e., on power, every internal store that propagates to a CPU before the
>> fence executes is also po-earlier than the fence.
>>
>> If power did introduce release stores, I think you could potentially create
>> implementations that allow the behavior in the example you have given, but I
>> don't think they are the most natural ones:
> Maybe so.  In any case, it's a moot point.  In fact, I don't know if any
> architecture supporting Linux allows a write that is po-after a release
> store to be reordered before the release store.

Arm and RISC-V do, but they are multi-copy-atomic anyway.

>
>>> P0(int *x, int *y, int *z)
>>> {
>>> 	int r1;
>>>
>>> 	r1 = READ_ONCE(*x);
>>> 	smp_store_release(y, 1);
>>> 	WRITE_ONCE(*z, 1);
>>> }
>>>
>>> P1(int *x, int *y, int *z)
>>> {
>>> 	int r2;
>>>
>>> 	r2 = READ_ONCE(*z);
>>> 	WRITE_ONCE(*x, r2);
>>> }
>>>
>>> P2(int *x, int *y, int *z)
>>> {
>>> 	int r3;
>>> 	int r4;
>>>
>>> 	r3 = READ_ONCE(*y);
>>> 	smp_rmb();
>>> 	r4 = READ_ONCE(*z);
>>> }
>>>
>>> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
>> I could imagine that P0 posts both of its stores in a shared store buffer
>> before reading *x, but marks the release store as "not ready".
>> Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
>> and subsequently marks its release store as "ready".
> That isn't how release stores are meant to work.  The read of x is
> supposed to be complete before the release store becomes visible to any
> other CPU.

Note that the release store isn't observed until it becomes "ready", so 
it is really indistinguishable whether it had become visible to any 
other CPU.
Indeed stores that aren't marked "ready" would be ignored during 
forwarding, and not allowed to be pushed to the cache.

The reason this kind of implementation seems less natural to me is that 
such a "not ready" store would need to be pushed back in the buffer (if 
it is the head of the buffer and the cache is ready to take a store), 
stall the later stores, or be aborted until it becomes ready.
That just seems to create a lot of hassle for no discernible benefit.
A "not ready" store probably shouldn't be put into a store queue, even 
if the only reason it is not ready is that there are some otherwise 
unrelated reads that haven't completed yet.



> This is true even in C11.

Arguable... The following pseudo-code litmus test should demonstrate this:

P0 {
    int r = read_relaxed(&x);
    store_release(&y,1);
}


P1 {
    int s = read_relaxed(&y);
    store_release(&x,1);
}

In C11, it should be possible to read r==s==1.
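
A minimal, self-contained C11 rendering of that pseudo-code, offered only as
a sketch (the thread runner and all names are illustrative, and <threads.h>
is an optional part of C11):

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static atomic_int x, y;
static int r, s;

static int t0(void *arg)
{
	(void)arg;
	r = atomic_load_explicit(&x, memory_order_relaxed);
	atomic_store_explicit(&y, 1, memory_order_release);
	return 0;
}

static int t1(void *arg)
{
	(void)arg;
	s = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_store_explicit(&x, 1, memory_order_release);
	return 0;
}

int main(void)
{
	thrd_t a, b;

	thrd_create(&a, t0, NULL);
	thrd_create(&b, t1, NULL);
	thrd_join(a, NULL);
	thrd_join(b, NULL);
	/* C11 allows r == 1 && s == 1 here: the loads are relaxed, so
	 * neither release store synchronizes with anything. */
	printf("r=%d s=%d\n", r, s);
	return 0;
}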


>> Then the release store is sent to the cache, where P2 reads *y=1 and then
>> *z=0.
>> Finally P0 sends its *z=1 store to the cache.
>>
>> However, a perhaps more natural implementation would not post the release
>> store to the store buffer until it is "ready", in which case the order in
>> the store buffer would be *z=1 before *y=1, and in this case the release
>> ordering would presumably work like your current operational model.
>>
>> Nevertheless, perhaps this slightly weaker operational model isn't as absurd
>> as it sounds. And I think many people wouldn't be shocked if the release
>> store didn't provide ordering with *z=1.
> This issue is one we should discuss with all the other people involved
> in maintaining the LKMM.
>
> Alan

Sure.

Btw, how to proceed for your SRCU patch and this one?
Are you planning to make any changes? I think the version you have is ok 
if you don't think the patch is improved by anything I brought up.

Any additional concerns/changes for this patch?

Best wishes, jonas


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-31 15:33                     ` Jonas Oberhauser
@ 2023-01-31 16:55                       ` Alan Stern
  2023-02-01 10:37                         ` Jonas Oberhauser
  0 siblings, 1 reply; 39+ messages in thread
From: Alan Stern @ 2023-01-31 16:55 UTC (permalink / raw)
  To: Jonas Oberhauser
  Cc: Andrea Parri, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel

On Tue, Jan 31, 2023 at 04:33:25PM +0100, Jonas Oberhauser wrote:
> 
> 
> On 1/31/2023 4:06 PM, Alan Stern wrote:
> > On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
> > > I have some additional thoughts now. It seems that you could weaken the
> > > operational model by stating that an A-cumulative fence orders propagation
> > > of all *external* stores (in addition to all po-earlier stores) that
> > > propagated to you before the fence is executed.
> > How is that a weakening of the operational model?  It's what the
> > operational model says right now.
> 
> No, as in the part that you have quoted, it is stated that an A-cumulative
> fence orders propagation of *all* stores that propagated to you before the
> fence is executed.
> I'm saying you could weaken this to only cover all *external* stores.

Okay, now I understand.

> More precisely, I would change
> 
> > 	For each other CPU C', any store which propagates to C before
> > 	a release fence is executed (including all po-earlier
> > 	stores executed on C) is forced to propagate to C' before the
> > 	store associated with the release fence does.
> 
> Into something like
> 
> 
>      For each other CPU C', any *external* store which propagates to C
>      before a release fence is executed as well as any po-earlier
>      store executed on C is forced to propagate to C' before the
>      store associated with the release fence does.
> 
> The difference is that po-later stores that happen to propagate to C before
> the release fence is executed would no longer be ordered.
> That should be consistent with the axiomatic model.

I had to check that it wouldn't affect the (xbstar & int) part of vis, 
but it looks all right.  This seems like a reasonable change.

However, it only fixes part of the problem.  Suppose an external write 
is read by an instruction po-after the release-store, but the read 
executes before the release-store.  The operational model would still 
say the external write has to obey the propagation ordering, whereas the 
formal model doesn't require it.

> > > > P0(int *x, int *y, int *z)
> > > > {
> > > > 	int r1;
> > > > 
> > > > 	r1 = READ_ONCE(*x);
> > > > 	smp_store_release(y, 1);
> > > > 	WRITE_ONCE(*z, 1);
> > > > }
> > > > 
> > > > P1(int *x, int *y, int *z)
> > > > {
> > > > 	int r2;
> > > > 
> > > > 	r2 = READ_ONCE(*z);
> > > > 	WRITE_ONCE(*x, r2);
> > > > }
> > > > 
> > > > P2(int *x, int *y, int *z)
> > > > {
> > > > 	int r3;
> > > > 	int r4;
> > > > 
> > > > 	r3 = READ_ONCE(*y);
> > > > 	smp_rmb();
> > > > 	r4 = READ_ONCE(*z);
> > > > }
> > > > 
> > > > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
> > > I could imagine that P0 posts both of its stores in a shared store buffer
> > > before reading *x, but marks the release store as "not ready".
> > > Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
> > > and subsequently marks its release store as "ready".
> > That isn't how release stores are meant to work.  The read of x is
> > supposed to be complete before the release store becomes visible to any
> > other CPU.
> 
> Note that the release store isn't observed until it becomes "ready", so it
> is really indistinguishable whether it had become visible to any other 
> CPU.
> Indeed stores that aren't marked "ready" would be ignored during forwarding,
> and not allowed to be pushed to the cache.

Oops, I mixed up a couple of the accesses.  Okay, yes, this mechanism 
will allow writes that are po-after a release store but execute before 
it to evade the propagation restriction.

> The reason this kind of implementation seems less natural to me is that such
> a "not ready" store would need to be pushed back in the buffer (if it is
> the head of the buffer and the cache is ready to take a store), stall the
> later stores, or be aborted until it becomes ready.
> That just seems to create a lot of hassle for no discernible benefit.
> A "not ready" store probably shouldn't be put into a store queue, even if
> the only reason it is not ready is that there are some otherwise unrelated
> reads that haven't completed yet.
> 
> 
> 
> > This is true even in C11.
> 
> Arguable... The following pseudo-code litmus test should demonstrate this:
> 
> P0 {
>    int r = read_relaxed(&x);
>    store_release(&y,1);
> }
> 
> 
> P1 {
>    int s = read_relaxed(&y);
>    store_release(&x,1);
> }
> 
> In C11, it should be possible to read r==s==1.

True, in C11 releases don't mean anything unless they're paired with 
acquires.  But if your P1 had been

	int s = read_acquire(&y);
	write_relaxed(&x, 1);

then r = s = 1 would not be allowed.  And presumably the same object 
code would be generated for P0 either way, particularly if P1 was in a 
separate compilation unit (link-time optimization notwithstanding).
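
In the stdatomic terms of the earlier C11 sketch (same illustrative names),
the variant Alan describes would be:

static int t1(void *arg)
{
	(void)arg;
	s = atomic_load_explicit(&y, memory_order_acquire);
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	return 0;
}

With the acquire load, s == 1 means the load reads from t0's release store
of y, so t0's relaxed load of x happens before this store to x, and
r == s == 1 is forbidden, as Alan says.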

> Btw, how to proceed for your SRCU patch and this one?
> Are you planning to make any changes? I think the version you have is ok if
> you don't think the patch is improved by anything I brought up.

I don't see any need to change the SRCU patch at this point, other than 
to improve the attributions.

> Any additional concerns/changes for this patch?

It should give the same data-race diagnostics as the current LKMM.  This 
probably means the patch will need to punch up the definitions of 
*-pre-bounded and *-post-bounded, unless you can think of a better 
approach.

Alan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
  2023-01-31 16:55                       ` Alan Stern
@ 2023-02-01 10:37                         ` Jonas Oberhauser
  0 siblings, 0 replies; 39+ messages in thread
From: Jonas Oberhauser @ 2023-02-01 10:37 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andrea Parri, paulmck, will, peterz, boqun.feng, npiggin,
	dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
	quic_neeraju, frederic, linux-kernel



On 1/31/2023 5:55 PM, Alan Stern wrote:
> On Tue, Jan 31, 2023 at 04:33:25PM +0100, Jonas Oberhauser wrote:
>>
>> More precisely, I would change
>>
>>> 	For each other CPU C', any store which propagates to C before
>>> 	a release fence is executed (including all po-earlier
>>> 	stores executed on C) is forced to propagate to C' before the
>>> 	store associated with the release fence does.
>> Into something like
>>
>>
>>       For each other CPU C', any *external* store which propagates to C
>> before
>>       a release fence is executed as well as any po-earlier
>>       store executed on C is forced to propagate to C' before the
>>       store associated with the release fence does.
>>
>> The difference is that po-later stores that happen to propagate to C before
>> the release fence is executed would no longer be ordered.
>> That should be consistent with the axiomatic model.
> I had to check that it wouldn't affect the (xbstar & int) part of vis,
> but it looks all right.  This seems like a reasonable change.
>
> However, it only fixes part of the problem.  Suppose an external write
> is read by an instruction po-after the release-store, but the read
> executes before the release-store.  The operational model would still
> say the external write has to obey the propagation ordering, whereas the
> formal model doesn't require it.


Ouch. I had only thought about the [W];(xbstar & int);[Release] case, 
but there's also the rfe;(xbstar & int);[Release] case....


>> Any additional concerns/changes for this patch?
> It should give the same data-race diagnostics as the current LKMM.  This
> probably means the patch will need to punch up the definitions of
> *-pre-bounded and *-post-bounded, unless you can think of a better
> approach.
>
> Alan

Seems the one-thing-per-patch mentality hasn't yet become ingrained in me. 
Thanks!

jonas


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2023-02-01 10:39 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-26 13:46 [PATCH v2 0/2] Streamlining treatment of smp_mb__after_unlock_lock Jonas Oberhauser
2023-01-26 13:46 ` [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po Jonas Oberhauser
2023-01-26 16:36   ` Alan Stern
2023-01-26 20:08     ` Paul E. McKenney
2023-01-26 23:21       ` Paul E. McKenney
2023-01-27 13:18         ` Jonas Oberhauser
2023-01-27 15:13           ` Paul E. McKenney
2023-01-27 15:57             ` Jonas Oberhauser
2023-01-27 16:48               ` Paul E. McKenney
2023-01-26 13:46 ` [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po Jonas Oberhauser
2023-01-26 16:36   ` Alan Stern
2023-01-27 14:31     ` Jonas Oberhauser
2023-01-28 19:56       ` Alan Stern
2023-01-28 22:14         ` Andrea Parri
2023-01-28 22:21           ` Andrea Parri
2023-01-28 22:59           ` Alan Stern
2023-01-29  5:17             ` Paul E. McKenney
2023-01-29 16:03               ` Alan Stern
2023-01-29 16:21                 ` Paul E. McKenney
2023-01-29 17:28                   ` Andrea Parri
2023-01-29 18:44                     ` Paul E. McKenney
2023-01-29 21:43                       ` Boqun Feng
2023-01-29 23:09                         ` Paul E. McKenney
2023-01-30  2:18                           ` Alan Stern
2023-01-30  4:43                             ` Paul E. McKenney
2023-01-29 19:17                   ` Paul E. McKenney
2023-01-29 17:11             ` Andrea Parri
2023-01-29 22:10               ` Alan Stern
2023-01-29 22:19             ` Jonas Oberhauser
2023-01-30  2:39               ` Alan Stern
2023-01-30  4:36                 ` Paul E. McKenney
2023-01-30 16:47                   ` Alan Stern
2023-01-30 16:50                     ` Paul E. McKenney
2023-01-31 13:56                 ` Jonas Oberhauser
2023-01-31 15:06                   ` Alan Stern
2023-01-31 15:33                     ` Jonas Oberhauser
2023-01-31 16:55                       ` Alan Stern
2023-02-01 10:37                         ` Jonas Oberhauser
2023-01-30  4:46               ` Paul E. McKenney
