All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
@ 2018-10-15 23:04 Akira Yokosawa
  2018-10-17 15:10 ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Akira Yokosawa @ 2018-10-15 23:04 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

From 7b01fc0f19cfa010536d7eb53e4d0cda1e6b801f Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Mon, 15 Oct 2018 23:46:52 +0900
Subject: RFC [PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()

This message-passing pattern requires smp_wmb()--smp_rmb() pairing.

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
Hi Paul,

I'm not sure this addition of memory barriers is actually required,
but it does look like so.

And I'm aware that you have avoided using weaker memory barriers in
CodeSamples.

Thoughts?

        Thanks, Akira
--
 CodeSamples/arch-arm/arch-arm.h     |  2 ++
 CodeSamples/arch-arm64/arch-arm64.h |  2 ++
 CodeSamples/arch-ppc64/arch-ppc64.h |  2 ++
 CodeSamples/arch-x86/arch-x86.h     |  2 ++
 CodeSamples/count/count_lim_sig.c   | 21 +++++++++++++--------
 5 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/CodeSamples/arch-arm/arch-arm.h b/CodeSamples/arch-arm/arch-arm.h
index 065c6f1..6f0707b 100644
--- a/CodeSamples/arch-arm/arch-arm.h
+++ b/CodeSamples/arch-arm/arch-arm.h
@@ -41,6 +41,8 @@
 /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
 #define smp_mb()  __asm__ __volatile__("dmb" : : : "memory")
 
+#define smp_rmb() __asm__ __volatile__("dmb ish" : : : "memory")
+#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
 
 #include <stdlib.h>
 #include <sys/time.h>
diff --git a/CodeSamples/arch-arm64/arch-arm64.h b/CodeSamples/arch-arm64/arch-arm64.h
index 354f1f2..a6ccf33 100644
--- a/CodeSamples/arch-arm64/arch-arm64.h
+++ b/CodeSamples/arch-arm64/arch-arm64.h
@@ -41,6 +41,8 @@
 /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
 #define smp_mb()  __asm__ __volatile__("dmb ish" : : : "memory")
 
+#define smp_rmb() __asm__ __volatile__("dmb ishld" : : : "memory")
+#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
 
 #include <stdlib.h>
 #include <time.h>
diff --git a/CodeSamples/arch-ppc64/arch-ppc64.h b/CodeSamples/arch-ppc64/arch-ppc64.h
index 7b0b025..2d6a2b5 100644
--- a/CodeSamples/arch-ppc64/arch-ppc64.h
+++ b/CodeSamples/arch-ppc64/arch-ppc64.h
@@ -42,6 +42,8 @@
 
 #define smp_mb()  __asm__ __volatile__("sync" : : : "memory")
 
+#define smp_rmb() __asm__ __volatile__("lwsync" : : : "memory")
+#define smp_wmb() __asm__ __volatile__("lwsync" : : : "memory")
 
 /*
  * Generate 64-bit timestamp.
diff --git a/CodeSamples/arch-x86/arch-x86.h b/CodeSamples/arch-x86/arch-x86.h
index 9ea97ca..2765bfc 100644
--- a/CodeSamples/arch-x86/arch-x86.h
+++ b/CodeSamples/arch-x86/arch-x86.h
@@ -52,6 +52,8 @@ __asm__ __volatile__(LOCK_PREFIX "orl %0,%1" \
 __asm__ __volatile__("mfence" : : : "memory")
 /* __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory") */
 
+#define smp_rmb() barrier()
+#define smp_wmb() barrier()
 
 /*
  * Generate 64-bit timestamp.
diff --git a/CodeSamples/count/count_lim_sig.c b/CodeSamples/count/count_lim_sig.c
index c316426..26a2a76 100644
--- a/CodeSamples/count/count_lim_sig.c
+++ b/CodeSamples/count/count_lim_sig.c
@@ -89,6 +89,7 @@ static void flush_local_count(void)			//\lnlbl{flush:b}
 		*counterp[t] = 0;
 		globalreserve -= *countermaxp[t];
 		*countermaxp[t] = 0;			//\lnlbl{flush:thiev:e}
+		smp_wmb();				//\lnlbl{flush:wmb}
 		WRITE_ONCE(*theftp[t], THEFT_IDLE);	//\lnlbl{flush:IDLE}
 	}						//\lnlbl{flush:loop2:e}
 }							//\lnlbl{flush:e}
@@ -115,10 +116,12 @@ int add_count(unsigned long delta)			//\lnlbl{b}
 
 	WRITE_ONCE(counting, 1);			//\lnlbl{fast:b}
 	barrier();					//\lnlbl{barrier:1}
-	if (READ_ONCE(theft) <= THEFT_REQ &&		//\lnlbl{check:b}
-	    countermax - counter >= delta) {		//\lnlbl{check:e}
-		WRITE_ONCE(counter, counter + delta);	//\lnlbl{add:f}
-		fastpath = 1;				//\lnlbl{fasttaken}
+	if (READ_ONCE(theft) <= THEFT_REQ) {		//\lnlbl{check:b}
+		smp_rmb();				//\lnlbl{rmb}
+		if (countermax - counter >= delta) {	//\lnlbl{check:e}
+			WRITE_ONCE(counter, counter + delta);//\lnlbl{add:f}
+			fastpath = 1;			//\lnlbl{fasttaken}
+		}
 	}
 	barrier();					//\lnlbl{barrier:2}
 	WRITE_ONCE(counting, 0);			//\lnlbl{clearcnt}
@@ -154,10 +157,12 @@ int sub_count(unsigned long delta)
 
 	WRITE_ONCE(counting, 1);
 	barrier();
-	if (READ_ONCE(theft) <= THEFT_REQ &&
-	    counter >= delta) {
-		WRITE_ONCE(counter, counter - delta);
-		fastpath = 1;
+	if (READ_ONCE(theft) <= THEFT_REQ) {
+		smp_rmb();
+		if (counter >= delta) {
+			WRITE_ONCE(counter, counter - delta);
+			fastpath = 1;
+		}
 	}
 	barrier();
 	WRITE_ONCE(counting, 0);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
  2018-10-15 23:04 [RFC PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb() Akira Yokosawa
@ 2018-10-17 15:10 ` Paul E. McKenney
  2018-10-17 22:21   ` Akira Yokosawa
  0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2018-10-17 15:10 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Tue, Oct 16, 2018 at 08:04:00AM +0900, Akira Yokosawa wrote:
> >From 7b01fc0f19cfa010536d7eb53e4d0cda1e6b801f Mon Sep 17 00:00:00 2001
> From: Akira Yokosawa <akiyks@gmail.com>
> Date: Mon, 15 Oct 2018 23:46:52 +0900
> Subject: RFC [PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
> 
> This message-passing pattern requires smp_wmb()--smp_rmb() pairing.
> 
> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> ---
> Hi Paul,
> 
> I'm not sure this addition of memory barriers is actually required,
> but it does look like so.
> 
> And I'm aware that you have avoided using weaker memory barriers in
> CodeSamples.
> 
> Thoughts?

Hello, Akira,

I might be missing something, but it looks to me like this ordering is
covered by heavyweight ordering in the signal handler entry/exit and
the gblcnt_mutex.  So what sequence of events leads to the failiure
scenario that you are seeing?

							Thanx, Paul

>         Thanks, Akira
> --
>  CodeSamples/arch-arm/arch-arm.h     |  2 ++
>  CodeSamples/arch-arm64/arch-arm64.h |  2 ++
>  CodeSamples/arch-ppc64/arch-ppc64.h |  2 ++
>  CodeSamples/arch-x86/arch-x86.h     |  2 ++
>  CodeSamples/count/count_lim_sig.c   | 21 +++++++++++++--------
>  5 files changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/CodeSamples/arch-arm/arch-arm.h b/CodeSamples/arch-arm/arch-arm.h
> index 065c6f1..6f0707b 100644
> --- a/CodeSamples/arch-arm/arch-arm.h
> +++ b/CodeSamples/arch-arm/arch-arm.h
> @@ -41,6 +41,8 @@
>  /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
>  #define smp_mb()  __asm__ __volatile__("dmb" : : : "memory")
>  
> +#define smp_rmb() __asm__ __volatile__("dmb ish" : : : "memory")
> +#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
>  
>  #include <stdlib.h>
>  #include <sys/time.h>
> diff --git a/CodeSamples/arch-arm64/arch-arm64.h b/CodeSamples/arch-arm64/arch-arm64.h
> index 354f1f2..a6ccf33 100644
> --- a/CodeSamples/arch-arm64/arch-arm64.h
> +++ b/CodeSamples/arch-arm64/arch-arm64.h
> @@ -41,6 +41,8 @@
>  /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
>  #define smp_mb()  __asm__ __volatile__("dmb ish" : : : "memory")
>  
> +#define smp_rmb() __asm__ __volatile__("dmb ishld" : : : "memory")
> +#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
>  
>  #include <stdlib.h>
>  #include <time.h>
> diff --git a/CodeSamples/arch-ppc64/arch-ppc64.h b/CodeSamples/arch-ppc64/arch-ppc64.h
> index 7b0b025..2d6a2b5 100644
> --- a/CodeSamples/arch-ppc64/arch-ppc64.h
> +++ b/CodeSamples/arch-ppc64/arch-ppc64.h
> @@ -42,6 +42,8 @@
>  
>  #define smp_mb()  __asm__ __volatile__("sync" : : : "memory")
>  
> +#define smp_rmb() __asm__ __volatile__("lwsync" : : : "memory")
> +#define smp_wmb() __asm__ __volatile__("lwsync" : : : "memory")
>  
>  /*
>   * Generate 64-bit timestamp.
> diff --git a/CodeSamples/arch-x86/arch-x86.h b/CodeSamples/arch-x86/arch-x86.h
> index 9ea97ca..2765bfc 100644
> --- a/CodeSamples/arch-x86/arch-x86.h
> +++ b/CodeSamples/arch-x86/arch-x86.h
> @@ -52,6 +52,8 @@ __asm__ __volatile__(LOCK_PREFIX "orl %0,%1" \
>  __asm__ __volatile__("mfence" : : : "memory")
>  /* __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory") */
>  
> +#define smp_rmb() barrier()
> +#define smp_wmb() barrier()
>  
>  /*
>   * Generate 64-bit timestamp.
> diff --git a/CodeSamples/count/count_lim_sig.c b/CodeSamples/count/count_lim_sig.c
> index c316426..26a2a76 100644
> --- a/CodeSamples/count/count_lim_sig.c
> +++ b/CodeSamples/count/count_lim_sig.c
> @@ -89,6 +89,7 @@ static void flush_local_count(void)			//\lnlbl{flush:b}
>  		*counterp[t] = 0;
>  		globalreserve -= *countermaxp[t];
>  		*countermaxp[t] = 0;			//\lnlbl{flush:thiev:e}
> +		smp_wmb();				//\lnlbl{flush:wmb}
>  		WRITE_ONCE(*theftp[t], THEFT_IDLE);	//\lnlbl{flush:IDLE}
>  	}						//\lnlbl{flush:loop2:e}
>  }							//\lnlbl{flush:e}
> @@ -115,10 +116,12 @@ int add_count(unsigned long delta)			//\lnlbl{b}
>  
>  	WRITE_ONCE(counting, 1);			//\lnlbl{fast:b}
>  	barrier();					//\lnlbl{barrier:1}
> -	if (READ_ONCE(theft) <= THEFT_REQ &&		//\lnlbl{check:b}
> -	    countermax - counter >= delta) {		//\lnlbl{check:e}
> -		WRITE_ONCE(counter, counter + delta);	//\lnlbl{add:f}
> -		fastpath = 1;				//\lnlbl{fasttaken}
> +	if (READ_ONCE(theft) <= THEFT_REQ) {		//\lnlbl{check:b}
> +		smp_rmb();				//\lnlbl{rmb}
> +		if (countermax - counter >= delta) {	//\lnlbl{check:e}
> +			WRITE_ONCE(counter, counter + delta);//\lnlbl{add:f}
> +			fastpath = 1;			//\lnlbl{fasttaken}
> +		}
>  	}
>  	barrier();					//\lnlbl{barrier:2}
>  	WRITE_ONCE(counting, 0);			//\lnlbl{clearcnt}
> @@ -154,10 +157,12 @@ int sub_count(unsigned long delta)
>  
>  	WRITE_ONCE(counting, 1);
>  	barrier();
> -	if (READ_ONCE(theft) <= THEFT_REQ &&
> -	    counter >= delta) {
> -		WRITE_ONCE(counter, counter - delta);
> -		fastpath = 1;
> +	if (READ_ONCE(theft) <= THEFT_REQ) {
> +		smp_rmb();
> +		if (counter >= delta) {
> +			WRITE_ONCE(counter, counter - delta);
> +			fastpath = 1;
> +		}
>  	}
>  	barrier();
>  	WRITE_ONCE(counting, 0);
> -- 
> 2.7.4
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
  2018-10-17 15:10 ` Paul E. McKenney
@ 2018-10-17 22:21   ` Akira Yokosawa
  2018-10-18  0:37     ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Akira Yokosawa @ 2018-10-17 22:21 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On 2018/10/17 08:10:52 -0700, Paul E. McKenney wrote:
> On Tue, Oct 16, 2018 at 08:04:00AM +0900, Akira Yokosawa wrote:
>> >From 7b01fc0f19cfa010536d7eb53e4d0cda1e6b801f Mon Sep 17 00:00:00 2001
>> From: Akira Yokosawa <akiyks@gmail.com>
>> Date: Mon, 15 Oct 2018 23:46:52 +0900
>> Subject: RFC [PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
>>
>> This message-passing pattern requires smp_wmb()--smp_rmb() pairing.
>>
>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
>> ---
>> Hi Paul,
>>
>> I'm not sure this addition of memory barriers is actually required,
>> but it does look like so.
>>
>> And I'm aware that you have avoided using weaker memory barriers in
>> CodeSamples.
>>
>> Thoughts?
> 
> Hello, Akira,
> 
> I might be missing something, but it looks to me like this ordering is
> covered by heavyweight ordering in the signal handler entry/exit and
> the gblcnt_mutex.  So what sequence of events leads to the failiure
> scenario that you are seeing?

So the fastpaths in add_count() and sub_count() are not protected by
glbcnt_mutex.  The slowpath in flush_local_count() waits the transition
of theft from REQ to READY, clears counter and countermax, and finally
assign IDLE to theft.

So, the fastpaths can see (theft == IDLE) but see a non-zero value of
counter or countermax, can't they?

One theory to prevent this from happening is because all the per-thread
variables of a thread reside in a single cache line, and if the fastpaths
see the updated value of theft, they are guaranteed to see the latest
values of both counter and countermax.

I might be completely missing something, though.

        Thanks, Akira 

> 
> 							Thanx, Paul
> 
>>         Thanks, Akira
>> --
>>  CodeSamples/arch-arm/arch-arm.h     |  2 ++
>>  CodeSamples/arch-arm64/arch-arm64.h |  2 ++
>>  CodeSamples/arch-ppc64/arch-ppc64.h |  2 ++
>>  CodeSamples/arch-x86/arch-x86.h     |  2 ++
>>  CodeSamples/count/count_lim_sig.c   | 21 +++++++++++++--------
>>  5 files changed, 21 insertions(+), 8 deletions(-)
>>
>> diff --git a/CodeSamples/arch-arm/arch-arm.h b/CodeSamples/arch-arm/arch-arm.h
>> index 065c6f1..6f0707b 100644
>> --- a/CodeSamples/arch-arm/arch-arm.h
>> +++ b/CodeSamples/arch-arm/arch-arm.h
>> @@ -41,6 +41,8 @@
>>  /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
>>  #define smp_mb()  __asm__ __volatile__("dmb" : : : "memory")
>>  
>> +#define smp_rmb() __asm__ __volatile__("dmb ish" : : : "memory")
>> +#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
>>  
>>  #include <stdlib.h>
>>  #include <sys/time.h>
>> diff --git a/CodeSamples/arch-arm64/arch-arm64.h b/CodeSamples/arch-arm64/arch-arm64.h
>> index 354f1f2..a6ccf33 100644
>> --- a/CodeSamples/arch-arm64/arch-arm64.h
>> +++ b/CodeSamples/arch-arm64/arch-arm64.h
>> @@ -41,6 +41,8 @@
>>  /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
>>  #define smp_mb()  __asm__ __volatile__("dmb ish" : : : "memory")
>>  
>> +#define smp_rmb() __asm__ __volatile__("dmb ishld" : : : "memory")
>> +#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
>>  
>>  #include <stdlib.h>
>>  #include <time.h>
>> diff --git a/CodeSamples/arch-ppc64/arch-ppc64.h b/CodeSamples/arch-ppc64/arch-ppc64.h
>> index 7b0b025..2d6a2b5 100644
>> --- a/CodeSamples/arch-ppc64/arch-ppc64.h
>> +++ b/CodeSamples/arch-ppc64/arch-ppc64.h
>> @@ -42,6 +42,8 @@
>>  
>>  #define smp_mb()  __asm__ __volatile__("sync" : : : "memory")
>>  
>> +#define smp_rmb() __asm__ __volatile__("lwsync" : : : "memory")
>> +#define smp_wmb() __asm__ __volatile__("lwsync" : : : "memory")
>>  
>>  /*
>>   * Generate 64-bit timestamp.
>> diff --git a/CodeSamples/arch-x86/arch-x86.h b/CodeSamples/arch-x86/arch-x86.h
>> index 9ea97ca..2765bfc 100644
>> --- a/CodeSamples/arch-x86/arch-x86.h
>> +++ b/CodeSamples/arch-x86/arch-x86.h
>> @@ -52,6 +52,8 @@ __asm__ __volatile__(LOCK_PREFIX "orl %0,%1" \
>>  __asm__ __volatile__("mfence" : : : "memory")
>>  /* __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory") */
>>  
>> +#define smp_rmb() barrier()
>> +#define smp_wmb() barrier()
>>  
>>  /*
>>   * Generate 64-bit timestamp.
>> diff --git a/CodeSamples/count/count_lim_sig.c b/CodeSamples/count/count_lim_sig.c
>> index c316426..26a2a76 100644
>> --- a/CodeSamples/count/count_lim_sig.c
>> +++ b/CodeSamples/count/count_lim_sig.c
>> @@ -89,6 +89,7 @@ static void flush_local_count(void)			//\lnlbl{flush:b}
>>  		*counterp[t] = 0;
>>  		globalreserve -= *countermaxp[t];
>>  		*countermaxp[t] = 0;			//\lnlbl{flush:thiev:e}
>> +		smp_wmb();				//\lnlbl{flush:wmb}
>>  		WRITE_ONCE(*theftp[t], THEFT_IDLE);	//\lnlbl{flush:IDLE}
>>  	}						//\lnlbl{flush:loop2:e}
>>  }							//\lnlbl{flush:e}
>> @@ -115,10 +116,12 @@ int add_count(unsigned long delta)			//\lnlbl{b}
>>  
>>  	WRITE_ONCE(counting, 1);			//\lnlbl{fast:b}
>>  	barrier();					//\lnlbl{barrier:1}
>> -	if (READ_ONCE(theft) <= THEFT_REQ &&		//\lnlbl{check:b}
>> -	    countermax - counter >= delta) {		//\lnlbl{check:e}
>> -		WRITE_ONCE(counter, counter + delta);	//\lnlbl{add:f}
>> -		fastpath = 1;				//\lnlbl{fasttaken}
>> +	if (READ_ONCE(theft) <= THEFT_REQ) {		//\lnlbl{check:b}
>> +		smp_rmb();				//\lnlbl{rmb}
>> +		if (countermax - counter >= delta) {	//\lnlbl{check:e}
>> +			WRITE_ONCE(counter, counter + delta);//\lnlbl{add:f}
>> +			fastpath = 1;			//\lnlbl{fasttaken}
>> +		}
>>  	}
>>  	barrier();					//\lnlbl{barrier:2}
>>  	WRITE_ONCE(counting, 0);			//\lnlbl{clearcnt}
>> @@ -154,10 +157,12 @@ int sub_count(unsigned long delta)
>>  
>>  	WRITE_ONCE(counting, 1);
>>  	barrier();
>> -	if (READ_ONCE(theft) <= THEFT_REQ &&
>> -	    counter >= delta) {
>> -		WRITE_ONCE(counter, counter - delta);
>> -		fastpath = 1;
>> +	if (READ_ONCE(theft) <= THEFT_REQ) {
>> +		smp_rmb();
>> +		if (counter >= delta) {
>> +			WRITE_ONCE(counter, counter - delta);
>> +			fastpath = 1;
>> +		}
>>  	}
>>  	barrier();
>>  	WRITE_ONCE(counting, 0);
>> -- 
>> 2.7.4
>>
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
  2018-10-17 22:21   ` Akira Yokosawa
@ 2018-10-18  0:37     ` Paul E. McKenney
  2018-10-18 13:03       ` Akira Yokosawa
  0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2018-10-18  0:37 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Thu, Oct 18, 2018 at 07:21:38AM +0900, Akira Yokosawa wrote:
> On 2018/10/17 08:10:52 -0700, Paul E. McKenney wrote:
> > On Tue, Oct 16, 2018 at 08:04:00AM +0900, Akira Yokosawa wrote:
> >> >From 7b01fc0f19cfa010536d7eb53e4d0cda1e6b801f Mon Sep 17 00:00:00 2001
> >> From: Akira Yokosawa <akiyks@gmail.com>
> >> Date: Mon, 15 Oct 2018 23:46:52 +0900
> >> Subject: RFC [PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
> >>
> >> This message-passing pattern requires smp_wmb()--smp_rmb() pairing.
> >>
> >> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> >> ---
> >> Hi Paul,
> >>
> >> I'm not sure this addition of memory barriers is actually required,
> >> but it does look like so.
> >>
> >> And I'm aware that you have avoided using weaker memory barriers in
> >> CodeSamples.
> >>
> >> Thoughts?
> > 
> > Hello, Akira,
> > 
> > I might be missing something, but it looks to me like this ordering is
> > covered by heavyweight ordering in the signal handler entry/exit and
> > the gblcnt_mutex.  So what sequence of events leads to the failiure
> > scenario that you are seeing?
> 
> So the fastpaths in add_count() and sub_count() are not protected by
> glbcnt_mutex.  The slowpath in flush_local_count() waits the transition
> of theft from REQ to READY, clears counter and countermax, and finally
> assign IDLE to theft.
> 
> So, the fastpaths can see (theft == IDLE) but see a non-zero value of
> counter or countermax, can't they?

Maybe, maybe not.  Please lay out a sequence of events showing a problem,
as in load by load, store by store, line by line.  Intuition isn't as
helpful as one might like for this kind of stuff.  ;-)

> One theory to prevent this from happening is because all the per-thread
> variables of a thread reside in a single cache line, and if the fastpaths
> see the updated value of theft, they are guaranteed to see the latest
> values of both counter and countermax.

Good point, but we need to avoid that sort of assumption unless we
placed the variables into a struct and told the compiler to align it
appropriately.  And even then, hardware architectures normally don't
make this sort of guarantee.  There is too much that can go wrong, from
ECC errors to interrupts at just the wrong time, and much else besides.

							Thanx, Paul

> I might be completely missing something, though.
> 
>         Thanks, Akira 
> 
> > 
> > 							Thanx, Paul
> > 
> >>         Thanks, Akira
> >> --
> >>  CodeSamples/arch-arm/arch-arm.h     |  2 ++
> >>  CodeSamples/arch-arm64/arch-arm64.h |  2 ++
> >>  CodeSamples/arch-ppc64/arch-ppc64.h |  2 ++
> >>  CodeSamples/arch-x86/arch-x86.h     |  2 ++
> >>  CodeSamples/count/count_lim_sig.c   | 21 +++++++++++++--------
> >>  5 files changed, 21 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/CodeSamples/arch-arm/arch-arm.h b/CodeSamples/arch-arm/arch-arm.h
> >> index 065c6f1..6f0707b 100644
> >> --- a/CodeSamples/arch-arm/arch-arm.h
> >> +++ b/CodeSamples/arch-arm/arch-arm.h
> >> @@ -41,6 +41,8 @@
> >>  /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
> >>  #define smp_mb()  __asm__ __volatile__("dmb" : : : "memory")
> >>  
> >> +#define smp_rmb() __asm__ __volatile__("dmb ish" : : : "memory")
> >> +#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
> >>  
> >>  #include <stdlib.h>
> >>  #include <sys/time.h>
> >> diff --git a/CodeSamples/arch-arm64/arch-arm64.h b/CodeSamples/arch-arm64/arch-arm64.h
> >> index 354f1f2..a6ccf33 100644
> >> --- a/CodeSamples/arch-arm64/arch-arm64.h
> >> +++ b/CodeSamples/arch-arm64/arch-arm64.h
> >> @@ -41,6 +41,8 @@
> >>  /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
> >>  #define smp_mb()  __asm__ __volatile__("dmb ish" : : : "memory")
> >>  
> >> +#define smp_rmb() __asm__ __volatile__("dmb ishld" : : : "memory")
> >> +#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
> >>  
> >>  #include <stdlib.h>
> >>  #include <time.h>
> >> diff --git a/CodeSamples/arch-ppc64/arch-ppc64.h b/CodeSamples/arch-ppc64/arch-ppc64.h
> >> index 7b0b025..2d6a2b5 100644
> >> --- a/CodeSamples/arch-ppc64/arch-ppc64.h
> >> +++ b/CodeSamples/arch-ppc64/arch-ppc64.h
> >> @@ -42,6 +42,8 @@
> >>  
> >>  #define smp_mb()  __asm__ __volatile__("sync" : : : "memory")
> >>  
> >> +#define smp_rmb() __asm__ __volatile__("lwsync" : : : "memory")
> >> +#define smp_wmb() __asm__ __volatile__("lwsync" : : : "memory")
> >>  
> >>  /*
> >>   * Generate 64-bit timestamp.
> >> diff --git a/CodeSamples/arch-x86/arch-x86.h b/CodeSamples/arch-x86/arch-x86.h
> >> index 9ea97ca..2765bfc 100644
> >> --- a/CodeSamples/arch-x86/arch-x86.h
> >> +++ b/CodeSamples/arch-x86/arch-x86.h
> >> @@ -52,6 +52,8 @@ __asm__ __volatile__(LOCK_PREFIX "orl %0,%1" \
> >>  __asm__ __volatile__("mfence" : : : "memory")
> >>  /* __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory") */
> >>  
> >> +#define smp_rmb() barrier()
> >> +#define smp_wmb() barrier()
> >>  
> >>  /*
> >>   * Generate 64-bit timestamp.
> >> diff --git a/CodeSamples/count/count_lim_sig.c b/CodeSamples/count/count_lim_sig.c
> >> index c316426..26a2a76 100644
> >> --- a/CodeSamples/count/count_lim_sig.c
> >> +++ b/CodeSamples/count/count_lim_sig.c
> >> @@ -89,6 +89,7 @@ static void flush_local_count(void)			//\lnlbl{flush:b}
> >>  		*counterp[t] = 0;
> >>  		globalreserve -= *countermaxp[t];
> >>  		*countermaxp[t] = 0;			//\lnlbl{flush:thiev:e}
> >> +		smp_wmb();				//\lnlbl{flush:wmb}
> >>  		WRITE_ONCE(*theftp[t], THEFT_IDLE);	//\lnlbl{flush:IDLE}
> >>  	}						//\lnlbl{flush:loop2:e}
> >>  }							//\lnlbl{flush:e}
> >> @@ -115,10 +116,12 @@ int add_count(unsigned long delta)			//\lnlbl{b}
> >>  
> >>  	WRITE_ONCE(counting, 1);			//\lnlbl{fast:b}
> >>  	barrier();					//\lnlbl{barrier:1}
> >> -	if (READ_ONCE(theft) <= THEFT_REQ &&		//\lnlbl{check:b}
> >> -	    countermax - counter >= delta) {		//\lnlbl{check:e}
> >> -		WRITE_ONCE(counter, counter + delta);	//\lnlbl{add:f}
> >> -		fastpath = 1;				//\lnlbl{fasttaken}
> >> +	if (READ_ONCE(theft) <= THEFT_REQ) {		//\lnlbl{check:b}
> >> +		smp_rmb();				//\lnlbl{rmb}
> >> +		if (countermax - counter >= delta) {	//\lnlbl{check:e}
> >> +			WRITE_ONCE(counter, counter + delta);//\lnlbl{add:f}
> >> +			fastpath = 1;			//\lnlbl{fasttaken}
> >> +		}
> >>  	}
> >>  	barrier();					//\lnlbl{barrier:2}
> >>  	WRITE_ONCE(counting, 0);			//\lnlbl{clearcnt}
> >> @@ -154,10 +157,12 @@ int sub_count(unsigned long delta)
> >>  
> >>  	WRITE_ONCE(counting, 1);
> >>  	barrier();
> >> -	if (READ_ONCE(theft) <= THEFT_REQ &&
> >> -	    counter >= delta) {
> >> -		WRITE_ONCE(counter, counter - delta);
> >> -		fastpath = 1;
> >> +	if (READ_ONCE(theft) <= THEFT_REQ) {
> >> +		smp_rmb();
> >> +		if (counter >= delta) {
> >> +			WRITE_ONCE(counter, counter - delta);
> >> +			fastpath = 1;
> >> +		}
> >>  	}
> >>  	barrier();
> >>  	WRITE_ONCE(counting, 0);
> >> -- 
> >> 2.7.4
> >>
> > 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
  2018-10-18  0:37     ` Paul E. McKenney
@ 2018-10-18 13:03       ` Akira Yokosawa
  2018-10-18 15:15         ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Akira Yokosawa @ 2018-10-18 13:03 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On 2018/10/17 17:37:39 -0700, Paul E. McKenney wrote:
> On Thu, Oct 18, 2018 at 07:21:38AM +0900, Akira Yokosawa wrote:
>> On 2018/10/17 08:10:52 -0700, Paul E. McKenney wrote:
>>> On Tue, Oct 16, 2018 at 08:04:00AM +0900, Akira Yokosawa wrote:
>>>> >From 7b01fc0f19cfa010536d7eb53e4d0cda1e6b801f Mon Sep 17 00:00:00 2001
>>>> From: Akira Yokosawa <akiyks@gmail.com>
>>>> Date: Mon, 15 Oct 2018 23:46:52 +0900
>>>> Subject: RFC [PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
>>>>
>>>> This message-passing pattern requires smp_wmb()--smp_rmb() pairing.
>>>>
>>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
>>>> ---
>>>> Hi Paul,
>>>>
>>>> I'm not sure this addition of memory barriers is actually required,
>>>> but it does look like so.
>>>>
>>>> And I'm aware that you have avoided using weaker memory barriers in
>>>> CodeSamples.
>>>>
>>>> Thoughts?
>>>
>>> Hello, Akira,
>>>
>>> I might be missing something, but it looks to me like this ordering is
>>> covered by heavyweight ordering in the signal handler entry/exit and
>>> the gblcnt_mutex.  So what sequence of events leads to the failiure
>>> scenario that you are seeing?
>>
>> So the fastpaths in add_count() and sub_count() are not protected by
>> glbcnt_mutex.  The slowpath in flush_local_count() waits the transition
>> of theft from REQ to READY, clears counter and countermax, and finally
>> assign IDLE to theft.
>>
>> So, the fastpaths can see (theft == IDLE) but see a non-zero value of
>> counter or countermax, can't they?
> 
> Maybe, maybe not.  Please lay out a sequence of events showing a problem,
> as in load by load, store by store, line by line.  Intuition isn't as
> helpful as one might like for this kind of stuff.  ;-)

Gotcha!

I've not exhausted the timing variations, but now I see when
split_local_count() sees (*theft@[t] == THEFT_READY), counter part of
add_count() or sub_count() has exited the fastpath (marked by
counting == 1).

So the race I imagined has never existed.

Thanks for your nice suggestion!

> 
>> One theory to prevent this from happening is because all the per-thread
>> variables of a thread reside in a single cache line, and if the fastpaths
>> see the updated value of theft, they are guaranteed to see the latest
>> values of both counter and countermax.
> 
> Good point, but we need to avoid that sort of assumption unless we
> placed the variables into a struct and told the compiler to align it
> appropriately.  And even then, hardware architectures normally don't
> make this sort of guarantee.  There is too much that can go wrong, from
> ECC errors to interrupts at just the wrong time, and much else besides.

Absolutely!

        Thanks, Akira

> 
> 							Thanx, Paul
> 
>> I might be completely missing something, though.
>>
>>         Thanks, Akira 
>>
>>>
>>> 							Thanx, Paul
>>>
>>>>         Thanks, Akira
>>>> --
>>>>  CodeSamples/arch-arm/arch-arm.h     |  2 ++
>>>>  CodeSamples/arch-arm64/arch-arm64.h |  2 ++
>>>>  CodeSamples/arch-ppc64/arch-ppc64.h |  2 ++
>>>>  CodeSamples/arch-x86/arch-x86.h     |  2 ++
>>>>  CodeSamples/count/count_lim_sig.c   | 21 +++++++++++++--------
>>>>  5 files changed, 21 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/CodeSamples/arch-arm/arch-arm.h b/CodeSamples/arch-arm/arch-arm.h
>>>> index 065c6f1..6f0707b 100644
>>>> --- a/CodeSamples/arch-arm/arch-arm.h
>>>> +++ b/CodeSamples/arch-arm/arch-arm.h
>>>> @@ -41,6 +41,8 @@
>>>>  /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
>>>>  #define smp_mb()  __asm__ __volatile__("dmb" : : : "memory")
>>>>  
>>>> +#define smp_rmb() __asm__ __volatile__("dmb ish" : : : "memory")
>>>> +#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
>>>>  
>>>>  #include <stdlib.h>
>>>>  #include <sys/time.h>
>>>> diff --git a/CodeSamples/arch-arm64/arch-arm64.h b/CodeSamples/arch-arm64/arch-arm64.h
>>>> index 354f1f2..a6ccf33 100644
>>>> --- a/CodeSamples/arch-arm64/arch-arm64.h
>>>> +++ b/CodeSamples/arch-arm64/arch-arm64.h
>>>> @@ -41,6 +41,8 @@
>>>>  /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
>>>>  #define smp_mb()  __asm__ __volatile__("dmb ish" : : : "memory")
>>>>  
>>>> +#define smp_rmb() __asm__ __volatile__("dmb ishld" : : : "memory")
>>>> +#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
>>>>  
>>>>  #include <stdlib.h>
>>>>  #include <time.h>
>>>> diff --git a/CodeSamples/arch-ppc64/arch-ppc64.h b/CodeSamples/arch-ppc64/arch-ppc64.h
>>>> index 7b0b025..2d6a2b5 100644
>>>> --- a/CodeSamples/arch-ppc64/arch-ppc64.h
>>>> +++ b/CodeSamples/arch-ppc64/arch-ppc64.h
>>>> @@ -42,6 +42,8 @@
>>>>  
>>>>  #define smp_mb()  __asm__ __volatile__("sync" : : : "memory")
>>>>  
>>>> +#define smp_rmb() __asm__ __volatile__("lwsync" : : : "memory")
>>>> +#define smp_wmb() __asm__ __volatile__("lwsync" : : : "memory")
>>>>  
>>>>  /*
>>>>   * Generate 64-bit timestamp.
>>>> diff --git a/CodeSamples/arch-x86/arch-x86.h b/CodeSamples/arch-x86/arch-x86.h
>>>> index 9ea97ca..2765bfc 100644
>>>> --- a/CodeSamples/arch-x86/arch-x86.h
>>>> +++ b/CodeSamples/arch-x86/arch-x86.h
>>>> @@ -52,6 +52,8 @@ __asm__ __volatile__(LOCK_PREFIX "orl %0,%1" \
>>>>  __asm__ __volatile__("mfence" : : : "memory")
>>>>  /* __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory") */
>>>>  
>>>> +#define smp_rmb() barrier()
>>>> +#define smp_wmb() barrier()
>>>>  
>>>>  /*
>>>>   * Generate 64-bit timestamp.
>>>> diff --git a/CodeSamples/count/count_lim_sig.c b/CodeSamples/count/count_lim_sig.c
>>>> index c316426..26a2a76 100644
>>>> --- a/CodeSamples/count/count_lim_sig.c
>>>> +++ b/CodeSamples/count/count_lim_sig.c
>>>> @@ -89,6 +89,7 @@ static void flush_local_count(void)			//\lnlbl{flush:b}
>>>>  		*counterp[t] = 0;
>>>>  		globalreserve -= *countermaxp[t];
>>>>  		*countermaxp[t] = 0;			//\lnlbl{flush:thiev:e}
>>>> +		smp_wmb();				//\lnlbl{flush:wmb}
>>>>  		WRITE_ONCE(*theftp[t], THEFT_IDLE);	//\lnlbl{flush:IDLE}
>>>>  	}						//\lnlbl{flush:loop2:e}
>>>>  }							//\lnlbl{flush:e}
>>>> @@ -115,10 +116,12 @@ int add_count(unsigned long delta)			//\lnlbl{b}
>>>>  
>>>>  	WRITE_ONCE(counting, 1);			//\lnlbl{fast:b}
>>>>  	barrier();					//\lnlbl{barrier:1}
>>>> -	if (READ_ONCE(theft) <= THEFT_REQ &&		//\lnlbl{check:b}
>>>> -	    countermax - counter >= delta) {		//\lnlbl{check:e}
>>>> -		WRITE_ONCE(counter, counter + delta);	//\lnlbl{add:f}
>>>> -		fastpath = 1;				//\lnlbl{fasttaken}
>>>> +	if (READ_ONCE(theft) <= THEFT_REQ) {		//\lnlbl{check:b}
>>>> +		smp_rmb();				//\lnlbl{rmb}
>>>> +		if (countermax - counter >= delta) {	//\lnlbl{check:e}
>>>> +			WRITE_ONCE(counter, counter + delta);//\lnlbl{add:f}
>>>> +			fastpath = 1;			//\lnlbl{fasttaken}
>>>> +		}
>>>>  	}
>>>>  	barrier();					//\lnlbl{barrier:2}
>>>>  	WRITE_ONCE(counting, 0);			//\lnlbl{clearcnt}
>>>> @@ -154,10 +157,12 @@ int sub_count(unsigned long delta)
>>>>  
>>>>  	WRITE_ONCE(counting, 1);
>>>>  	barrier();
>>>> -	if (READ_ONCE(theft) <= THEFT_REQ &&
>>>> -	    counter >= delta) {
>>>> -		WRITE_ONCE(counter, counter - delta);
>>>> -		fastpath = 1;
>>>> +	if (READ_ONCE(theft) <= THEFT_REQ) {
>>>> +		smp_rmb();
>>>> +		if (counter >= delta) {
>>>> +			WRITE_ONCE(counter, counter - delta);
>>>> +			fastpath = 1;
>>>> +		}
>>>>  	}
>>>>  	barrier();
>>>>  	WRITE_ONCE(counting, 0);
>>>> -- 
>>>> 2.7.4
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
  2018-10-18 13:03       ` Akira Yokosawa
@ 2018-10-18 15:15         ` Paul E. McKenney
  2018-10-18 22:43           ` Akira Yokosawa
  0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2018-10-18 15:15 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Thu, Oct 18, 2018 at 10:03:56PM +0900, Akira Yokosawa wrote:
> On 2018/10/17 17:37:39 -0700, Paul E. McKenney wrote:
> > On Thu, Oct 18, 2018 at 07:21:38AM +0900, Akira Yokosawa wrote:
> >> On 2018/10/17 08:10:52 -0700, Paul E. McKenney wrote:
> >>> On Tue, Oct 16, 2018 at 08:04:00AM +0900, Akira Yokosawa wrote:
> >>>> >From 7b01fc0f19cfa010536d7eb53e4d0cda1e6b801f Mon Sep 17 00:00:00 2001
> >>>> From: Akira Yokosawa <akiyks@gmail.com>
> >>>> Date: Mon, 15 Oct 2018 23:46:52 +0900
> >>>> Subject: RFC [PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
> >>>>
> >>>> This message-passing pattern requires smp_wmb()--smp_rmb() pairing.
> >>>>
> >>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> >>>> ---
> >>>> Hi Paul,
> >>>>
> >>>> I'm not sure this addition of memory barriers is actually required,
> >>>> but it does look like so.
> >>>>
> >>>> And I'm aware that you have avoided using weaker memory barriers in
> >>>> CodeSamples.
> >>>>
> >>>> Thoughts?
> >>>
> >>> Hello, Akira,
> >>>
> >>> I might be missing something, but it looks to me like this ordering is
> >>> covered by heavyweight ordering in the signal handler entry/exit and
> >>> the gblcnt_mutex.  So what sequence of events leads to the failiure
> >>> scenario that you are seeing?
> >>
> >> So the fastpaths in add_count() and sub_count() are not protected by
> >> glbcnt_mutex.  The slowpath in flush_local_count() waits the transition
> >> of theft from REQ to READY, clears counter and countermax, and finally
> >> assign IDLE to theft.
> >>
> >> So, the fastpaths can see (theft == IDLE) but see a non-zero value of
> >> counter or countermax, can't they?
> > 
> > Maybe, maybe not.  Please lay out a sequence of events showing a problem,
> > as in load by load, store by store, line by line.  Intuition isn't as
> > helpful as one might like for this kind of stuff.  ;-)
> 
> Gotcha!
> 
> I've not exhausted the timing variations, but now I see when
> split_local_count() sees (*theft@[t] == THEFT_READY), counter part of
> add_count() or sub_count() has exited the fastpath (marked by
> counting == 1).
> 
> So the race I imagined has never existed.

I know that feeling!!!

> Thanks for your nice suggestion!

Well, there might well be another race.  My main concern is whether or not
signal-handler entry/exit really provides full ordering on all platforms.

Thoughts?

							Thanx, Paul

> >> One theory to prevent this from happening is because all the per-thread
> >> variables of a thread reside in a single cache line, and if the fastpaths
> >> see the updated value of theft, they are guaranteed to see the latest
> >> values of both counter and countermax.
> > 
> > Good point, but we need to avoid that sort of assumption unless we
> > placed the variables into a struct and told the compiler to align it
> > appropriately.  And even then, hardware architectures normally don't
> > make this sort of guarantee.  There is too much that can go wrong, from
> > ECC errors to interrupts at just the wrong time, and much else besides.
> 
> Absolutely!
> 
>         Thanks, Akira
> 
> > 
> > 							Thanx, Paul
> > 
> >> I might be completely missing something, though.
> >>
> >>         Thanks, Akira 
> >>
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>>         Thanks, Akira
> >>>> --
> >>>>  CodeSamples/arch-arm/arch-arm.h     |  2 ++
> >>>>  CodeSamples/arch-arm64/arch-arm64.h |  2 ++
> >>>>  CodeSamples/arch-ppc64/arch-ppc64.h |  2 ++
> >>>>  CodeSamples/arch-x86/arch-x86.h     |  2 ++
> >>>>  CodeSamples/count/count_lim_sig.c   | 21 +++++++++++++--------
> >>>>  5 files changed, 21 insertions(+), 8 deletions(-)
> >>>>
> >>>> diff --git a/CodeSamples/arch-arm/arch-arm.h b/CodeSamples/arch-arm/arch-arm.h
> >>>> index 065c6f1..6f0707b 100644
> >>>> --- a/CodeSamples/arch-arm/arch-arm.h
> >>>> +++ b/CodeSamples/arch-arm/arch-arm.h
> >>>> @@ -41,6 +41,8 @@
> >>>>  /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
> >>>>  #define smp_mb()  __asm__ __volatile__("dmb" : : : "memory")
> >>>>  
> >>>> +#define smp_rmb() __asm__ __volatile__("dmb ish" : : : "memory")
> >>>> +#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
> >>>>  
> >>>>  #include <stdlib.h>
> >>>>  #include <sys/time.h>
> >>>> diff --git a/CodeSamples/arch-arm64/arch-arm64.h b/CodeSamples/arch-arm64/arch-arm64.h
> >>>> index 354f1f2..a6ccf33 100644
> >>>> --- a/CodeSamples/arch-arm64/arch-arm64.h
> >>>> +++ b/CodeSamples/arch-arm64/arch-arm64.h
> >>>> @@ -41,6 +41,8 @@
> >>>>  /* __sync_synchronize() is broken before gcc 4.4.1 on many ARM systems. */
> >>>>  #define smp_mb()  __asm__ __volatile__("dmb ish" : : : "memory")
> >>>>  
> >>>> +#define smp_rmb() __asm__ __volatile__("dmb ishld" : : : "memory")
> >>>> +#define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
> >>>>  
> >>>>  #include <stdlib.h>
> >>>>  #include <time.h>
> >>>> diff --git a/CodeSamples/arch-ppc64/arch-ppc64.h b/CodeSamples/arch-ppc64/arch-ppc64.h
> >>>> index 7b0b025..2d6a2b5 100644
> >>>> --- a/CodeSamples/arch-ppc64/arch-ppc64.h
> >>>> +++ b/CodeSamples/arch-ppc64/arch-ppc64.h
> >>>> @@ -42,6 +42,8 @@
> >>>>  
> >>>>  #define smp_mb()  __asm__ __volatile__("sync" : : : "memory")
> >>>>  
> >>>> +#define smp_rmb() __asm__ __volatile__("lwsync" : : : "memory")
> >>>> +#define smp_wmb() __asm__ __volatile__("lwsync" : : : "memory")
> >>>>  
> >>>>  /*
> >>>>   * Generate 64-bit timestamp.
> >>>> diff --git a/CodeSamples/arch-x86/arch-x86.h b/CodeSamples/arch-x86/arch-x86.h
> >>>> index 9ea97ca..2765bfc 100644
> >>>> --- a/CodeSamples/arch-x86/arch-x86.h
> >>>> +++ b/CodeSamples/arch-x86/arch-x86.h
> >>>> @@ -52,6 +52,8 @@ __asm__ __volatile__(LOCK_PREFIX "orl %0,%1" \
> >>>>  __asm__ __volatile__("mfence" : : : "memory")
> >>>>  /* __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory") */
> >>>>  
> >>>> +#define smp_rmb() barrier()
> >>>> +#define smp_wmb() barrier()
> >>>>  
> >>>>  /*
> >>>>   * Generate 64-bit timestamp.
> >>>> diff --git a/CodeSamples/count/count_lim_sig.c b/CodeSamples/count/count_lim_sig.c
> >>>> index c316426..26a2a76 100644
> >>>> --- a/CodeSamples/count/count_lim_sig.c
> >>>> +++ b/CodeSamples/count/count_lim_sig.c
> >>>> @@ -89,6 +89,7 @@ static void flush_local_count(void)			//\lnlbl{flush:b}
> >>>>  		*counterp[t] = 0;
> >>>>  		globalreserve -= *countermaxp[t];
> >>>>  		*countermaxp[t] = 0;			//\lnlbl{flush:thiev:e}
> >>>> +		smp_wmb();				//\lnlbl{flush:wmb}
> >>>>  		WRITE_ONCE(*theftp[t], THEFT_IDLE);	//\lnlbl{flush:IDLE}
> >>>>  	}						//\lnlbl{flush:loop2:e}
> >>>>  }							//\lnlbl{flush:e}
> >>>> @@ -115,10 +116,12 @@ int add_count(unsigned long delta)			//\lnlbl{b}
> >>>>  
> >>>>  	WRITE_ONCE(counting, 1);			//\lnlbl{fast:b}
> >>>>  	barrier();					//\lnlbl{barrier:1}
> >>>> -	if (READ_ONCE(theft) <= THEFT_REQ &&		//\lnlbl{check:b}
> >>>> -	    countermax - counter >= delta) {		//\lnlbl{check:e}
> >>>> -		WRITE_ONCE(counter, counter + delta);	//\lnlbl{add:f}
> >>>> -		fastpath = 1;				//\lnlbl{fasttaken}
> >>>> +	if (READ_ONCE(theft) <= THEFT_REQ) {		//\lnlbl{check:b}
> >>>> +		smp_rmb();				//\lnlbl{rmb}
> >>>> +		if (countermax - counter >= delta) {	//\lnlbl{check:e}
> >>>> +			WRITE_ONCE(counter, counter + delta);//\lnlbl{add:f}
> >>>> +			fastpath = 1;			//\lnlbl{fasttaken}
> >>>> +		}
> >>>>  	}
> >>>>  	barrier();					//\lnlbl{barrier:2}
> >>>>  	WRITE_ONCE(counting, 0);			//\lnlbl{clearcnt}
> >>>> @@ -154,10 +157,12 @@ int sub_count(unsigned long delta)
> >>>>  
> >>>>  	WRITE_ONCE(counting, 1);
> >>>>  	barrier();
> >>>> -	if (READ_ONCE(theft) <= THEFT_REQ &&
> >>>> -	    counter >= delta) {
> >>>> -		WRITE_ONCE(counter, counter - delta);
> >>>> -		fastpath = 1;
> >>>> +	if (READ_ONCE(theft) <= THEFT_REQ) {
> >>>> +		smp_rmb();
> >>>> +		if (counter >= delta) {
> >>>> +			WRITE_ONCE(counter, counter - delta);
> >>>> +			fastpath = 1;
> >>>> +		}
> >>>>  	}
> >>>>  	barrier();
> >>>>  	WRITE_ONCE(counting, 0);
> >>>> -- 
> >>>> 2.7.4
> >>>>
> >>>
> >>
> > 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
  2018-10-18 15:15         ` Paul E. McKenney
@ 2018-10-18 22:43           ` Akira Yokosawa
  2018-10-19  0:32             ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Akira Yokosawa @ 2018-10-18 22:43 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On 2018/10/18 08:15:19 -0700, Paul E. McKenney wrote:
> On Thu, Oct 18, 2018 at 10:03:56PM +0900, Akira Yokosawa wrote:
>> On 2018/10/17 17:37:39 -0700, Paul E. McKenney wrote:
>>> On Thu, Oct 18, 2018 at 07:21:38AM +0900, Akira Yokosawa wrote:
>>>> On 2018/10/17 08:10:52 -0700, Paul E. McKenney wrote:
>>>>> On Tue, Oct 16, 2018 at 08:04:00AM +0900, Akira Yokosawa wrote:
>>>>>> >From 7b01fc0f19cfa010536d7eb53e4d0cda1e6b801f Mon Sep 17 00:00:00 2001
>>>>>> From: Akira Yokosawa <akiyks@gmail.com>
>>>>>> Date: Mon, 15 Oct 2018 23:46:52 +0900
>>>>>> Subject: RFC [PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
>>>>>>
>>>>>> This message-passing pattern requires smp_wmb()--smp_rmb() pairing.
>>>>>>
>>>>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
>>>>>> ---
>>>>>> Hi Paul,
>>>>>>
>>>>>> I'm not sure this addition of memory barriers is actually required,
>>>>>> but it does look like so.
>>>>>>
>>>>>> And I'm aware that you have avoided using weaker memory barriers in
>>>>>> CodeSamples.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Hello, Akira,
>>>>>
>>>>> I might be missing something, but it looks to me like this ordering is
>>>>> covered by heavyweight ordering in the signal handler entry/exit and
>>>>> the gblcnt_mutex.  So what sequence of events leads to the failiure
>>>>> scenario that you are seeing?
>>>>
>>>> So the fastpaths in add_count() and sub_count() are not protected by
>>>> glbcnt_mutex.  The slowpath in flush_local_count() waits the transition
>>>> of theft from REQ to READY, clears counter and countermax, and finally
>>>> assign IDLE to theft.
>>>>
>>>> So, the fastpaths can see (theft == IDLE) but see a non-zero value of
>>>> counter or countermax, can't they?
>>>
>>> Maybe, maybe not.  Please lay out a sequence of events showing a problem,
>>> as in load by load, store by store, line by line.  Intuition isn't as
>>> helpful as one might like for this kind of stuff.  ;-)
>>
>> Gotcha!
>>
>> I've not exhausted the timing variations, but now I see when
>> split_local_count() sees (*theft@[t] == THEFT_READY), counter part of
>> add_count() or sub_count() has exited the fastpath (marked by
>> counting == 1).
>>
>> So the race I imagined has never existed.
> 
> I know that feeling!!!
> 
>> Thanks for your nice suggestion!
> 
> Well, there might well be another race.  My main concern is whether or not
> signal-handler entry/exit really provides full ordering on all platforms.
> 
> Thoughts?

Does your concern related to the lack of memory barrier at the entry of
flush_local_count_sig() in Listing 5.17?

        Akira  

> 
> 							Thanx, Paul
> 
>>>> One theory to prevent this from happening is because all the per-thread
>>>> variables of a thread reside in a single cache line, and if the fastpaths
>>>> see the updated value of theft, they are guaranteed to see the latest
>>>> values of both counter and countermax.
>>>
>>> Good point, but we need to avoid that sort of assumption unless we
>>> placed the variables into a struct and told the compiler to align it
>>> appropriately.  And even then, hardware architectures normally don't
>>> make this sort of guarantee.  There is too much that can go wrong, from
>>> ECC errors to interrupts at just the wrong time, and much else besides.
>>
>> Absolutely!
>>
>>         Thanks, Akira
>>
>>>
>>> 							Thanx, Paul
>>>
>>>> I might be completely missing something, though.
>>>>
>>>>         Thanks, Akira 
>>>>
>>>>>
>>>>> 							Thanx, Paul
>>>>>
[...]


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
  2018-10-18 22:43           ` Akira Yokosawa
@ 2018-10-19  0:32             ` Paul E. McKenney
  0 siblings, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2018-10-19  0:32 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Fri, Oct 19, 2018 at 07:43:57AM +0900, Akira Yokosawa wrote:
> On 2018/10/18 08:15:19 -0700, Paul E. McKenney wrote:
> > On Thu, Oct 18, 2018 at 10:03:56PM +0900, Akira Yokosawa wrote:
> >> On 2018/10/17 17:37:39 -0700, Paul E. McKenney wrote:
> >>> On Thu, Oct 18, 2018 at 07:21:38AM +0900, Akira Yokosawa wrote:
> >>>> On 2018/10/17 08:10:52 -0700, Paul E. McKenney wrote:
> >>>>> On Tue, Oct 16, 2018 at 08:04:00AM +0900, Akira Yokosawa wrote:
> >>>>>> >From 7b01fc0f19cfa010536d7eb53e4d0cda1e6b801f Mon Sep 17 00:00:00 2001
> >>>>>> From: Akira Yokosawa <akiyks@gmail.com>
> >>>>>> Date: Mon, 15 Oct 2018 23:46:52 +0900
> >>>>>> Subject: RFC [PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb()
> >>>>>>
> >>>>>> This message-passing pattern requires smp_wmb()--smp_rmb() pairing.
> >>>>>>
> >>>>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> >>>>>> ---
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> I'm not sure this addition of memory barriers is actually required,
> >>>>>> but it does look like so.
> >>>>>>
> >>>>>> And I'm aware that you have avoided using weaker memory barriers in
> >>>>>> CodeSamples.
> >>>>>>
> >>>>>> Thoughts?
> >>>>>
> >>>>> Hello, Akira,
> >>>>>
> >>>>> I might be missing something, but it looks to me like this ordering is
> >>>>> covered by heavyweight ordering in the signal handler entry/exit and
> >>>>> the gblcnt_mutex.  So what sequence of events leads to the failiure
> >>>>> scenario that you are seeing?
> >>>>
> >>>> So the fastpaths in add_count() and sub_count() are not protected by
> >>>> glbcnt_mutex.  The slowpath in flush_local_count() waits the transition
> >>>> of theft from REQ to READY, clears counter and countermax, and finally
> >>>> assign IDLE to theft.
> >>>>
> >>>> So, the fastpaths can see (theft == IDLE) but see a non-zero value of
> >>>> counter or countermax, can't they?
> >>>
> >>> Maybe, maybe not.  Please lay out a sequence of events showing a problem,
> >>> as in load by load, store by store, line by line.  Intuition isn't as
> >>> helpful as one might like for this kind of stuff.  ;-)
> >>
> >> Gotcha!
> >>
> >> I've not exhausted the timing variations, but now I see when
> >> split_local_count() sees (*theft@[t] == THEFT_READY), counter part of
> >> add_count() or sub_count() has exited the fastpath (marked by
> >> counting == 1).
> >>
> >> So the race I imagined has never existed.
> > 
> > I know that feeling!!!
> > 
> >> Thanks for your nice suggestion!
> > 
> > Well, there might well be another race.  My main concern is whether or not
> > signal-handler entry/exit really provides full ordering on all platforms.
> > 
> > Thoughts?
> 
> Does your concern related to the lack of memory barrier at the entry of
> flush_local_count_sig() in Listing 5.17?

Placing memory barriers at flush_local_count_sig() would certainly make the
code independent of the kernel's ordering, but would those barriers really
be needed?  If they are needed, would lighter-weight synchronization work?

							Thanx, Paul

>         Akira  
> 
> > 
> > 							Thanx, Paul
> > 
> >>>> One theory to prevent this from happening is because all the per-thread
> >>>> variables of a thread reside in a single cache line, and if the fastpaths
> >>>> see the updated value of theft, they are guaranteed to see the latest
> >>>> values of both counter and countermax.
> >>>
> >>> Good point, but we need to avoid that sort of assumption unless we
> >>> placed the variables into a struct and told the compiler to align it
> >>> appropriately.  And even then, hardware architectures normally don't
> >>> make this sort of guarantee.  There is too much that can go wrong, from
> >>> ECC errors to interrupts at just the wrong time, and much else besides.
> >>
> >> Absolutely!
> >>
> >>         Thanks, Akira
> >>
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>> I might be completely missing something, though.
> >>>>
> >>>>         Thanks, Akira 
> >>>>
> >>>>>
> >>>>> 							Thanx, Paul
> >>>>>
> [...]
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-10-19  8:36 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-15 23:04 [RFC PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb() Akira Yokosawa
2018-10-17 15:10 ` Paul E. McKenney
2018-10-17 22:21   ` Akira Yokosawa
2018-10-18  0:37     ` Paul E. McKenney
2018-10-18 13:03       ` Akira Yokosawa
2018-10-18 15:15         ` Paul E. McKenney
2018-10-18 22:43           ` Akira Yokosawa
2018-10-19  0:32             ` Paul E. McKenney

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.