All of lore.kernel.org
 help / color / mirror / Atom feed
* [Possible BUG] count_lim_atomic.c fails on POWER8
@ 2018-10-20 15:53 Akira Yokosawa
  2018-10-20 16:36 ` Paul E. McKenney
  0 siblings, 1 reply; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-20 15:53 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

Hi Paul,

I just noticed occasional error of count_lim_atomic.c on POWER8 at current master.
As I've recently touched the code under Codesamples/count/, I also tested on
the tag "v2017.11.22a", and saw the same behavior.

The POWER8 virtual machine is Ubuntu 16.04.

Example output:

$ ./count_lim_atomic 6 uperf 1
!!! Count mismatch: 0 counted vs. 8 final value
n_reads: 0  n_updates: 26038000  nreaders: 0  nupdaters: 6 duration: 240
ns/read: nan  ns/update: 55.3038

$ ./count_lim_atomic 6 perf 1
!!! Count mismatch: 0 counted vs. 11 final value
n_reads: 287000  n_updates: 1702000  nreaders: 6  nupdaters: 1 duration: 240
ns/read: 5017.42  ns/update: 141.011

As you see, the final count check of zero fails even when nupdaters == 1.

I have no idea what's wrong in count_lim_atomic.c.

Can you look into this? There might be something wrong in the header file
under CodeSamples/arch-ppc64.h.

On x86_64, I've never seen the count mismatch.

         Thanks, Akira
 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-20 15:53 [Possible BUG] count_lim_atomic.c fails on POWER8 Akira Yokosawa
@ 2018-10-20 16:36 ` Paul E. McKenney
  2018-10-24 15:53   ` Junchang Wang
  0 siblings, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2018-10-20 16:36 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook

On Sun, Oct 21, 2018 at 12:53:17AM +0900, Akira Yokosawa wrote:
> Hi Paul,
> 
> I just noticed occasional error of count_lim_atomic.c on POWER8 at current master.
> As I've recently touched the code under Codesamples/count/, I also tested on
> the tag "v2017.11.22a", and saw the same behavior.
> 
> The POWER8 virtual machine is Ubuntu 16.04.
> 
> Example output:
> 
> $ ./count_lim_atomic 6 uperf 1
> !!! Count mismatch: 0 counted vs. 8 final value
> n_reads: 0  n_updates: 26038000  nreaders: 0  nupdaters: 6 duration: 240
> ns/read: nan  ns/update: 55.3038
> 
> $ ./count_lim_atomic 6 perf 1
> !!! Count mismatch: 0 counted vs. 11 final value
> n_reads: 287000  n_updates: 1702000  nreaders: 6  nupdaters: 1 duration: 240
> ns/read: 5017.42  ns/update: 141.011
> 
> As you see, the final count check of zero fails even when nupdaters == 1.

Yow!!!  Thank you for checking this!

That said, it probably wasn't really single threaded, at least assuming
that you had at least one reader.

> I have no idea what's wrong in count_lim_atomic.c.
> 
> Can you look into this? There might be something wrong in the header file
> under CodeSamples/arch-ppc64.h.

There isn't much in that file anymore because we now rely on the gcc
intrinsics for the most part.  Which might well be the problem, depending
on compiler versions and so on.

Could you please send me the output of "objdump -d" on count_lim_atomic.o?
And on the full count_lim_atomic binary, just in case gcc decides to be
tricky in its code generation?

In the meantime, there might well be a generic bug in count_lim_atomic.c
that just happens not to be exercised on x86, and I will look into that.
I am on travel next week, so will be in odd timezones, but should have
at least a little useful airplane time to look into this.

> On x86_64, I've never seen the count mismatch.

Well, David Goldblatt's first C++11 signal-based litmus test wouldn't fail
on PowerPC but did on x86, so I guess that they are now even.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-20 16:36 ` Paul E. McKenney
@ 2018-10-24 15:53   ` Junchang Wang
  2018-10-24 22:05     ` Akira Yokosawa
  0 siblings, 1 reply; 25+ messages in thread
From: Junchang Wang @ 2018-10-24 15:53 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Akira Yokosawa, perfbook

Hi Akira and Paul,

I checked the code today and the implementation of cmpxchg() looks
very suspicious to me; Current  cmpxchg() first executes function
__atomic_compare_exchange_n, and then checks whether the value stored
in field __actual (old) has been changed to decide if the CAS
instruction has been successfully performed. However, I saw the *weak*
field is set, which, as far as I know, means
__atomic_compare_exchange_n could fail even if the value of *ptr is
equal to __actual (old). Unfortunately, current cmpxchg will treat
this case as a success because the value of __actual(old) does not
change.

This bug happens in both Power8 and ARMv8. It seems it affects
architectures that use LL/SC to emulate CAS. Following patch helps
solve this issue on my testbeds. Please take a look. Any thoughts?

---
 CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/CodeSamples/api-pthreads/api-gcc.h
b/CodeSamples/api-pthreads/api-gcc.h
index 1dd26ca..38a16c0 100644
--- a/CodeSamples/api-pthreads/api-gcc.h
+++ b/CodeSamples/api-pthreads/api-gcc.h
@@ -166,11 +166,9 @@ struct __xchg_dummy {

 #define cmpxchg(ptr, o, n) \
 ({ \
-       typeof(*ptr) _____actual = (o); \
-       \
-       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
-                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
-       _____actual; \
+       typeof(*ptr) old = (o); \
+       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
+                               (o) : (n); \
 })

 static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
-- 
2.7.4


Thanks,
--Junchang

On Sun, Oct 21, 2018 at 12:37 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>
> On Sun, Oct 21, 2018 at 12:53:17AM +0900, Akira Yokosawa wrote:
> > Hi Paul,
> >
> > I just noticed occasional error of count_lim_atomic.c on POWER8 at current master.
> > As I've recently touched the code under Codesamples/count/, I also tested on
> > the tag "v2017.11.22a", and saw the same behavior.
> >
> > The POWER8 virtual machine is Ubuntu 16.04.
> >
> > Example output:
> >
> > $ ./count_lim_atomic 6 uperf 1
> > !!! Count mismatch: 0 counted vs. 8 final value
> > n_reads: 0  n_updates: 26038000  nreaders: 0  nupdaters: 6 duration: 240
> > ns/read: nan  ns/update: 55.3038
> >
> > $ ./count_lim_atomic 6 perf 1
> > !!! Count mismatch: 0 counted vs. 11 final value
> > n_reads: 287000  n_updates: 1702000  nreaders: 6  nupdaters: 1 duration: 240
> > ns/read: 5017.42  ns/update: 141.011
> >
> > As you see, the final count check of zero fails even when nupdaters == 1.
>
> Yow!!!  Thank you for checking this!
>
> That said, it probably wasn't really single threaded, at least assuming
> that you had at least one reader.
>
> > I have no idea what's wrong in count_lim_atomic.c.
> >
> > Can you look into this? There might be something wrong in the header file
> > under CodeSamples/arch-ppc64.h.
>
> There isn't much in that file anymore because we now rely on the gcc
> intrinsics for the most part.  Which might well be the problem, depending
> on compiler versions and so on.
>
> Could you please send me the output of "objdump -d" on count_lim_atomic.o?
> And on the full count_lim_atomic binary, just in case gcc decides to be
> tricky in its code generation?
>
> In the meantime, there might well be a generic bug in count_lim_atomic.c
> that just happens not to be exercised on x86, and I will look into that.
> I am on travel next week, so will be in odd timezones, but should have
> at least a little useful airplane time to look into this.
>
> > On x86_64, I've never seen the count mismatch.
>
> Well, David Goldblatt's first C++11 signal-based litmus test wouldn't fail
> on PowerPC but did on x86, so I guess that they are now even.  ;-)
>
>                                                         Thanx, Paul
>


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-24 15:53   ` Junchang Wang
@ 2018-10-24 22:05     ` Akira Yokosawa
  2018-10-24 22:29       ` Akira Yokosawa
  2018-10-25  2:11       ` Junchang Wang
  0 siblings, 2 replies; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-24 22:05 UTC (permalink / raw)
  To: Junchang Wang, Paul McKenney; +Cc: perfbook, Akira Yokosawa

On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
> Hi Akira and Paul,
> 
> I checked the code today and the implementation of cmpxchg() looks
> very suspicious to me; Current  cmpxchg() first executes function
> __atomic_compare_exchange_n, and then checks whether the value stored
> in field __actual (old) has been changed to decide if the CAS
> instruction has been successfully performed. However, I saw the *weak*
> field is set, which, as far as I know, means
> __atomic_compare_exchange_n could fail even if the value of *ptr is
> equal to __actual (old). Unfortunately, current cmpxchg will treat
> this case as a success because the value of __actual(old) does not
> change.

Thanks for looking into this!

I also suspected the use of "weak" semantics of
__atomic_compare_exchange_n(), but did not quite understand what
"spurious fail" actually means. Your theory sounds plausible to me.

I've suggested in a private email to Paul to modify the 4th argument
to false(0) as a workaround, which would prevent such "spurious fail".

Both approaches looks good to me. I'd defer to Paul on the choice.

        Thanks, Akira

> 
> This bug happens in both Power8 and ARMv8. It seems it affects
> architectures that use LL/SC to emulate CAS. Following patch helps
> solve this issue on my testbeds. Please take a look. Any thoughts?
> 
> ---
>  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/CodeSamples/api-pthreads/api-gcc.h
> b/CodeSamples/api-pthreads/api-gcc.h
> index 1dd26ca..38a16c0 100644
> --- a/CodeSamples/api-pthreads/api-gcc.h
> +++ b/CodeSamples/api-pthreads/api-gcc.h
> @@ -166,11 +166,9 @@ struct __xchg_dummy {
> 
>  #define cmpxchg(ptr, o, n) \
>  ({ \
> -       typeof(*ptr) _____actual = (o); \
> -       \
> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> -       _____actual; \
> +       typeof(*ptr) old = (o); \
> +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> +                               (o) : (n); \
>  })
> 
>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-24 22:05     ` Akira Yokosawa
@ 2018-10-24 22:29       ` Akira Yokosawa
  2018-10-25  2:18         ` Junchang Wang
  2018-10-25  2:11       ` Junchang Wang
  1 sibling, 1 reply; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-24 22:29 UTC (permalink / raw)
  To: Junchang Wang, Paul McKenney; +Cc: perfbook, Akira Yokosawa

On 2018/10/25 07:05:04 +0900, Akira Yokosawa wrote:
> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
>> Hi Akira and Paul,
>>
>> I checked the code today and the implementation of cmpxchg() looks
>> very suspicious to me; Current  cmpxchg() first executes function
>> __atomic_compare_exchange_n, and then checks whether the value stored
>> in field __actual (old) has been changed to decide if the CAS
>> instruction has been successfully performed. However, I saw the *weak*
>> field is set, which, as far as I know, means
>> __atomic_compare_exchange_n could fail even if the value of *ptr is
>> equal to __actual (old). Unfortunately, current cmpxchg will treat
>> this case as a success because the value of __actual(old) does not
>> change.
> 
> Thanks for looking into this!
> 
> I also suspected the use of "weak" semantics of
> __atomic_compare_exchange_n(), but did not quite understand what
> "spurious fail" actually means. Your theory sounds plausible to me.
> 
> I've suggested in a private email to Paul to modify the 4th argument
> to false(0) as a workaround, which would prevent such "spurious fail".
> 
> Both approaches looks good to me. I'd defer to Paul on the choice.
> 
>         Thanks, Akira
> 
>>
>> This bug happens in both Power8 and ARMv8. It seems it affects
>> architectures that use LL/SC to emulate CAS. Following patch helps
>> solve this issue on my testbeds. Please take a look. Any thoughts?
>>
>> ---
>>  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
>>  1 file changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/CodeSamples/api-pthreads/api-gcc.h
>> b/CodeSamples/api-pthreads/api-gcc.h
>> index 1dd26ca..38a16c0 100644
>> --- a/CodeSamples/api-pthreads/api-gcc.h
>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>> @@ -166,11 +166,9 @@ struct __xchg_dummy {
>>
>>  #define cmpxchg(ptr, o, n) \
>>  ({ \
>> -       typeof(*ptr) _____actual = (o); \
>> -       \
>> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>> -       _____actual; \
>> +       typeof(*ptr) old = (o); \
>> +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>> +                               (o) : (n); \

BTW, returning (n) in the fail case would be problematic when "(o) == (n)"
in the first place, wouldn't it?

        Thanks, Akira

>>  })
>>
>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-24 22:05     ` Akira Yokosawa
  2018-10-24 22:29       ` Akira Yokosawa
@ 2018-10-25  2:11       ` Junchang Wang
  2018-10-25  9:45         ` Paul E. McKenney
  1 sibling, 1 reply; 25+ messages in thread
From: Junchang Wang @ 2018-10-25  2:11 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Paul McKenney, perfbook

Hi Akira,

Thanks for the mail. My understanding is that PPC uses LL/SC to
emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
could fail (due to, for example, context switches) even if *ptr equals
to old. In such a case, a CAS instruction in actually should return a
success. I think this is what the term "spurious fail" describes. Here
is a reference:
http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family

It seems that __atomic_compare_exchange_n() provides option "weak" for
performance. I tested these two solutions and got the following
results:

                           1      4      8     16     32    64
my patch (ns)     35    34    37    73    142  281
strong (ns)          39    39    41    79    158  313

I tested the performance of count_lim_atomic by varying the number of
updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
first row in the table is the result when my patch is used, and the
second row is the result when the 4th argument of the function is set
to false(0). It seems performance improves slightly if option "weak"
is used. However, there is no performance boost as we expected. So
your solution sounds good if safety is one major concern because
option "weak" seems risky to me :-)

Another interesting observation is that the performance of LL/SC-based
CAS instruction deteriorates dramatically when the number of working
threads exceeds the number of CPU cores.


Thanks,
--Junchang


On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@gmail.com> wrote:
>
> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
> > Hi Akira and Paul,
> >
> > I checked the code today and the implementation of cmpxchg() looks
> > very suspicious to me; Current  cmpxchg() first executes function
> > __atomic_compare_exchange_n, and then checks whether the value stored
> > in field __actual (old) has been changed to decide if the CAS
> > instruction has been successfully performed. However, I saw the *weak*
> > field is set, which, as far as I know, means
> > __atomic_compare_exchange_n could fail even if the value of *ptr is
> > equal to __actual (old). Unfortunately, current cmpxchg will treat
> > this case as a success because the value of __actual(old) does not
> > change.
>
> Thanks for looking into this!
>
> I also suspected the use of "weak" semantics of
> __atomic_compare_exchange_n(), but did not quite understand what
> "spurious fail" actually means. Your theory sounds plausible to me.
>
> I've suggested in a private email to Paul to modify the 4th argument
> to false(0) as a workaround, which would prevent such "spurious fail".
>
> Both approaches looks good to me. I'd defer to Paul on the choice.
>
>         Thanks, Akira
>
> >
> > This bug happens in both Power8 and ARMv8. It seems it affects
> > architectures that use LL/SC to emulate CAS. Following patch helps
> > solve this issue on my testbeds. Please take a look. Any thoughts?
> >
> > ---
> >  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
> >  1 file changed, 3 insertions(+), 5 deletions(-)
> >
> > diff --git a/CodeSamples/api-pthreads/api-gcc.h
> > b/CodeSamples/api-pthreads/api-gcc.h
> > index 1dd26ca..38a16c0 100644
> > --- a/CodeSamples/api-pthreads/api-gcc.h
> > +++ b/CodeSamples/api-pthreads/api-gcc.h
> > @@ -166,11 +166,9 @@ struct __xchg_dummy {
> >
> >  #define cmpxchg(ptr, o, n) \
> >  ({ \
> > -       typeof(*ptr) _____actual = (o); \
> > -       \
> > -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> > -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> > -       _____actual; \
> > +       typeof(*ptr) old = (o); \
> > +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> > __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> > +                               (o) : (n); \
> >  })
> >
> >  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> >
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-24 22:29       ` Akira Yokosawa
@ 2018-10-25  2:18         ` Junchang Wang
  0 siblings, 0 replies; 25+ messages in thread
From: Junchang Wang @ 2018-10-25  2:18 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Paul McKenney, perfbook

Hi Akira,

On Thu, Oct 25, 2018 at 6:30 AM Akira Yokosawa <akiyks@gmail.com> wrote:
>
> On 2018/10/25 07:05:04 +0900, Akira Yokosawa wrote:
> > On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
> >> Hi Akira and Paul,
> >>
> >> I checked the code today and the implementation of cmpxchg() looks
> >> very suspicious to me; Current  cmpxchg() first executes function
> >> __atomic_compare_exchange_n, and then checks whether the value stored
> >> in field __actual (old) has been changed to decide if the CAS
> >> instruction has been successfully performed. However, I saw the *weak*
> >> field is set, which, as far as I know, means
> >> __atomic_compare_exchange_n could fail even if the value of *ptr is
> >> equal to __actual (old). Unfortunately, current cmpxchg will treat
> >> this case as a success because the value of __actual(old) does not
> >> change.
> >
> > Thanks for looking into this!
> >
> > I also suspected the use of "weak" semantics of
> > __atomic_compare_exchange_n(), but did not quite understand what
> > "spurious fail" actually means. Your theory sounds plausible to me.
> >
> > I've suggested in a private email to Paul to modify the 4th argument
> > to false(0) as a workaround, which would prevent such "spurious fail".
> >
> > Both approaches looks good to me. I'd defer to Paul on the choice.
> >
> >         Thanks, Akira
> >
> >>
> >> This bug happens in both Power8 and ARMv8. It seems it affects
> >> architectures that use LL/SC to emulate CAS. Following patch helps
> >> solve this issue on my testbeds. Please take a look. Any thoughts?
> >>
> >> ---
> >>  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
> >>  1 file changed, 3 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/CodeSamples/api-pthreads/api-gcc.h
> >> b/CodeSamples/api-pthreads/api-gcc.h
> >> index 1dd26ca..38a16c0 100644
> >> --- a/CodeSamples/api-pthreads/api-gcc.h
> >> +++ b/CodeSamples/api-pthreads/api-gcc.h
> >> @@ -166,11 +166,9 @@ struct __xchg_dummy {
> >>
> >>  #define cmpxchg(ptr, o, n) \
> >>  ({ \
> >> -       typeof(*ptr) _____actual = (o); \
> >> -       \
> >> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> >> -       _____actual; \
> >> +       typeof(*ptr) old = (o); \
> >> +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> >> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> >> +                               (o) : (n); \
>
> BTW, returning (n) in the fail case would be problematic when "(o) == (n)"
> in the first place, wouldn't it?
>
>         Thanks, Akira
>

That's right. Then we should return (*ptr) if CAS fails. I think
whether we return (n) or (*ptr) depends on the definition of xmpxchg
:-)

Thanks,
--Junchang


> >>  })
> >>
> >>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> >>
> >
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-25  2:11       ` Junchang Wang
@ 2018-10-25  9:45         ` Paul E. McKenney
  2018-10-25 12:23           ` Akira Yokosawa
  2018-10-25 14:09           ` Junchang Wang
  0 siblings, 2 replies; 25+ messages in thread
From: Paul E. McKenney @ 2018-10-25  9:45 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Akira Yokosawa, perfbook

On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
> Hi Akira,
> 
> Thanks for the mail. My understanding is that PPC uses LL/SC to
> emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
> could fail (due to, for example, context switches) even if *ptr equals
> to old. In such a case, a CAS instruction in actually should return a
> success. I think this is what the term "spurious fail" describes. Here
> is a reference:
> http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family

First, thank you both for your work on this!  And yes, my cmpxchg() code
is clearly quite broken.

> It seems that __atomic_compare_exchange_n() provides option "weak" for
> performance. I tested these two solutions and got the following
> results:
> 
>                            1      4      8     16     32    64
> my patch (ns)     35    34    37    73    142  281
> strong (ns)          39    39    41    79    158  313

So strong is a bit slower, correct?

> I tested the performance of count_lim_atomic by varying the number of
> updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
> first row in the table is the result when my patch is used, and the
> second row is the result when the 4th argument of the function is set
> to false(0). It seems performance improves slightly if option "weak"
> is used. However, there is no performance boost as we expected. So
> your solution sounds good if safety is one major concern because
> option "weak" seems risky to me :-)
> 
> Another interesting observation is that the performance of LL/SC-based
> CAS instruction deteriorates dramatically when the number of working
> threads exceeds the number of CPU cores.

If weak is faster, would it make sense to return (~o), that is,
the bitwise complement of the expected arguement, when the weak
__atomic_compare_exchange_n() fails?  This would get the improved
performance (if I understand your results above) while correctly handling
the strange (but possible) case where o==n.

Does that make sense, or am I missing something?

							Thanx, Paul

> Thanks,
> --Junchang
> 
> 
> On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@gmail.com> wrote:
> >
> > On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
> > > Hi Akira and Paul,
> > >
> > > I checked the code today and the implementation of cmpxchg() looks
> > > very suspicious to me; Current  cmpxchg() first executes function
> > > __atomic_compare_exchange_n, and then checks whether the value stored
> > > in field __actual (old) has been changed to decide if the CAS
> > > instruction has been successfully performed. However, I saw the *weak*
> > > field is set, which, as far as I know, means
> > > __atomic_compare_exchange_n could fail even if the value of *ptr is
> > > equal to __actual (old). Unfortunately, current cmpxchg will treat
> > > this case as a success because the value of __actual(old) does not
> > > change.
> >
> > Thanks for looking into this!
> >
> > I also suspected the use of "weak" semantics of
> > __atomic_compare_exchange_n(), but did not quite understand what
> > "spurious fail" actually means. Your theory sounds plausible to me.
> >
> > I've suggested in a private email to Paul to modify the 4th argument
> > to false(0) as a workaround, which would prevent such "spurious fail".
> >
> > Both approaches looks good to me. I'd defer to Paul on the choice.
> >
> >         Thanks, Akira
> >
> > >
> > > This bug happens in both Power8 and ARMv8. It seems it affects
> > > architectures that use LL/SC to emulate CAS. Following patch helps
> > > solve this issue on my testbeds. Please take a look. Any thoughts?
> > >
> > > ---
> > >  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
> > >  1 file changed, 3 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/CodeSamples/api-pthreads/api-gcc.h
> > > b/CodeSamples/api-pthreads/api-gcc.h
> > > index 1dd26ca..38a16c0 100644
> > > --- a/CodeSamples/api-pthreads/api-gcc.h
> > > +++ b/CodeSamples/api-pthreads/api-gcc.h
> > > @@ -166,11 +166,9 @@ struct __xchg_dummy {
> > >
> > >  #define cmpxchg(ptr, o, n) \
> > >  ({ \
> > > -       typeof(*ptr) _____actual = (o); \
> > > -       \
> > > -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> > > -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> > > -       _____actual; \
> > > +       typeof(*ptr) old = (o); \
> > > +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> > > __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> > > +                               (o) : (n); \
> > >  })
> > >
> > >  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> > >
> >
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-25  9:45         ` Paul E. McKenney
@ 2018-10-25 12:23           ` Akira Yokosawa
  2018-10-25 14:09           ` Junchang Wang
  1 sibling, 0 replies; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-25 12:23 UTC (permalink / raw)
  To: Paul E. McKenney, Junchang Wang; +Cc: perfbook, Akira Yokosawa

On 2018/10/25 02:45:16 -0700, Paul E. McKenney wrote:
> On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
>> Hi Akira,
>>
>> Thanks for the mail. My understanding is that PPC uses LL/SC to
>> emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
>> could fail (due to, for example, context switches) even if *ptr equals
>> to old. In such a case, a CAS instruction in actually should return a
>> success. I think this is what the term "spurious fail" describes. Here
>> is a reference:
>> http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
> 
> First, thank you both for your work on this!  And yes, my cmpxchg() code
> is clearly quite broken.
> 
>> It seems that __atomic_compare_exchange_n() provides option "weak" for
>> performance. I tested these two solutions and got the following
>> results:
>>
>>                            1      4      8     16     32    64
>> my patch (ns)     35    34    37    73    142  281
>> strong (ns)          39    39    41    79    158  313
> 
> So strong is a bit slower, correct?

Emitted code of strong has an extra conditional branch instruction for
retry loop.

Diff of disassembled code of atomic_cmpxchg (current master vs. strong):

@@ -23,14 +23,15 @@
 ac 04 00 7c 	hwsync
 28 40 40 7d 	lwarx   r10,0,r8
 00 30 8a 7f 	cmpw    cr7,r10,r6
-0c 00 9e 40 	bne     cr7,bc4 <atomic_cmpxchg+0x6c>
+10 00 9e 40 	bne     cr7,bc8 <atomic_cmpxchg+0x70>
 2d 41 e0 7c 	stwcx.  r7,0,r8
 00 00 80 4f 	mcrf    cr7,cr0
+ec ff 9e 40 	bne     cr7,bb0 <atomic_cmpxchg+0x58>  <--- extra branch
 2c 01 00 4c 	isync
 26 10 10 7d 	mfocrf  r8,1
 fe ff 08 55 	rlwinm  r8,r8,31,31,31
 00 00 88 2f 	cmpwi   cr7,r8,0
-08 00 9e 40 	bne     cr7,bdc <atomic_cmpxchg+0x84>
+08 00 9e 40 	bne     cr7,be0 <atomic_cmpxchg+0x88>
 00 00 49 91 	stw     r10,0(r9)
 34 00 3f 81 	lwz     r9,52(r31)
 b4 07 29 7d 	extsw   r9,r9
[...]

> 
>> I tested the performance of count_lim_atomic by varying the number of
>> updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
>> first row in the table is the result when my patch is used, and the
>> second row is the result when the 4th argument of the function is set
>> to false(0). It seems performance improves slightly if option "weak"
>> is used. However, there is no performance boost as we expected. So
>> your solution sounds good if safety is one major concern because
>> option "weak" seems risky to me :-)
>>
>> Another interesting observation is that the performance of LL/SC-based
>> CAS instruction deteriorates dramatically when the number of working
>> threads exceeds the number of CPU cores.
> 
> If weak is faster, would it make sense to return (~o), that is,
> the bitwise complement of the expected arguement, when the weak
> __atomic_compare_exchange_n() fails?  This would get the improved
> performance (if I understand your results above) while correctly handling
> the strange (but possible) case where o==n.
> 
> Does that make sense, or am I missing something?

Sounds reasonable to me.

        Thanks, Akira
>
> 							Thanx, Paul
> 
>> Thanks,
>> --Junchang
>>
>>
>> On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@gmail.com> wrote:
>>>
>>> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
>>>> Hi Akira and Paul,
>>>>
>>>> I checked the code today and the implementation of cmpxchg() looks
>>>> very suspicious to me; Current  cmpxchg() first executes function
>>>> __atomic_compare_exchange_n, and then checks whether the value stored
>>>> in field __actual (old) has been changed to decide if the CAS
>>>> instruction has been successfully performed. However, I saw the *weak*
>>>> field is set, which, as far as I know, means
>>>> __atomic_compare_exchange_n could fail even if the value of *ptr is
>>>> equal to __actual (old). Unfortunately, current cmpxchg will treat
>>>> this case as a success because the value of __actual(old) does not
>>>> change.
>>>
>>> Thanks for looking into this!
>>>
>>> I also suspected the use of "weak" semantics of
>>> __atomic_compare_exchange_n(), but did not quite understand what
>>> "spurious fail" actually means. Your theory sounds plausible to me.
>>>
>>> I've suggested in a private email to Paul to modify the 4th argument
>>> to false(0) as a workaround, which would prevent such "spurious fail".
>>>
>>> Both approaches looks good to me. I'd defer to Paul on the choice.
>>>
>>>         Thanks, Akira
>>>
>>>>
>>>> This bug happens in both Power8 and ARMv8. It seems it affects
>>>> architectures that use LL/SC to emulate CAS. Following patch helps
>>>> solve this issue on my testbeds. Please take a look. Any thoughts?
>>>>
>>>> ---
>>>>  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
>>>>  1 file changed, 3 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h
>>>> b/CodeSamples/api-pthreads/api-gcc.h
>>>> index 1dd26ca..38a16c0 100644
>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>>>> @@ -166,11 +166,9 @@ struct __xchg_dummy {
>>>>
>>>>  #define cmpxchg(ptr, o, n) \
>>>>  ({ \
>>>> -       typeof(*ptr) _____actual = (o); \
>>>> -       \
>>>> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>>> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>>>> -       _____actual; \
>>>> +       typeof(*ptr) old = (o); \
>>>> +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
>>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>>>> +                               (o) : (n); \
>>>>  })
>>>>
>>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-25  9:45         ` Paul E. McKenney
  2018-10-25 12:23           ` Akira Yokosawa
@ 2018-10-25 14:09           ` Junchang Wang
  2018-10-25 15:17             ` Akira Yokosawa
  2018-10-25 15:24             ` Paul E. McKenney
  1 sibling, 2 replies; 25+ messages in thread
From: Junchang Wang @ 2018-10-25 14:09 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Akira Yokosawa, perfbook

On Thu, Oct 25, 2018 at 5:45 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>
> On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
> > Hi Akira,
> >
> > Thanks for the mail. My understanding is that PPC uses LL/SC to
> > emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
> > could fail (due to, for example, context switches) even if *ptr equals
> > to old. In such a case, a CAS instruction in actually should return a
> > success. I think this is what the term "spurious fail" describes. Here
> > is a reference:
> > http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
>
> First, thank you both for your work on this!  And yes, my cmpxchg() code
> is clearly quite broken.
>
> > It seems that __atomic_compare_exchange_n() provides option "weak" for
> > performance. I tested these two solutions and got the following
> > results:
> >
> >                            1      4      8     16     32    64
> > my patch (ns)     35    34    37    73    142  281
> > strong (ns)          39    39    41    79    158  313
>
> So strong is a bit slower, correct?
>
> > I tested the performance of count_lim_atomic by varying the number of
> > updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
> > first row in the table is the result when my patch is used, and the
> > second row is the result when the 4th argument of the function is set
> > to false(0). It seems performance improves slightly if option "weak"
> > is used. However, there is no performance boost as we expected. So
> > your solution sounds good if safety is one major concern because
> > option "weak" seems risky to me :-)
> >
> > Another interesting observation is that the performance of LL/SC-based
> > CAS instruction deteriorates dramatically when the number of working
> > threads exceeds the number of CPU cores.
>
> If weak is faster, would it make sense to return (~o), that is,
> the bitwise complement of the expected arguement, when the weak
> __atomic_compare_exchange_n() fails?  This would get the improved
> performance (if I understand your results above) while correctly handling
> the strange (but possible) case where o==n.
>
> Does that make sense, or am I missing something?

Hi Paul and Akira,

Yes, the weak version is faster. The solution looks good. But when I
tried to use the following patch

#define cmpxchg(ptr, o, n) \
({ \
        typeof(*ptr) old = (o); \
        (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
                                (o) : (~o); \
})

gcc complains of my use of complement symbol

../api.h:769:12: error: wrong type argument to bit-complement
     (o) : (~o); \
              ^

Any suggestions?

Thanks,
--Junchang


>
>                                                         Thanx, Paul
>
> > Thanks,
> > --Junchang
> >
> >
> > On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@gmail.com> wrote:
> > >
> > > On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
> > > > Hi Akira and Paul,
> > > >
> > > > I checked the code today and the implementation of cmpxchg() looks
> > > > very suspicious to me; Current  cmpxchg() first executes function
> > > > __atomic_compare_exchange_n, and then checks whether the value stored
> > > > in field __actual (old) has been changed to decide if the CAS
> > > > instruction has been successfully performed. However, I saw the *weak*
> > > > field is set, which, as far as I know, means
> > > > __atomic_compare_exchange_n could fail even if the value of *ptr is
> > > > equal to __actual (old). Unfortunately, current cmpxchg will treat
> > > > this case as a success because the value of __actual(old) does not
> > > > change.
> > >
> > > Thanks for looking into this!
> > >
> > > I also suspected the use of "weak" semantics of
> > > __atomic_compare_exchange_n(), but did not quite understand what
> > > "spurious fail" actually means. Your theory sounds plausible to me.
> > >
> > > I've suggested in a private email to Paul to modify the 4th argument
> > > to false(0) as a workaround, which would prevent such "spurious fail".
> > >
> > > Both approaches looks good to me. I'd defer to Paul on the choice.
> > >
> > >         Thanks, Akira
> > >
> > > >
> > > > This bug happens in both Power8 and ARMv8. It seems it affects
> > > > architectures that use LL/SC to emulate CAS. Following patch helps
> > > > solve this issue on my testbeds. Please take a look. Any thoughts?
> > > >
> > > > ---
> > > >  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
> > > >  1 file changed, 3 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/CodeSamples/api-pthreads/api-gcc.h
> > > > b/CodeSamples/api-pthreads/api-gcc.h
> > > > index 1dd26ca..38a16c0 100644
> > > > --- a/CodeSamples/api-pthreads/api-gcc.h
> > > > +++ b/CodeSamples/api-pthreads/api-gcc.h
> > > > @@ -166,11 +166,9 @@ struct __xchg_dummy {
> > > >
> > > >  #define cmpxchg(ptr, o, n) \
> > > >  ({ \
> > > > -       typeof(*ptr) _____actual = (o); \
> > > > -       \
> > > > -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> > > > -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> > > > -       _____actual; \
> > > > +       typeof(*ptr) old = (o); \
> > > > +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> > > > __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> > > > +                               (o) : (n); \
> > > >  })
> > > >
> > > >  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> > > >
> > >
> >
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-25 14:09           ` Junchang Wang
@ 2018-10-25 15:17             ` Akira Yokosawa
  2018-10-25 22:04               ` Akira Yokosawa
  2018-10-25 15:24             ` Paul E. McKenney
  1 sibling, 1 reply; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-25 15:17 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Paul McKenney, perfbook, Akira Yokosawa

On 2018/10/25 23:09, Junchang Wang wrote:
> On Thu, Oct 25, 2018 at 5:45 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>>
>> On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
>>> Hi Akira,
>>>
>>> Thanks for the mail. My understanding is that PPC uses LL/SC to
>>> emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
>>> could fail (due to, for example, context switches) even if *ptr equals
>>> to old. In such a case, a CAS instruction in actually should return a
>>> success. I think this is what the term "spurious fail" describes. Here
>>> is a reference:
>>> http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
>>
>> First, thank you both for your work on this!  And yes, my cmpxchg() code
>> is clearly quite broken.
>>
>>> It seems that __atomic_compare_exchange_n() provides option "weak" for
>>> performance. I tested these two solutions and got the following
>>> results:
>>>
>>>                            1      4      8     16     32    64
>>> my patch (ns)     35    34    37    73    142  281
>>> strong (ns)          39    39    41    79    158  313
>>
>> So strong is a bit slower, correct?
>>
>>> I tested the performance of count_lim_atomic by varying the number of
>>> updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
>>> first row in the table is the result when my patch is used, and the
>>> second row is the result when the 4th argument of the function is set
>>> to false(0). It seems performance improves slightly if option "weak"
>>> is used. However, there is no performance boost as we expected. So
>>> your solution sounds good if safety is one major concern because
>>> option "weak" seems risky to me :-)
>>>
>>> Another interesting observation is that the performance of LL/SC-based
>>> CAS instruction deteriorates dramatically when the number of working
>>> threads exceeds the number of CPU cores.
>>
>> If weak is faster, would it make sense to return (~o), that is,
>> the bitwise complement of the expected arguement, when the weak
>> __atomic_compare_exchange_n() fails?  This would get the improved
>> performance (if I understand your results above) while correctly handling
>> the strange (but possible) case where o==n.
>>
>> Does that make sense, or am I missing something?
> 
> Hi Paul and Akira,
> 
> Yes, the weak version is faster. The solution looks good. But when I
> tried to use the following patch
> 
> #define cmpxchg(ptr, o, n) \
> ({ \
>         typeof(*ptr) old = (o); \
>         (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,

You need a "\" at the end of the line above. (If it was not unintentionally
wrapped.)

If it was wrapped by your mailer, which is troublesome in sending patches,
please refer to:

https://www.kernel.org/doc/html/latest/process/email-clients.html.

> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>                                 (o) : (~o); \
> })
> 
> gcc complains of my use of complement symbol
> 
> ../api.h:769:12: error: wrong type argument to bit-complement
>      (o) : (~o); \
>               ^
> 
> Any suggestions?

I don't see such error if I add the "\" mentioned above.
Or do you use some strict error checking option of GCC?

        Thanks, Akira

> 
> Thanks,
> --Junchang
> 
> 
>>
>>                                                         Thanx, Paul
>>
>>> Thanks,
>>> --Junchang
>>>
>>>
>>> On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@gmail.com> wrote:
>>>>
>>>> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
>>>>> Hi Akira and Paul,
>>>>>
>>>>> I checked the code today and the implementation of cmpxchg() looks
>>>>> very suspicious to me; Current  cmpxchg() first executes function
>>>>> __atomic_compare_exchange_n, and then checks whether the value stored
>>>>> in field __actual (old) has been changed to decide if the CAS
>>>>> instruction has been successfully performed. However, I saw the *weak*
>>>>> field is set, which, as far as I know, means
>>>>> __atomic_compare_exchange_n could fail even if the value of *ptr is
>>>>> equal to __actual (old). Unfortunately, current cmpxchg will treat
>>>>> this case as a success because the value of __actual(old) does not
>>>>> change.
>>>>
>>>> Thanks for looking into this!
>>>>
>>>> I also suspected the use of "weak" semantics of
>>>> __atomic_compare_exchange_n(), but did not quite understand what
>>>> "spurious fail" actually means. Your theory sounds plausible to me.
>>>>
>>>> I've suggested in a private email to Paul to modify the 4th argument
>>>> to false(0) as a workaround, which would prevent such "spurious fail".
>>>>
>>>> Both approaches looks good to me. I'd defer to Paul on the choice.
>>>>
>>>>         Thanks, Akira
>>>>
>>>>>
>>>>> This bug happens in both Power8 and ARMv8. It seems it affects
>>>>> architectures that use LL/SC to emulate CAS. Following patch helps
>>>>> solve this issue on my testbeds. Please take a look. Any thoughts?
>>>>>
>>>>> ---
>>>>>  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
>>>>>  1 file changed, 3 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h
>>>>> b/CodeSamples/api-pthreads/api-gcc.h
>>>>> index 1dd26ca..38a16c0 100644
>>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
>>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>>>>> @@ -166,11 +166,9 @@ struct __xchg_dummy {
>>>>>
>>>>>  #define cmpxchg(ptr, o, n) \
>>>>>  ({ \
>>>>> -       typeof(*ptr) _____actual = (o); \
>>>>> -       \
>>>>> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>>>> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>>>>> -       _____actual; \
>>>>> +       typeof(*ptr) old = (o); \
>>>>> +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
>>>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>>>>> +                               (o) : (n); \
>>>>>  })
>>>>>
>>>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>>>>
>>>>
>>>
>>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-25 14:09           ` Junchang Wang
  2018-10-25 15:17             ` Akira Yokosawa
@ 2018-10-25 15:24             ` Paul E. McKenney
  1 sibling, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2018-10-25 15:24 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Akira Yokosawa, perfbook

On Thu, Oct 25, 2018 at 10:09:22PM +0800, Junchang Wang wrote:
> On Thu, Oct 25, 2018 at 5:45 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> >
> > On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
> > > Hi Akira,
> > >
> > > Thanks for the mail. My understanding is that PPC uses LL/SC to
> > > emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
> > > could fail (due to, for example, context switches) even if *ptr equals
> > > to old. In such a case, a CAS instruction in actually should return a
> > > success. I think this is what the term "spurious fail" describes. Here
> > > is a reference:
> > > http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
> >
> > First, thank you both for your work on this!  And yes, my cmpxchg() code
> > is clearly quite broken.
> >
> > > It seems that __atomic_compare_exchange_n() provides option "weak" for
> > > performance. I tested these two solutions and got the following
> > > results:
> > >
> > >                            1      4      8     16     32    64
> > > my patch (ns)     35    34    37    73    142  281
> > > strong (ns)          39    39    41    79    158  313
> >
> > So strong is a bit slower, correct?
> >
> > > I tested the performance of count_lim_atomic by varying the number of
> > > updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
> > > first row in the table is the result when my patch is used, and the
> > > second row is the result when the 4th argument of the function is set
> > > to false(0). It seems performance improves slightly if option "weak"
> > > is used. However, there is no performance boost as we expected. So
> > > your solution sounds good if safety is one major concern because
> > > option "weak" seems risky to me :-)
> > >
> > > Another interesting observation is that the performance of LL/SC-based
> > > CAS instruction deteriorates dramatically when the number of working
> > > threads exceeds the number of CPU cores.
> >
> > If weak is faster, would it make sense to return (~o), that is,
> > the bitwise complement of the expected arguement, when the weak
> > __atomic_compare_exchange_n() fails?  This would get the improved
> > performance (if I understand your results above) while correctly handling
> > the strange (but possible) case where o==n.
> >
> > Does that make sense, or am I missing something?
> 
> Hi Paul and Akira,
> 
> Yes, the weak version is faster. The solution looks good. But when I
> tried to use the following patch
> 
> #define cmpxchg(ptr, o, n) \
> ({ \
>         typeof(*ptr) old = (o); \
>         (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>                                 (o) : (~o); \
> })
> 
> gcc complains of my use of complement symbol
> 
> ../api.h:769:12: error: wrong type argument to bit-complement
>      (o) : (~o); \
>               ^
> 
> Any suggestions?

You might need to do this for the macro argument: "(~(o))".

Another possibility is ((o) + 1), which would work for pointers as well
as for integers.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-25 15:17             ` Akira Yokosawa
@ 2018-10-25 22:04               ` Akira Yokosawa
  2018-10-26  0:58                 ` Junchang Wang
  2018-10-26  1:12                 ` Akira Yokosawa
  0 siblings, 2 replies; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-25 22:04 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Paul McKenney, perfbook, Akira Yokosawa

On 2018/10/26 0:17, Akira Yokosawa wrote:
> On 2018/10/25 23:09, Junchang Wang wrote:
>> On Thu, Oct 25, 2018 at 5:45 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>>>
>>> On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
>>>> Hi Akira,
>>>>
>>>> Thanks for the mail. My understanding is that PPC uses LL/SC to
>>>> emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
>>>> could fail (due to, for example, context switches) even if *ptr equals
>>>> to old. In such a case, a CAS instruction in actually should return a
>>>> success. I think this is what the term "spurious fail" describes. Here
>>>> is a reference:
>>>> http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
>>>
>>> First, thank you both for your work on this!  And yes, my cmpxchg() code
>>> is clearly quite broken.
>>>
>>>> It seems that __atomic_compare_exchange_n() provides option "weak" for
>>>> performance. I tested these two solutions and got the following
>>>> results:
>>>>
>>>>                            1      4      8     16     32    64
>>>> my patch (ns)     35    34    37    73    142  281
>>>> strong (ns)          39    39    41    79    158  313
>>>
>>> So strong is a bit slower, correct?
>>>
>>>> I tested the performance of count_lim_atomic by varying the number of
>>>> updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
>>>> first row in the table is the result when my patch is used, and the
>>>> second row is the result when the 4th argument of the function is set
>>>> to false(0). It seems performance improves slightly if option "weak"
>>>> is used. However, there is no performance boost as we expected. So
>>>> your solution sounds good if safety is one major concern because
>>>> option "weak" seems risky to me :-)
>>>>
>>>> Another interesting observation is that the performance of LL/SC-based
>>>> CAS instruction deteriorates dramatically when the number of working
>>>> threads exceeds the number of CPU cores.
>>>
>>> If weak is faster, would it make sense to return (~o), that is,
>>> the bitwise complement of the expected arguement, when the weak
>>> __atomic_compare_exchange_n() fails?  This would get the improved
>>> performance (if I understand your results above) while correctly handling
>>> the strange (but possible) case where o==n.
>>>
>>> Does that make sense, or am I missing something?
>>
>> Hi Paul and Akira,
>>
>> Yes, the weak version is faster. The solution looks good. But when I
>> tried to use the following patch
>>
>> #define cmpxchg(ptr, o, n) \
>> ({ \
>>         typeof(*ptr) old = (o); \
>>         (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> 
> You need a "\" at the end of the line above. (If it was not unintentionally
> wrapped.)
> 
> If it was wrapped by your mailer, which is troublesome in sending patches,
> please refer to:
> 
> https://www.kernel.org/doc/html/latest/process/email-clients.html.
> 
>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>>                                 (o) : (~o); \
>> })
>>
>> gcc complains of my use of complement symbol
>>
>> ../api.h:769:12: error: wrong type argument to bit-complement
>>      (o) : (~o); \
>>               ^
>>
>> Any suggestions?
> 
> I don't see such error if I add the "\" mentioned above.
> Or do you use some strict error checking option of GCC?

Ah, I see that the error in compiling CodeSamples/advsync/q.c.

The call site is:

struct el *q_pop(struct queue *q)
{
	struct el *p;
	struct el *pnext;

	for (;;) {
		do {
			p = q->head;
			pnext = p->next;
			if (pnext == NULL)
				return NULL;
		} while (cmpxchg(&q->head, p, pnext) != p);
		if (p != &q->dummy)
			return p;
		q_push(&q->dummy, q);
		if (q->head == &q->dummy)
			return NULL;
	}
}

In this case, p and pnext are pointers, hence the error.
returning (o)+1 instead should be OK in this case.

But now, "count_lim_atomic 3 hog" says:

    FAIL: only reached -1829 rather than 0

on x86_64. Hmm. No such error is observed on POWER8.
Hmm...

The strong version works both on x86_64 and POWER8.

        Thanks, Akira

> 
>         Thanks, Akira
> 
>>
>> Thanks,
>> --Junchang
>>
>>
>>>
>>>                                                         Thanx, Paul
>>>
>>>> Thanks,
>>>> --Junchang
>>>>
>>>>
>>>> On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@gmail.com> wrote:
>>>>>
>>>>> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
>>>>>> Hi Akira and Paul,
>>>>>>
>>>>>> I checked the code today and the implementation of cmpxchg() looks
>>>>>> very suspicious to me; Current  cmpxchg() first executes function
>>>>>> __atomic_compare_exchange_n, and then checks whether the value stored
>>>>>> in field __actual (old) has been changed to decide if the CAS
>>>>>> instruction has been successfully performed. However, I saw the *weak*
>>>>>> field is set, which, as far as I know, means
>>>>>> __atomic_compare_exchange_n could fail even if the value of *ptr is
>>>>>> equal to __actual (old). Unfortunately, current cmpxchg will treat
>>>>>> this case as a success because the value of __actual(old) does not
>>>>>> change.
>>>>>
>>>>> Thanks for looking into this!
>>>>>
>>>>> I also suspected the use of "weak" semantics of
>>>>> __atomic_compare_exchange_n(), but did not quite understand what
>>>>> "spurious fail" actually means. Your theory sounds plausible to me.
>>>>>
>>>>> I've suggested in a private email to Paul to modify the 4th argument
>>>>> to false(0) as a workaround, which would prevent such "spurious fail".
>>>>>
>>>>> Both approaches looks good to me. I'd defer to Paul on the choice.
>>>>>
>>>>>         Thanks, Akira
>>>>>
>>>>>>
>>>>>> This bug happens in both Power8 and ARMv8. It seems it affects
>>>>>> architectures that use LL/SC to emulate CAS. Following patch helps
>>>>>> solve this issue on my testbeds. Please take a look. Any thoughts?
>>>>>>
>>>>>> ---
>>>>>>  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
>>>>>>  1 file changed, 3 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h
>>>>>> b/CodeSamples/api-pthreads/api-gcc.h
>>>>>> index 1dd26ca..38a16c0 100644
>>>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
>>>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>>>>>> @@ -166,11 +166,9 @@ struct __xchg_dummy {
>>>>>>
>>>>>>  #define cmpxchg(ptr, o, n) \
>>>>>>  ({ \
>>>>>> -       typeof(*ptr) _____actual = (o); \
>>>>>> -       \
>>>>>> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>>>>> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>>>>>> -       _____actual; \
>>>>>> +       typeof(*ptr) old = (o); \
>>>>>> +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
>>>>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>>>>>> +                               (o) : (n); \
>>>>>>  })
>>>>>>
>>>>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>>>>>
>>>>>
>>>>
>>>
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-25 22:04               ` Akira Yokosawa
@ 2018-10-26  0:58                 ` Junchang Wang
  2018-10-27 14:56                   ` Akira Yokosawa
  2018-10-26  1:12                 ` Akira Yokosawa
  1 sibling, 1 reply; 25+ messages in thread
From: Junchang Wang @ 2018-10-26  0:58 UTC (permalink / raw)
  To: Akira Yokosawa, Paul McKenney; +Cc: perfbook

On Fri, Oct 26, 2018 at 6:04 AM Akira Yokosawa <akiyks@gmail.com> wrote:
>
> On 2018/10/26 0:17, Akira Yokosawa wrote:
> > On 2018/10/25 23:09, Junchang Wang wrote:
> >> On Thu, Oct 25, 2018 at 5:45 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> >>>
> >>> On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
> >>>> Hi Akira,
> >>>>
> >>>> Thanks for the mail. My understanding is that PPC uses LL/SC to
> >>>> emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
> >>>> could fail (due to, for example, context switches) even if *ptr equals
> >>>> to old. In such a case, a CAS instruction in actually should return a
> >>>> success. I think this is what the term "spurious fail" describes. Here
> >>>> is a reference:
> >>>> http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
> >>>
> >>> First, thank you both for your work on this!  And yes, my cmpxchg() code
> >>> is clearly quite broken.
> >>>
> >>>> It seems that __atomic_compare_exchange_n() provides option "weak" for
> >>>> performance. I tested these two solutions and got the following
> >>>> results:
> >>>>
> >>>>                            1      4      8     16     32    64
> >>>> my patch (ns)     35    34    37    73    142  281
> >>>> strong (ns)          39    39    41    79    158  313
> >>>
> >>> So strong is a bit slower, correct?
> >>>
> >>>> I tested the performance of count_lim_atomic by varying the number of
> >>>> updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
> >>>> first row in the table is the result when my patch is used, and the
> >>>> second row is the result when the 4th argument of the function is set
> >>>> to false(0). It seems performance improves slightly if option "weak"
> >>>> is used. However, there is no performance boost as we expected. So
> >>>> your solution sounds good if safety is one major concern because
> >>>> option "weak" seems risky to me :-)
> >>>>
> >>>> Another interesting observation is that the performance of LL/SC-based
> >>>> CAS instruction deteriorates dramatically when the number of working
> >>>> threads exceeds the number of CPU cores.
> >>>
> >>> If weak is faster, would it make sense to return (~o), that is,
> >>> the bitwise complement of the expected arguement, when the weak
> >>> __atomic_compare_exchange_n() fails?  This would get the improved
> >>> performance (if I understand your results above) while correctly handling
> >>> the strange (but possible) case where o==n.
> >>>
> >>> Does that make sense, or am I missing something?
> >>
> >> Hi Paul and Akira,
> >>
> >> Yes, the weak version is faster. The solution looks good. But when I
> >> tried to use the following patch
> >>
> >> #define cmpxchg(ptr, o, n) \
> >> ({ \
> >>         typeof(*ptr) old = (o); \
> >>         (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> >
> > You need a "\" at the end of the line above. (If it was not unintentionally
> > wrapped.)
> >
> > If it was wrapped by your mailer, which is troublesome in sending patches,
> > please refer to:
> >
> > https://www.kernel.org/doc/html/latest/process/email-clients.html.
> >
> >> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> >>                                 (o) : (~o); \
> >> })
> >>
> >> gcc complains of my use of complement symbol
> >>
> >> ../api.h:769:12: error: wrong type argument to bit-complement
> >>      (o) : (~o); \
> >>               ^
> >>
> >> Any suggestions?
> >
> > I don't see such error if I add the "\" mentioned above.
> > Or do you use some strict error checking option of GCC?
>
> Ah, I see that the error in compiling CodeSamples/advsync/q.c.
>
> The call site is:
>
> struct el *q_pop(struct queue *q)
> {
>         struct el *p;
>         struct el *pnext;
>
>         for (;;) {
>                 do {
>                         p = q->head;
>                         pnext = p->next;
>                         if (pnext == NULL)
>                                 return NULL;
>                 } while (cmpxchg(&q->head, p, pnext) != p);
>                 if (p != &q->dummy)
>                         return p;
>                 q_push(&q->dummy, q);
>                 if (q->head == &q->dummy)
>                         return NULL;
>         }
> }
>
> In this case, p and pnext are pointers, hence the error.
> returning (o)+1 instead should be OK in this case.
>

Hi Akira and Paul,

Returning (o)+1 if the CAS primitive fails works fine. Thanks a lot!

> But now, "count_lim_atomic 3 hog" says:
>
>     FAIL: only reached -1829 rather than 0
>
> on x86_64. Hmm. No such error is observed on POWER8.
> Hmm...
>
> The strong version works both on x86_64 and POWER8.

Indeed (cry) ! Thanks for checking this. I will take a look and work on this.

BTW, I found I'm not good in writing C macro (e.g., cmpxchg). Do you
know some specification/document on writing C macro functions in
Linux?

Thanks,
--Junchang




>
>         Thanks, Akira
>
> >
> >         Thanks, Akira
> >
> >>
> >> Thanks,
> >> --Junchang
> >>
> >>
> >>>
> >>>                                                         Thanx, Paul
> >>>
> >>>> Thanks,
> >>>> --Junchang
> >>>>
> >>>>
> >>>> On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@gmail.com> wrote:
> >>>>>
> >>>>> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
> >>>>>> Hi Akira and Paul,
> >>>>>>
> >>>>>> I checked the code today and the implementation of cmpxchg() looks
> >>>>>> very suspicious to me; Current  cmpxchg() first executes function
> >>>>>> __atomic_compare_exchange_n, and then checks whether the value stored
> >>>>>> in field __actual (old) has been changed to decide if the CAS
> >>>>>> instruction has been successfully performed. However, I saw the *weak*
> >>>>>> field is set, which, as far as I know, means
> >>>>>> __atomic_compare_exchange_n could fail even if the value of *ptr is
> >>>>>> equal to __actual (old). Unfortunately, current cmpxchg will treat
> >>>>>> this case as a success because the value of __actual(old) does not
> >>>>>> change.
> >>>>>
> >>>>> Thanks for looking into this!
> >>>>>
> >>>>> I also suspected the use of "weak" semantics of
> >>>>> __atomic_compare_exchange_n(), but did not quite understand what
> >>>>> "spurious fail" actually means. Your theory sounds plausible to me.
> >>>>>
> >>>>> I've suggested in a private email to Paul to modify the 4th argument
> >>>>> to false(0) as a workaround, which would prevent such "spurious fail".
> >>>>>
> >>>>> Both approaches looks good to me. I'd defer to Paul on the choice.
> >>>>>
> >>>>>         Thanks, Akira
> >>>>>
> >>>>>>
> >>>>>> This bug happens in both Power8 and ARMv8. It seems it affects
> >>>>>> architectures that use LL/SC to emulate CAS. Following patch helps
> >>>>>> solve this issue on my testbeds. Please take a look. Any thoughts?
> >>>>>>
> >>>>>> ---
> >>>>>>  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
> >>>>>>  1 file changed, 3 insertions(+), 5 deletions(-)
> >>>>>>
> >>>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h
> >>>>>> b/CodeSamples/api-pthreads/api-gcc.h
> >>>>>> index 1dd26ca..38a16c0 100644
> >>>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
> >>>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
> >>>>>> @@ -166,11 +166,9 @@ struct __xchg_dummy {
> >>>>>>
> >>>>>>  #define cmpxchg(ptr, o, n) \
> >>>>>>  ({ \
> >>>>>> -       typeof(*ptr) _____actual = (o); \
> >>>>>> -       \
> >>>>>> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >>>>>> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> >>>>>> -       _____actual; \
> >>>>>> +       typeof(*ptr) old = (o); \
> >>>>>> +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> >>>>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> >>>>>> +                               (o) : (n); \
> >>>>>>  })
> >>>>>>
> >>>>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> >>>>>>
> >>>>>
> >>>>
> >>>
> >
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-25 22:04               ` Akira Yokosawa
  2018-10-26  0:58                 ` Junchang Wang
@ 2018-10-26  1:12                 ` Akira Yokosawa
  2018-10-26 11:34                   ` Akira Yokosawa
  1 sibling, 1 reply; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-26  1:12 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Paul McKenney, perfbook

(from mobile, might be QP encoded)

2018/10/26 7:04、Akira Yokosawa <akiyks@gmail.com> wrote:

>> On 2018/10/26 0:17, Akira Yokosawa wrote:
>>> On 2018/10/25 23:09, Junchang Wang wrote:
>>>> On Thu, Oct 25, 2018 at 5:45 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>>>> 
>>>>> On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
>>>>> Hi Akira,
>>>>> 
>>>>> Thanks for the mail. My understanding is that PPC uses LL/SC to
>>>>> emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
>>>>> could fail (due to, for example, context switches) even if *ptr equals
>>>>> to old. In such a case, a CAS instruction in actually should return a
>>>>> success. I think this is what the term "spurious fail" describes. Here
>>>>> is a reference:
>>>>> http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
>>>> 
>>>> First, thank you both for your work on this!  And yes, my cmpxchg() code
>>>> is clearly quite broken.
>>>> 
>>>>> It seems that __atomic_compare_exchange_n() provides option "weak" for
>>>>> performance. I tested these two solutions and got the following
>>>>> results:
>>>>> 
>>>>>                           1      4      8     16     32    64
>>>>> my patch (ns)     35    34    37    73    142  281
>>>>> strong (ns)          39    39    41    79    158  313
>>>> 
>>>> So strong is a bit slower, correct?
>>>> 
>>>>> I tested the performance of count_lim_atomic by varying the number of
>>>>> updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
>>>>> first row in the table is the result when my patch is used, and the
>>>>> second row is the result when the 4th argument of the function is set
>>>>> to false(0). It seems performance improves slightly if option "weak"
>>>>> is used. However, there is no performance boost as we expected. So
>>>>> your solution sounds good if safety is one major concern because
>>>>> option "weak" seems risky to me :-)
>>>>> 
>>>>> Another interesting observation is that the performance of LL/SC-based
>>>>> CAS instruction deteriorates dramatically when the number of working
>>>>> threads exceeds the number of CPU cores.
>>>> 
>>>> If weak is faster, would it make sense to return (~o), that is,
>>>> the bitwise complement of the expected arguement, when the weak
>>>> __atomic_compare_exchange_n() fails?  This would get the improved
>>>> performance (if I understand your results above) while correctly handling
>>>> the strange (but possible) case where o==n.
>>>> 
>>>> Does that make sense, or am I missing something?
>>> 
>>> Hi Paul and Akira,
>>> 
>>> Yes, the weak version is faster. The solution looks good. But when I
>>> tried to use the following patch
>>> 
>>> #define cmpxchg(ptr, o, n) \
>>> ({ \
>>>        typeof(*ptr) old = (o); \
>>>        (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
>> 
>> You need a "\" at the end of the line above. (If it was not unintentionally
>> wrapped.)
>> 
>> If it was wrapped by your mailer, which is troublesome in sending patches,
>> please refer to:
>> 
>> https://www.kernel.org/doc/html/latest/process/email-clients.html.
>> 
>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>>>                                (o) : (~o); \
>>> })
>>> 
>>> gcc complains of my use of complement symbol
>>> 
>>> ../api.h:769:12: error: wrong type argument to bit-complement
>>>     (o) : (~o); \
>>>              ^
>>> 
>>> Any suggestions?
>> 
>> I don't see such error if I add the "\" mentioned above.
>> Or do you use some strict error checking option of GCC?
> 
> Ah, I see that the error in compiling CodeSamples/advsync/q.c.
> 
> The call site is:
> 
> struct el *q_pop(struct queue *q)
> {
>    struct el *p;
>    struct el *pnext;
> 
>    for (;;) {
>        do {
>            p = q->head;
>            pnext = p->next;
>            if (pnext == NULL)
>                return NULL;
>        } while (cmpxchg(&q->head, p, pnext) != p);
>        if (p != &q->dummy)
>            return p;
>        q_push(&q->dummy, q);
>        if (q->head == &q->dummy)
>            return NULL;
>    }
> }
> 
> In this case, p and pnext are pointers, hence the error.
> returning (o)+1 instead should be OK in this case.
> 
> But now, "count_lim_atomic 3 hog" says:
> 
>    FAIL: only reached -1829 rather than 0
> 
> on x86_64. Hmm. No such error is observed on POWER8.
> Hmm...

I tried the same source with (o)+1 on another x86_64 host (Ubuntu 16.04).

count_lim_atomic hog test runs flawlessly.

The host I tried earlier was a bit old SB laptop. I’ll try there again later.

Thanks, Akira

> 
> The strong version works both on x86_64 and POWER8.
> 
>        Thanks, Akira
> 
>> 
>>        Thanks, Akira
>> 
>>> 
>>> Thanks,
>>> --Junchang
>>> 
>>> 
>>>> 
>>>>                                                        Thanx, Paul
>>>> 
>>>>> Thanks,
>>>>> --Junchang
>>>>> 
>>>>> 
>>>>>> On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@gmail.com> wrote:
>>>>>> 
>>>>>>> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
>>>>>>> Hi Akira and Paul,
>>>>>>> 
>>>>>>> I checked the code today and the implementation of cmpxchg() looks
>>>>>>> very suspicious to me; Current  cmpxchg() first executes function
>>>>>>> __atomic_compare_exchange_n, and then checks whether the value stored
>>>>>>> in field __actual (old) has been changed to decide if the CAS
>>>>>>> instruction has been successfully performed. However, I saw the *weak*
>>>>>>> field is set, which, as far as I know, means
>>>>>>> __atomic_compare_exchange_n could fail even if the value of *ptr is
>>>>>>> equal to __actual (old). Unfortunately, current cmpxchg will treat
>>>>>>> this case as a success because the value of __actual(old) does not
>>>>>>> change.
>>>>>> 
>>>>>> Thanks for looking into this!
>>>>>> 
>>>>>> I also suspected the use of "weak" semantics of
>>>>>> __atomic_compare_exchange_n(), but did not quite understand what
>>>>>> "spurious fail" actually means. Your theory sounds plausible to me.
>>>>>> 
>>>>>> I've suggested in a private email to Paul to modify the 4th argument
>>>>>> to false(0) as a workaround, which would prevent such "spurious fail".
>>>>>> 
>>>>>> Both approaches looks good to me. I'd defer to Paul on the choice.
>>>>>> 
>>>>>>        Thanks, Akira
>>>>>> 
>>>>>>> 
>>>>>>> This bug happens in both Power8 and ARMv8. It seems it affects
>>>>>>> architectures that use LL/SC to emulate CAS. Following patch helps
>>>>>>> solve this issue on my testbeds. Please take a look. Any thoughts?
>>>>>>> 
>>>>>>> ---
>>>>>>> CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
>>>>>>> 1 file changed, 3 insertions(+), 5 deletions(-)
>>>>>>> 
>>>>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h
>>>>>>> b/CodeSamples/api-pthreads/api-gcc.h
>>>>>>> index 1dd26ca..38a16c0 100644
>>>>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
>>>>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>>>>>>> @@ -166,11 +166,9 @@ struct __xchg_dummy {
>>>>>>> 
>>>>>>> #define cmpxchg(ptr, o, n) \
>>>>>>> ({ \
>>>>>>> -       typeof(*ptr) _____actual = (o); \
>>>>>>> -       \
>>>>>>> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>>>>>> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>>>>>>> -       _____actual; \
>>>>>>> +       typeof(*ptr) old = (o); \
>>>>>>> +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
>>>>>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>>>>>>> +                               (o) : (n); \
>>>>>>> })
>>>>>>> 
>>>>>>> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-26  1:12                 ` Akira Yokosawa
@ 2018-10-26 11:34                   ` Akira Yokosawa
  2018-10-26 16:06                     ` Junchang Wang
  0 siblings, 1 reply; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-26 11:34 UTC (permalink / raw)
  To: Junchang Wang, Paul McKenney; +Cc: perfbook, Akira Yokosawa

On 2018/10/26 10:12:19 +0900, Akira Yokosawa wrote:
> (from mobile, might be QP encoded)
> 
> 2018/10/26 7:04、Akira Yokosawa <akiyks@gmail.com> wrote:
> 
>>> On 2018/10/26 0:17, Akira Yokosawa wrote:
>>>> On 2018/10/25 23:09, Junchang Wang wrote:
>>>>> On Thu, Oct 25, 2018 at 5:45 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>>>>>
>>>>>> On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
>>>>>> Hi Akira,
>>>>>>
>>>>>> Thanks for the mail. My understanding is that PPC uses LL/SC to
>>>>>> emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
>>>>>> could fail (due to, for example, context switches) even if *ptr equals
>>>>>> to old. In such a case, a CAS instruction in actually should return a
>>>>>> success. I think this is what the term "spurious fail" describes. Here
>>>>>> is a reference:
>>>>>> http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
>>>>>
>>>>> First, thank you both for your work on this!  And yes, my cmpxchg() code
>>>>> is clearly quite broken.
>>>>>
>>>>>> It seems that __atomic_compare_exchange_n() provides option "weak" for
>>>>>> performance. I tested these two solutions and got the following
>>>>>> results:
>>>>>>
>>>>>>                           1      4      8     16     32    64
>>>>>> my patch (ns)     35    34    37    73    142  281
>>>>>> strong (ns)          39    39    41    79    158  313
>>>>>
>>>>> So strong is a bit slower, correct?
>>>>>
>>>>>> I tested the performance of count_lim_atomic by varying the number of
>>>>>> updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
>>>>>> first row in the table is the result when my patch is used, and the
>>>>>> second row is the result when the 4th argument of the function is set
>>>>>> to false(0). It seems performance improves slightly if option "weak"
>>>>>> is used. However, there is no performance boost as we expected. So
>>>>>> your solution sounds good if safety is one major concern because
>>>>>> option "weak" seems risky to me :-)
>>>>>>
>>>>>> Another interesting observation is that the performance of LL/SC-based
>>>>>> CAS instruction deteriorates dramatically when the number of working
>>>>>> threads exceeds the number of CPU cores.
>>>>>
>>>>> If weak is faster, would it make sense to return (~o), that is,
>>>>> the bitwise complement of the expected arguement, when the weak
>>>>> __atomic_compare_exchange_n() fails?  This would get the improved
>>>>> performance (if I understand your results above) while correctly handling
>>>>> the strange (but possible) case where o==n.
>>>>>
>>>>> Does that make sense, or am I missing something?
>>>>
>>>> Hi Paul and Akira,
>>>>
>>>> Yes, the weak version is faster. The solution looks good. But when I
>>>> tried to use the following patch
>>>>
>>>> #define cmpxchg(ptr, o, n) \
>>>> ({ \
>>>>        typeof(*ptr) old = (o); \
>>>>        (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
>>>
>>> You need a "\" at the end of the line above. (If it was not unintentionally
>>> wrapped.)
>>>
>>> If it was wrapped by your mailer, which is troublesome in sending patches,
>>> please refer to:
>>>
>>> https://www.kernel.org/doc/html/latest/process/email-clients.html.
>>>
>>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>>>>                                (o) : (~o); \
>>>> })
>>>>
>>>> gcc complains of my use of complement symbol
>>>>
>>>> ../api.h:769:12: error: wrong type argument to bit-complement
>>>>     (o) : (~o); \
>>>>              ^
>>>>
>>>> Any suggestions?
>>>
>>> I don't see such error if I add the "\" mentioned above.
>>> Or do you use some strict error checking option of GCC?
>>
>> Ah, I see that the error in compiling CodeSamples/advsync/q.c.
>>
>> The call site is:
>>
>> struct el *q_pop(struct queue *q)
>> {
>>    struct el *p;
>>    struct el *pnext;
>>
>>    for (;;) {
>>        do {
>>            p = q->head;
>>            pnext = p->next;
>>            if (pnext == NULL)
>>                return NULL;
>>        } while (cmpxchg(&q->head, p, pnext) != p);
>>        if (p != &q->dummy)
>>            return p;
>>        q_push(&q->dummy, q);
>>        if (q->head == &q->dummy)
>>            return NULL;
>>    }
>> }
>>
>> In this case, p and pnext are pointers, hence the error.
>> returning (o)+1 instead should be OK in this case.
>>
>> But now, "count_lim_atomic 3 hog" says:
>>
>>    FAIL: only reached -1829 rather than 0
>>
>> on x86_64. Hmm. No such error is observed on POWER8.
>> Hmm...
> 
> I tried the same source with (o)+1 on another x86_64 host (Ubuntu 16.04).
> 
> count_lim_atomic hog test runs flawlessly.
> 
> The host I tried earlier was a bit old SB laptop. I’ll try there again later.

And I _did_ make a stupid typo there. I tried with "(o+1)", not "(o)+1".
Apologies...

So all clear for Junchang's patch with Paul's minor tweak.

FWIW, diff of my version looks like the following:

diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
index 1dd26ca..2e65998 100644
--- a/CodeSamples/api-pthreads/api-gcc.h
+++ b/CodeSamples/api-pthreads/api-gcc.h
@@ -168,9 +168,8 @@ struct __xchg_dummy {
 ({ \
 	typeof(*ptr) _____actual = (o); \
 	\
-	(void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
-					  __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
-	_____actual; \
+	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
+				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
 })
 
 static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
---

        Thanks, Akira 

> 
> Thanks, Akira
> 
>>
>> The strong version works both on x86_64 and POWER8.
>>
>>        Thanks, Akira
>>
>>>
>>>        Thanks, Akira
>>>
>>>>
>>>> Thanks,
>>>> --Junchang
>>>>
>>>>
>>>>>
>>>>>                                                        Thanx, Paul
>>>>>
>>>>>> Thanks,
>>>>>> --Junchang
>>>>>>
>>>>>>
>>>>>>> On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@gmail.com> wrote:
>>>>>>>
>>>>>>>> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
>>>>>>>> Hi Akira and Paul,
>>>>>>>>
>>>>>>>> I checked the code today and the implementation of cmpxchg() looks
>>>>>>>> very suspicious to me; Current  cmpxchg() first executes function
>>>>>>>> __atomic_compare_exchange_n, and then checks whether the value stored
>>>>>>>> in field __actual (old) has been changed to decide if the CAS
>>>>>>>> instruction has been successfully performed. However, I saw the *weak*
>>>>>>>> field is set, which, as far as I know, means
>>>>>>>> __atomic_compare_exchange_n could fail even if the value of *ptr is
>>>>>>>> equal to __actual (old). Unfortunately, current cmpxchg will treat
>>>>>>>> this case as a success because the value of __actual(old) does not
>>>>>>>> change.
>>>>>>>
>>>>>>> Thanks for looking into this!
>>>>>>>
>>>>>>> I also suspected the use of "weak" semantics of
>>>>>>> __atomic_compare_exchange_n(), but did not quite understand what
>>>>>>> "spurious fail" actually means. Your theory sounds plausible to me.
>>>>>>>
>>>>>>> I've suggested in a private email to Paul to modify the 4th argument
>>>>>>> to false(0) as a workaround, which would prevent such "spurious fail".
>>>>>>>
>>>>>>> Both approaches looks good to me. I'd defer to Paul on the choice.
>>>>>>>
>>>>>>>        Thanks, Akira
>>>>>>>
>>>>>>>>
>>>>>>>> This bug happens in both Power8 and ARMv8. It seems it affects
>>>>>>>> architectures that use LL/SC to emulate CAS. Following patch helps
>>>>>>>> solve this issue on my testbeds. Please take a look. Any thoughts?
>>>>>>>>
>>>>>>>> ---
>>>>>>>> CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
>>>>>>>> 1 file changed, 3 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h
>>>>>>>> b/CodeSamples/api-pthreads/api-gcc.h
>>>>>>>> index 1dd26ca..38a16c0 100644
>>>>>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
>>>>>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>>>>>>>> @@ -166,11 +166,9 @@ struct __xchg_dummy {
>>>>>>>>
>>>>>>>> #define cmpxchg(ptr, o, n) \
>>>>>>>> ({ \
>>>>>>>> -       typeof(*ptr) _____actual = (o); \
>>>>>>>> -       \
>>>>>>>> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>>>>>>> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>>>>>>>> -       _____actual; \
>>>>>>>> +       typeof(*ptr) old = (o); \
>>>>>>>> +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
>>>>>>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>>>>>>>> +                               (o) : (n); \
>>>>>>>> })
>>>>>>>>
>>>>>>>> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-26 11:34                   ` Akira Yokosawa
@ 2018-10-26 16:06                     ` Junchang Wang
  0 siblings, 0 replies; 25+ messages in thread
From: Junchang Wang @ 2018-10-26 16:06 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Paul McKenney, perfbook

On Fri, Oct 26, 2018 at 7:34 PM Akira Yokosawa <akiyks@gmail.com> wrote:
>
> On 2018/10/26 10:12:19 +0900, Akira Yokosawa wrote:
> > (from mobile, might be QP encoded)
> >
> > 2018/10/26 7:04、Akira Yokosawa <akiyks@gmail.com> wrote:
> >
> >>> On 2018/10/26 0:17, Akira Yokosawa wrote:
> >>>> On 2018/10/25 23:09, Junchang Wang wrote:
> >>>>> On Thu, Oct 25, 2018 at 5:45 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> >>>>>
> >>>>>> On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
> >>>>>> Hi Akira,
> >>>>>>
> >>>>>> Thanks for the mail. My understanding is that PPC uses LL/SC to
> >>>>>> emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
> >>>>>> could fail (due to, for example, context switches) even if *ptr equals
> >>>>>> to old. In such a case, a CAS instruction in actually should return a
> >>>>>> success. I think this is what the term "spurious fail" describes. Here
> >>>>>> is a reference:
> >>>>>> http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
> >>>>>
> >>>>> First, thank you both for your work on this!  And yes, my cmpxchg() code
> >>>>> is clearly quite broken.
> >>>>>
> >>>>>> It seems that __atomic_compare_exchange_n() provides option "weak" for
> >>>>>> performance. I tested these two solutions and got the following
> >>>>>> results:
> >>>>>>
> >>>>>>                           1      4      8     16     32    64
> >>>>>> my patch (ns)     35    34    37    73    142  281
> >>>>>> strong (ns)          39    39    41    79    158  313
> >>>>>
> >>>>> So strong is a bit slower, correct?
> >>>>>
> >>>>>> I tested the performance of count_lim_atomic by varying the number of
> >>>>>> updaters (./count_lim_atomic N uperf) on a 8-core PPC server. The
> >>>>>> first row in the table is the result when my patch is used, and the
> >>>>>> second row is the result when the 4th argument of the function is set
> >>>>>> to false(0). It seems performance improves slightly if option "weak"
> >>>>>> is used. However, there is no performance boost as we expected. So
> >>>>>> your solution sounds good if safety is one major concern because
> >>>>>> option "weak" seems risky to me :-)
> >>>>>>
> >>>>>> Another interesting observation is that the performance of LL/SC-based
> >>>>>> CAS instruction deteriorates dramatically when the number of working
> >>>>>> threads exceeds the number of CPU cores.
> >>>>>
> >>>>> If weak is faster, would it make sense to return (~o), that is,
> >>>>> the bitwise complement of the expected arguement, when the weak
> >>>>> __atomic_compare_exchange_n() fails?  This would get the improved
> >>>>> performance (if I understand your results above) while correctly handling
> >>>>> the strange (but possible) case where o==n.
> >>>>>
> >>>>> Does that make sense, or am I missing something?
> >>>>
> >>>> Hi Paul and Akira,
> >>>>
> >>>> Yes, the weak version is faster. The solution looks good. But when I
> >>>> tried to use the following patch
> >>>>
> >>>> #define cmpxchg(ptr, o, n) \
> >>>> ({ \
> >>>>        typeof(*ptr) old = (o); \
> >>>>        (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> >>>
> >>> You need a "\" at the end of the line above. (If it was not unintentionally
> >>> wrapped.)
> >>>
> >>> If it was wrapped by your mailer, which is troublesome in sending patches,
> >>> please refer to:
> >>>
> >>> https://www.kernel.org/doc/html/latest/process/email-clients.html.
> >>>
> >>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> >>>>                                (o) : (~o); \
> >>>> })
> >>>>
> >>>> gcc complains of my use of complement symbol
> >>>>
> >>>> ../api.h:769:12: error: wrong type argument to bit-complement
> >>>>     (o) : (~o); \
> >>>>              ^
> >>>>
> >>>> Any suggestions?
> >>>
> >>> I don't see such error if I add the "\" mentioned above.
> >>> Or do you use some strict error checking option of GCC?
> >>
> >> Ah, I see that the error in compiling CodeSamples/advsync/q.c.
> >>
> >> The call site is:
> >>
> >> struct el *q_pop(struct queue *q)
> >> {
> >>    struct el *p;
> >>    struct el *pnext;
> >>
> >>    for (;;) {
> >>        do {
> >>            p = q->head;
> >>            pnext = p->next;
> >>            if (pnext == NULL)
> >>                return NULL;
> >>        } while (cmpxchg(&q->head, p, pnext) != p);
> >>        if (p != &q->dummy)
> >>            return p;
> >>        q_push(&q->dummy, q);
> >>        if (q->head == &q->dummy)
> >>            return NULL;
> >>    }
> >> }
> >>
> >> In this case, p and pnext are pointers, hence the error.
> >> returning (o)+1 instead should be OK in this case.
> >>
> >> But now, "count_lim_atomic 3 hog" says:
> >>
> >>    FAIL: only reached -1829 rather than 0
> >>
> >> on x86_64. Hmm. No such error is observed on POWER8.
> >> Hmm...
> >
> > I tried the same source with (o)+1 on another x86_64 host (Ubuntu 16.04).
> >
> > count_lim_atomic hog test runs flawlessly.
> >
> > The host I tried earlier was a bit old SB laptop. I’ll try there again later.
>
> And I _did_ make a stupid typo there. I tried with "(o+1)", not "(o)+1".
> Apologies...
>
> So all clear for Junchang's patch with Paul's minor tweak.
>
> FWIW, diff of my version looks like the following:
>
> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
> index 1dd26ca..2e65998 100644
> --- a/CodeSamples/api-pthreads/api-gcc.h
> +++ b/CodeSamples/api-pthreads/api-gcc.h
> @@ -168,9 +168,8 @@ struct __xchg_dummy {
>  ({ \
>         typeof(*ptr) _____actual = (o); \
>         \
> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> -       _____actual; \
> +       __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> +                                   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
>  })
>
>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> ---

Hi Akira,

Thanks a lot. Yes, the patch works for my x86 laptop and servers. I
have submitted the path in an independent thread.


Thanks,
--Junchang





>
>         Thanks, Akira
>
> >
> > Thanks, Akira
> >
> >>
> >> The strong version works both on x86_64 and POWER8.
> >>
> >>        Thanks, Akira
> >>
> >>>
> >>>        Thanks, Akira
> >>>
> >>>>
> >>>> Thanks,
> >>>> --Junchang
> >>>>
> >>>>
> >>>>>
> >>>>>                                                        Thanx, Paul
> >>>>>
> >>>>>> Thanks,
> >>>>>> --Junchang
> >>>>>>
> >>>>>>
> >>>>>>> On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
> >>>>>>>> Hi Akira and Paul,
> >>>>>>>>
> >>>>>>>> I checked the code today and the implementation of cmpxchg() looks
> >>>>>>>> very suspicious to me; Current  cmpxchg() first executes function
> >>>>>>>> __atomic_compare_exchange_n, and then checks whether the value stored
> >>>>>>>> in field __actual (old) has been changed to decide if the CAS
> >>>>>>>> instruction has been successfully performed. However, I saw the *weak*
> >>>>>>>> field is set, which, as far as I know, means
> >>>>>>>> __atomic_compare_exchange_n could fail even if the value of *ptr is
> >>>>>>>> equal to __actual (old). Unfortunately, current cmpxchg will treat
> >>>>>>>> this case as a success because the value of __actual(old) does not
> >>>>>>>> change.
> >>>>>>>
> >>>>>>> Thanks for looking into this!
> >>>>>>>
> >>>>>>> I also suspected the use of "weak" semantics of
> >>>>>>> __atomic_compare_exchange_n(), but did not quite understand what
> >>>>>>> "spurious fail" actually means. Your theory sounds plausible to me.
> >>>>>>>
> >>>>>>> I've suggested in a private email to Paul to modify the 4th argument
> >>>>>>> to false(0) as a workaround, which would prevent such "spurious fail".
> >>>>>>>
> >>>>>>> Both approaches looks good to me. I'd defer to Paul on the choice.
> >>>>>>>
> >>>>>>>        Thanks, Akira
> >>>>>>>
> >>>>>>>>
> >>>>>>>> This bug happens in both Power8 and ARMv8. It seems it affects
> >>>>>>>> architectures that use LL/SC to emulate CAS. Following patch helps
> >>>>>>>> solve this issue on my testbeds. Please take a look. Any thoughts?
> >>>>>>>>
> >>>>>>>> ---
> >>>>>>>> CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
> >>>>>>>> 1 file changed, 3 insertions(+), 5 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h
> >>>>>>>> b/CodeSamples/api-pthreads/api-gcc.h
> >>>>>>>> index 1dd26ca..38a16c0 100644
> >>>>>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
> >>>>>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
> >>>>>>>> @@ -166,11 +166,9 @@ struct __xchg_dummy {
> >>>>>>>>
> >>>>>>>> #define cmpxchg(ptr, o, n) \
> >>>>>>>> ({ \
> >>>>>>>> -       typeof(*ptr) _____actual = (o); \
> >>>>>>>> -       \
> >>>>>>>> -       (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >>>>>>>> -                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> >>>>>>>> -       _____actual; \
> >>>>>>>> +       typeof(*ptr) old = (o); \
> >>>>>>>> +       (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> >>>>>>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> >>>>>>>> +                               (o) : (n); \
> >>>>>>>> })
> >>>>>>>>
> >>>>>>>> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-26  0:58                 ` Junchang Wang
@ 2018-10-27 14:56                   ` Akira Yokosawa
       [not found]                     ` <20181028001723.GJ4170@linux.ibm.com>
  0 siblings, 1 reply; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-27 14:56 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Paul McKenney, perfbook, Akira Yokosawa

On 2018/10/26 08:58:30 +0800, Junchang Wang wrote:
[...]
> 
> BTW, I found I'm not good in writing C macro (e.g., cmpxchg). Do you
> know some specification/document on writing C macro functions in
> Linux?

Although I'm not qualified as a kernel developer,
Linux kernel's "coding style" has a section on this. See:

https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl

In that regard, macros I added in commit b2acf6239a95
("count: Tweak counttorture.h to avoid segfault") do not meet
the style guide in a couple of ways:

    1) Inline functions are preferable to macros resembling functions
    2) Macros with multiple statements should be enclosed in a do - while block
    3) ...

Any idea for improving them is more than welcome!

        Thanks, Akira 

> 
> Thanks,
> --Junchang
> 
[...]


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
       [not found]                     ` <20181028001723.GJ4170@linux.ibm.com>
@ 2018-10-28 12:08                       ` Junchang Wang
  2018-10-28 13:19                         ` Paul E. McKenney
  2018-10-28 13:22                         ` Akira Yokosawa
  2018-10-28 14:24                       ` Akira Yokosawa
  1 sibling, 2 replies; 25+ messages in thread
From: Junchang Wang @ 2018-10-28 12:08 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Akira Yokosawa, perfbook

On Sun, Oct 28, 2018 at 8:17 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>
> On Sat, Oct 27, 2018 at 11:56:54PM +0900, Akira Yokosawa wrote:
> > On 2018/10/26 08:58:30 +0800, Junchang Wang wrote:
> > [...]
> > >
> > > BTW, I found I'm not good in writing C macro (e.g., cmpxchg). Do you
> > > know some specification/document on writing C macro functions in
> > > Linux?
> >
> > Although I'm not qualified as a kernel developer,
> > Linux kernel's "coding style" has a section on this. See:
> >
> > https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl
> >
> > In that regard, macros I added in commit b2acf6239a95
> > ("count: Tweak counttorture.h to avoid segfault") do not meet
> > the style guide in a couple of ways:
> >
> >     1) Inline functions are preferable to macros resembling functions
> >     2) Macros with multiple statements should be enclosed in a do - while block
> >     3) ...
> >
> > Any idea for improving them is more than welcome!
>
> Let's see...
>
> #define cmpxchg(ptr, o, n) \
> ({ \
>         typeof(*ptr) _____actual = (o); \
>         \
>         __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> })
>
> We cannot do #1 because cmpxchg() is type-generic, and thus cannot be
> implemented as a C function.  (C++ could use templates, but we are not
> writing C++ here.)
>
> We cannot do #2 because cmpxchg() must return a value.
>
> Indentation is not perfect, but given the long names really cannot be
> improved all that much, if at all.
>
> However, we do have a problem, namely the multiple uses of "o", which
> would be very bad if "o" was an expression with side-effects.
>
> How about the following?
>
> #define cmpxchg(ptr, o, n) \
> ({ \
>         typeof(*ptr) _____old = (o); \
>         typeof(*ptr) _____actual = _____old; \
>         \
>         __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
>                         ? _____old : _____old + 1; \
> })
>
> This still might have problems with signed integer overflow, but I am
> inclined to ignore that for the moment.  Because paying attention to it
> results in something like this:
>
> #define cmpxchg(ptr, o, n) \
> ({ \
>         typeof(*ptr) _____old = (o); \
>         typeof(*ptr) _____actual = _____old; \
>         \
>         __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) \
>                         ? _____old \
>                         : _____old > 0 ? _____old - 1; : _____old + 1; \
> })
>
> Thoughts?  Most especially, any better ideas?
>

Hi Paul and Akira,

Thanks a lot for the mail. I have been curious about why cmpxchg()
sticks to returning the old value to notify the caller if the CAS
operation succeeds. Besides the overflow issue mentioned in Paul's
last mail, current cmpxchg() can only be used in the control flow of
"if CAS fails, do something" (Control Flow 1). However, it cannot be
used in the control flow of "if CAS succeeds, do something" (Control
Flow 2).

So another option is that cmpxchg() could just return true or false,
and if the caller needs the current value of the content of the
specified memory address, it could read the value out of field *old*.
Of course, we must adjust the parameters of cmpxchg() slightly by
passing the address of *old* to the function. Here is how cmpxchg()
looks like in my mind:

#define cmpxchg(ptr, o, n) \
({ \
        __atomic_compare_exchange_n(ptr, (void *)(o), (n), 1, \
                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
})

static __inline__ int atomic_cmpxchg(atomic_t *v, int *old, int new)
{
        return cmpxchg(&v->counter, old, new);
}

Any thoughts? Or did I miss something here? I will send the full patch
in another thread in case you want to review the code.


Thanks,
--Junchang


>                                                         Thanx, Paul
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-28 12:08                       ` Junchang Wang
@ 2018-10-28 13:19                         ` Paul E. McKenney
  2018-10-28 13:22                         ` Akira Yokosawa
  1 sibling, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2018-10-28 13:19 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Akira Yokosawa, perfbook

On Sun, Oct 28, 2018 at 08:08:21PM +0800, Junchang Wang wrote:
> On Sun, Oct 28, 2018 at 8:17 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> >
> > On Sat, Oct 27, 2018 at 11:56:54PM +0900, Akira Yokosawa wrote:
> > > On 2018/10/26 08:58:30 +0800, Junchang Wang wrote:
> > > [...]
> > > >
> > > > BTW, I found I'm not good in writing C macro (e.g., cmpxchg). Do you
> > > > know some specification/document on writing C macro functions in
> > > > Linux?
> > >
> > > Although I'm not qualified as a kernel developer,
> > > Linux kernel's "coding style" has a section on this. See:
> > >
> > > https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl
> > >
> > > In that regard, macros I added in commit b2acf6239a95
> > > ("count: Tweak counttorture.h to avoid segfault") do not meet
> > > the style guide in a couple of ways:
> > >
> > >     1) Inline functions are preferable to macros resembling functions
> > >     2) Macros with multiple statements should be enclosed in a do - while block
> > >     3) ...
> > >
> > > Any idea for improving them is more than welcome!
> >
> > Let's see...
> >
> > #define cmpxchg(ptr, o, n) \
> > ({ \
> >         typeof(*ptr) _____actual = (o); \
> >         \
> >         __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> > })
> >
> > We cannot do #1 because cmpxchg() is type-generic, and thus cannot be
> > implemented as a C function.  (C++ could use templates, but we are not
> > writing C++ here.)
> >
> > We cannot do #2 because cmpxchg() must return a value.
> >
> > Indentation is not perfect, but given the long names really cannot be
> > improved all that much, if at all.
> >
> > However, we do have a problem, namely the multiple uses of "o", which
> > would be very bad if "o" was an expression with side-effects.
> >
> > How about the following?
> >
> > #define cmpxchg(ptr, o, n) \
> > ({ \
> >         typeof(*ptr) _____old = (o); \
> >         typeof(*ptr) _____actual = _____old; \
> >         \
> >         __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
> >                         ? _____old : _____old + 1; \
> > })
> >
> > This still might have problems with signed integer overflow, but I am
> > inclined to ignore that for the moment.  Because paying attention to it
> > results in something like this:
> >
> > #define cmpxchg(ptr, o, n) \
> > ({ \
> >         typeof(*ptr) _____old = (o); \
> >         typeof(*ptr) _____actual = _____old; \
> >         \
> >         __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) \
> >                         ? _____old \
> >                         : _____old > 0 ? _____old - 1; : _____old + 1; \
> > })
> >
> > Thoughts?  Most especially, any better ideas?
> >
> 
> Hi Paul and Akira,
> 
> Thanks a lot for the mail. I have been curious about why cmpxchg()
> sticks to returning the old value to notify the caller if the CAS
> operation succeeds. Besides the overflow issue mentioned in Paul's
> last mail, current cmpxchg() can only be used in the control flow of
> "if CAS fails, do something" (Control Flow 1). However, it cannot be
> used in the control flow of "if CAS succeeds, do something" (Control
> Flow 2).
> 
> So another option is that cmpxchg() could just return true or false,
> and if the caller needs the current value of the content of the
> specified memory address, it could read the value out of field *old*.
> Of course, we must adjust the parameters of cmpxchg() slightly by
> passing the address of *old* to the function. Here is how cmpxchg()
> looks like in my mind:
> 
> #define cmpxchg(ptr, o, n) \
> ({ \
>         __atomic_compare_exchange_n(ptr, (void *)(o), (n), 1, \
>                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> })
> 
> static __inline__ int atomic_cmpxchg(atomic_t *v, int *old, int new)
> {
>         return cmpxchg(&v->counter, old, new);
> }
> 
> Any thoughts? Or did I miss something here? I will send the full patch
> in another thread in case you want to review the code.

The reason perfbook's cmpxchg() returns the old value is that the Linux
kernel's cmpxchg() returns the old value.  The reason that the Linux
kernel's cmpxchg() returns the old value is that doing so allows a tighter
retry loop -- it is not necessary to separately reload the current value.
This does mean that one disadvantage of returning a made-up value is
that the next iteration would likely be starting from the wrong value,
but this should not be a real problem given that spurious failure should
be rare.  After all, if spurious failure is not rare, we have way more
serious performance issues.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-28 12:08                       ` Junchang Wang
  2018-10-28 13:19                         ` Paul E. McKenney
@ 2018-10-28 13:22                         ` Akira Yokosawa
  1 sibling, 0 replies; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-28 13:22 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Paul McKenney, perfbook, Akira Yokosawa

On 2018/10/28 21:08, Junchang Wang wrote:
> On Sun, Oct 28, 2018 at 8:17 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>>
>> On Sat, Oct 27, 2018 at 11:56:54PM +0900, Akira Yokosawa wrote:
>>> On 2018/10/26 08:58:30 +0800, Junchang Wang wrote:
>>> [...]
>>>>
>>>> BTW, I found I'm not good in writing C macro (e.g., cmpxchg). Do you
>>>> know some specification/document on writing C macro functions in
>>>> Linux?
>>>
>>> Although I'm not qualified as a kernel developer,
>>> Linux kernel's "coding style" has a section on this. See:
>>>
>>> https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl
>>>
>>> In that regard, macros I added in commit b2acf6239a95
>>> ("count: Tweak counttorture.h to avoid segfault") do not meet
>>> the style guide in a couple of ways:
>>>
>>>     1) Inline functions are preferable to macros resembling functions
>>>     2) Macros with multiple statements should be enclosed in a do - while block
>>>     3) ...
>>>
>>> Any idea for improving them is more than welcome!
>>
>> Let's see...
>>
>> #define cmpxchg(ptr, o, n) \
>> ({ \
>>         typeof(*ptr) _____actual = (o); \
>>         \
>>         __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
>> })
>>
>> We cannot do #1 because cmpxchg() is type-generic, and thus cannot be
>> implemented as a C function.  (C++ could use templates, but we are not
>> writing C++ here.)
>>
>> We cannot do #2 because cmpxchg() must return a value.
>>
>> Indentation is not perfect, but given the long names really cannot be
>> improved all that much, if at all.
>>
>> However, we do have a problem, namely the multiple uses of "o", which
>> would be very bad if "o" was an expression with side-effects.
>>
>> How about the following?
>>
>> #define cmpxchg(ptr, o, n) \
>> ({ \
>>         typeof(*ptr) _____old = (o); \
>>         typeof(*ptr) _____actual = _____old; \
>>         \
>>         __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
>>                         ? _____old : _____old + 1; \
>> })
>>
>> This still might have problems with signed integer overflow, but I am
>> inclined to ignore that for the moment.  Because paying attention to it
>> results in something like this:
>>
>> #define cmpxchg(ptr, o, n) \
>> ({ \
>>         typeof(*ptr) _____old = (o); \
>>         typeof(*ptr) _____actual = _____old; \
>>         \
>>         __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) \
>>                         ? _____old \
>>                         : _____old > 0 ? _____old - 1; : _____old + 1; \
>> })
>>
>> Thoughts?  Most especially, any better ideas?
>>
> 
> Hi Paul and Akira,
> 
> Thanks a lot for the mail. I have been curious about why cmpxchg()
> sticks to returning the old value to notify the caller if the CAS
> operation succeeds. Besides the overflow issue mentioned in Paul's
> last mail, current cmpxchg() can only be used in the control flow of
> "if CAS fails, do something" (Control Flow 1). However, it cannot be
> used in the control flow of "if CAS succeeds, do something" (Control
> Flow 2).
> 
> So another option is that cmpxchg() could just return true or false,
> and if the caller needs the current value of the content of the
> specified memory address, it could read the value out of field *old*.
> Of course, we must adjust the parameters of cmpxchg() slightly by
> passing the address of *old* to the function. Here is how cmpxchg()
> looks like in my mind:
> 
> #define cmpxchg(ptr, o, n) \
> ({ \
>         __atomic_compare_exchange_n(ptr, (void *)(o), (n), 1, \
>                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> })
> 
> static __inline__ int atomic_cmpxchg(atomic_t *v, int *old, int new)
> {
>         return cmpxchg(&v->counter, old, new);
> }
> 
> Any thoughts? Or did I miss something here? I will send the full patch
> in another thread in case you want to review the code.

Well, perfbook CodeSamples' API mimics that of Linux kernel in user land.
So changing behavior of existing API is not acceptable, I suppose.

There might already be such an API in Linux kernel, but I'm not sure.

Ah, looks like Paul has answered that point.

        Thanks, Akira


> 
> 
> Thanks,
> --Junchang
> 
> 
>>                                                         Thanx, Paul
>>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
       [not found]                     ` <20181028001723.GJ4170@linux.ibm.com>
  2018-10-28 12:08                       ` Junchang Wang
@ 2018-10-28 14:24                       ` Akira Yokosawa
  2018-10-28 16:43                         ` Paul E. McKenney
  1 sibling, 1 reply; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-28 14:24 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Junchang Wang, perfbook, Akira Yokosawa

On 2018/10/27 17:17:23 -0700, Paul E. McKenney wrote:
> On Sat, Oct 27, 2018 at 11:56:54PM +0900, Akira Yokosawa wrote:
>> On 2018/10/26 08:58:30 +0800, Junchang Wang wrote:
>> [...]
>>>
>>> BTW, I found I'm not good in writing C macro (e.g., cmpxchg). Do you
>>> know some specification/document on writing C macro functions in
>>> Linux?
>>
>> Although I'm not qualified as a kernel developer,
>> Linux kernel's "coding style" has a section on this. See:
>>
>> https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl
>>
>> In that regard, macros I added in commit b2acf6239a95
>> ("count: Tweak counttorture.h to avoid segfault") do not meet
>> the style guide in a couple of ways:
>>
>>     1) Inline functions are preferable to macros resembling functions
>>     2) Macros with multiple statements should be enclosed in a do - while block
>>     3) ...
>>
>> Any idea for improving them is more than welcome!
> 
> Let's see...
> 
> #define cmpxchg(ptr, o, n) \
> ({ \
> 	typeof(*ptr) _____actual = (o); \
> 	\
> 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> })

Oh, my concern was macros I added in counttorture.h to support
#ifndef KEEP_GCC_THREAD_LOCAL.

But those macros are used solely in the header file, so the current
definition might be good enough.

OTH, macros defined in api-gcc.h should be made as robust as possible.
Hence your review of cmpxchg() is quite instructive to me.

> 
> We cannot do #1 because cmpxchg() is type-generic, and thus cannot be
> implemented as a C function.  (C++ could use templates, but we are not
> writing C++ here.)
> 
> We cannot do #2 because cmpxchg() must return a value.
> 

These reasoning might not be obvious to those who are new to
C preprocessor programming. Current style guide of kernel doesn't
look good enough, partly because of its intended audiences.

> Indentation is not perfect, but given the long names really cannot be
> improved all that much, if at all.
> 
> However, we do have a problem, namely the multiple uses of "o", which
> would be very bad if "o" was an expression with side-effects.

I didn't notice this point.

> 
> How about the following?
> 
> #define cmpxchg(ptr, o, n) \
> ({ \
> 	typeof(*ptr) _____old = (o); \
> 	typeof(*ptr) _____actual = _____old; \
> 	\
> 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
> 			? _____old : _____old + 1; \
> })
> 
> This still might have problems with signed integer overflow, but I am
> inclined to ignore that for the moment.

Behavior of overflow of signed integer is undefined in C standard, right?

>                                         Because paying attention to it
> results in something like this:
> 
> #define cmpxchg(ptr, o, n) \
> ({ \
> 	typeof(*ptr) _____old = (o); \
> 	typeof(*ptr) _____actual = _____old; \
> 	\
> 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) \
> 			? _____old \
> 			: _____old > 0 ? _____old - 1; : _____old + 1; \
> })
> 
> Thoughts?  Most especially, any better ideas?

Let me think this over.

BTW, the purpose of using the name "_____old" and the like may not
be obvious either.
If we use "old" instead, naming collision can happen if "old" is given
as an argument to this macro in its call sites. Am I guessing right?

        Thanks, Akira

> 
> 							Thanx, Paul
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-28 14:24                       ` Akira Yokosawa
@ 2018-10-28 16:43                         ` Paul E. McKenney
  2018-10-29 14:45                           ` Akira Yokosawa
  0 siblings, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2018-10-28 16:43 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Junchang Wang, perfbook

On Sun, Oct 28, 2018 at 11:24:41PM +0900, Akira Yokosawa wrote:
> On 2018/10/27 17:17:23 -0700, Paul E. McKenney wrote:
> > On Sat, Oct 27, 2018 at 11:56:54PM +0900, Akira Yokosawa wrote:
> >> On 2018/10/26 08:58:30 +0800, Junchang Wang wrote:
> >> [...]
> >>>
> >>> BTW, I found I'm not good in writing C macro (e.g., cmpxchg). Do you
> >>> know some specification/document on writing C macro functions in
> >>> Linux?
> >>
> >> Although I'm not qualified as a kernel developer,
> >> Linux kernel's "coding style" has a section on this. See:
> >>
> >> https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl
> >>
> >> In that regard, macros I added in commit b2acf6239a95
> >> ("count: Tweak counttorture.h to avoid segfault") do not meet
> >> the style guide in a couple of ways:
> >>
> >>     1) Inline functions are preferable to macros resembling functions
> >>     2) Macros with multiple statements should be enclosed in a do - while block
> >>     3) ...
> >>
> >> Any idea for improving them is more than welcome!
> > 
> > Let's see...
> > 
> > #define cmpxchg(ptr, o, n) \
> > ({ \
> > 	typeof(*ptr) _____actual = (o); \
> > 	\
> > 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> > 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> > })
> 
> Oh, my concern was macros I added in counttorture.h to support
> #ifndef KEEP_GCC_THREAD_LOCAL.
> 
> But those macros are used solely in the header file, so the current
> definition might be good enough.

These ones?

#define _wait_all_threads() { \
	while (READ_ONCE(finalthreadcount) < nthreadsexpected) \
		poll(NULL, 0, 1);}
#define _count_unregister_thread(n) count_unregister_thread(n + 1)
#define final_wait_all_threads() { \
	WRITE_ONCE(finalthreadcount, nthreadsexpected + 1); \
	wait_all_threads();}

The _count_unregister_thread() is fine.

The _wait_all_threads() and final_wait_all_threads() are fine given their
current usage.  One not-yet-needed way to future-proof them would be as
follows:

#define _wait_all_threads() \
do { \
	while (READ_ONCE(finalthreadcount) < nthreadsexpected) \
		poll(NULL, 0, 1); \
} while (0)
#define final_wait_all_threads() \
do { \
	WRITE_ONCE(finalthreadcount, nthreadsexpected + 1); \
	wait_all_threads(); \
} while (0)

Your next question is "why does this matter", to which I would point you
here: https://kernelnewbies.org/FAQ/DoWhile0  Mostly because I never
can remember all of the failure cases that led to the Linux-kernel
coding-style rules.  ;-)

> OTH, macros defined in api-gcc.h should be made as robust as possible.
> Hence your review of cmpxchg() is quite instructive to me.

Glad it helped!

> > We cannot do #1 because cmpxchg() is type-generic, and thus cannot be
> > implemented as a C function.  (C++ could use templates, but we are not
> > writing C++ here.)
> > 
> > We cannot do #2 because cmpxchg() must return a value.
> 
> These reasoning might not be obvious to those who are new to
> C preprocessor programming. Current style guide of kernel doesn't
> look good enough, partly because of its intended audiences.

To be fair, it doesn't get updated as often as it should.

> > Indentation is not perfect, but given the long names really cannot be
> > improved all that much, if at all.
> > 
> > However, we do have a problem, namely the multiple uses of "o", which
> > would be very bad if "o" was an expression with side-effects.
> 
> I didn't notice this point.

Neither did I earlier in this thread.  ;-)

> > How about the following?
> > 
> > #define cmpxchg(ptr, o, n) \
> > ({ \
> > 	typeof(*ptr) _____old = (o); \
> > 	typeof(*ptr) _____actual = _____old; \
> > 	\
> > 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> > 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
> > 			? _____old : _____old + 1; \
> > })
> > 
> > This still might have problems with signed integer overflow, but I am
> > inclined to ignore that for the moment.
> 
> Behavior of overflow of signed integer is undefined in C standard, right?

Exactly.  An alternative approach is to do as the Linux kernel does and
tell gcc to wrap signed integers on overflow, as the standard mandates
for unsigned integers.  My preference would be to avoid signed overflow.

> >                                         Because paying attention to it
> > results in something like this:
> > 
> > #define cmpxchg(ptr, o, n) \
> > ({ \
> > 	typeof(*ptr) _____old = (o); \
> > 	typeof(*ptr) _____actual = _____old; \
> > 	\
> > 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> > 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) \
> > 			? _____old \
> > 			: _____old > 0 ? _____old - 1; : _____old + 1; \
> > })
> > 
> > Thoughts?  Most especially, any better ideas?
> 
> Let me think this over.
> 
> BTW, the purpose of using the name "_____old" and the like may not
> be obvious either.
> If we use "old" instead, naming collision can happen if "old" is given
> as an argument to this macro in its call sites. Am I guessing right?

You got it exactly right!  The "old" in the argument is intended to
refer to something in an outer scope, but it would instead refer to
the macro's local variable.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-28 16:43                         ` Paul E. McKenney
@ 2018-10-29 14:45                           ` Akira Yokosawa
  2018-10-29 15:30                             ` Paul E. McKenney
  0 siblings, 1 reply; 25+ messages in thread
From: Akira Yokosawa @ 2018-10-29 14:45 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Junchang Wang, perfbook, Akira Yokosawa

On 2018/10/28 09:43:55 -0700, Paul E. McKenney wrote:
> On Sun, Oct 28, 2018 at 11:24:41PM +0900, Akira Yokosawa wrote:
>> On 2018/10/27 17:17:23 -0700, Paul E. McKenney wrote:
>>> On Sat, Oct 27, 2018 at 11:56:54PM +0900, Akira Yokosawa wrote:
>>>> On 2018/10/26 08:58:30 +0800, Junchang Wang wrote:
>>>> [...]
>>>>>
>>>>> BTW, I found I'm not good in writing C macro (e.g., cmpxchg). Do you
>>>>> know some specification/document on writing C macro functions in
>>>>> Linux?
>>>>
>>>> Although I'm not qualified as a kernel developer,
>>>> Linux kernel's "coding style" has a section on this. See:
>>>>
>>>> https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl
>>>>
>>>> In that regard, macros I added in commit b2acf6239a95
>>>> ("count: Tweak counttorture.h to avoid segfault") do not meet
>>>> the style guide in a couple of ways:
>>>>
>>>>     1) Inline functions are preferable to macros resembling functions
>>>>     2) Macros with multiple statements should be enclosed in a do - while block
>>>>     3) ...
>>>>
>>>> Any idea for improving them is more than welcome!
>>>
>>> Let's see...
>>>
>>> #define cmpxchg(ptr, o, n) \
>>> ({ \
>>> 	typeof(*ptr) _____actual = (o); \
>>> 	\
>>> 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>> 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
>>> })
>>
>> Oh, my concern was macros I added in counttorture.h to support
>> #ifndef KEEP_GCC_THREAD_LOCAL.
>>
>> But those macros are used solely in the header file, so the current
>> definition might be good enough.
> 
> These ones?
> 
> #define _wait_all_threads() { \
> 	while (READ_ONCE(finalthreadcount) < nthreadsexpected) \
> 		poll(NULL, 0, 1);}
> #define _count_unregister_thread(n) count_unregister_thread(n + 1)
> #define final_wait_all_threads() { \
> 	WRITE_ONCE(finalthreadcount, nthreadsexpected + 1); \
> 	wait_all_threads();}
> 
> The _count_unregister_thread() is fine.

Well, don't we need to protect "n"?  That is,

#define _count_unregister_thread(n) count_unregister_thread((n) + 1)

> 
> The _wait_all_threads() and final_wait_all_threads() are fine given their
> current usage.  One not-yet-needed way to future-proof them would be as
> follows:
> 
> #define _wait_all_threads() \
> do { \
> 	while (READ_ONCE(finalthreadcount) < nthreadsexpected) \
> 		poll(NULL, 0, 1); \
> } while (0)
> #define final_wait_all_threads() \
> do { \
> 	WRITE_ONCE(finalthreadcount, nthreadsexpected + 1); \
> 	wait_all_threads(); \
> } while (0)
> 
> Your next question is "why does this matter", to which I would point you
> here: https://kernelnewbies.org/FAQ/DoWhile0

This is exactly what I've wanted to know! 

>                                              Mostly because I never
> can remember all of the failure cases that led to the Linux-kernel
> coding-style rules.  ;-)
> 
>> OTH, macros defined in api-gcc.h should be made as robust as possible.
>> Hence your review of cmpxchg() is quite instructive to me.
> 
> Glad it helped!
> 
>>> We cannot do #1 because cmpxchg() is type-generic, and thus cannot be
>>> implemented as a C function.  (C++ could use templates, but we are not
>>> writing C++ here.)
>>>
>>> We cannot do #2 because cmpxchg() must return a value.
>>
>> These reasoning might not be obvious to those who are new to
>> C preprocessor programming. Current style guide of kernel doesn't
>> look good enough, partly because of its intended audiences.
> 
> To be fair, it doesn't get updated as often as it should.
> 
>>> Indentation is not perfect, but given the long names really cannot be
>>> improved all that much, if at all.
>>>
>>> However, we do have a problem, namely the multiple uses of "o", which
>>> would be very bad if "o" was an expression with side-effects.
>>
>> I didn't notice this point.
> 
> Neither did I earlier in this thread.  ;-)
> 
>>> How about the following?
>>>
>>> #define cmpxchg(ptr, o, n) \
>>> ({ \
>>> 	typeof(*ptr) _____old = (o); \
>>> 	typeof(*ptr) _____actual = _____old; \
>>> 	\
>>> 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>> 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
>>> 			? _____old : _____old + 1; \
>>> })
>>>
>>> This still might have problems with signed integer overflow, but I am
>>> inclined to ignore that for the moment.
>>
>> Behavior of overflow of signed integer is undefined in C standard, right?
> 
> Exactly.  An alternative approach is to do as the Linux kernel does and
> tell gcc to wrap signed integers on overflow, as the standard mandates
> for unsigned integers.

So, GCC provides such an option. Let's see...

Ah, include/linux/overflow.h has a variety of helper macros.
They are hard to grasp in a short while, though! 

>                         My preference would be to avoid signed overflow.
> 
>>>                                         Because paying attention to it
>>> results in something like this:
>>>
>>> #define cmpxchg(ptr, o, n) \
>>> ({ \
>>> 	typeof(*ptr) _____old = (o); \
>>> 	typeof(*ptr) _____actual = _____old; \
>>> 	\
>>> 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>> 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) \
>>> 			? _____old \
>>> 			: _____old > 0 ? _____old - 1; : _____old + 1; \
>>> })
>>>
>>> Thoughts?  Most especially, any better ideas?
>>
>> Let me think this over.
>>
>> BTW, the purpose of using the name "_____old" and the like may not
>> be obvious either.
>> If we use "old" instead, naming collision can happen if "old" is given
>> as an argument to this macro in its call sites. Am I guessing right?
> 
> You got it exactly right!  The "old" in the argument is intended to
> refer to something in an outer scope, but it would instead refer to
> the macro's local variable.

Which might not end up in compile error, and could be hard to track down.
Whew!

        Thanks, Akira

> 
> 							Thanx, Paul
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Possible BUG] count_lim_atomic.c fails on POWER8
  2018-10-29 14:45                           ` Akira Yokosawa
@ 2018-10-29 15:30                             ` Paul E. McKenney
  0 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2018-10-29 15:30 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Junchang Wang, perfbook

On Mon, Oct 29, 2018 at 11:45:15PM +0900, Akira Yokosawa wrote:
> On 2018/10/28 09:43:55 -0700, Paul E. McKenney wrote:
> > On Sun, Oct 28, 2018 at 11:24:41PM +0900, Akira Yokosawa wrote:
> >> On 2018/10/27 17:17:23 -0700, Paul E. McKenney wrote:
> >>> On Sat, Oct 27, 2018 at 11:56:54PM +0900, Akira Yokosawa wrote:
> >>>> On 2018/10/26 08:58:30 +0800, Junchang Wang wrote:
> >>>> [...]
> >>>>>
> >>>>> BTW, I found I'm not good in writing C macro (e.g., cmpxchg). Do you
> >>>>> know some specification/document on writing C macro functions in
> >>>>> Linux?
> >>>>
> >>>> Although I'm not qualified as a kernel developer,
> >>>> Linux kernel's "coding style" has a section on this. See:
> >>>>
> >>>> https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl
> >>>>
> >>>> In that regard, macros I added in commit b2acf6239a95
> >>>> ("count: Tweak counttorture.h to avoid segfault") do not meet
> >>>> the style guide in a couple of ways:
> >>>>
> >>>>     1) Inline functions are preferable to macros resembling functions
> >>>>     2) Macros with multiple statements should be enclosed in a do - while block
> >>>>     3) ...
> >>>>
> >>>> Any idea for improving them is more than welcome!
> >>>
> >>> Let's see...
> >>>
> >>> #define cmpxchg(ptr, o, n) \
> >>> ({ \
> >>> 	typeof(*ptr) _____actual = (o); \
> >>> 	\
> >>> 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >>> 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> >>> })
> >>
> >> Oh, my concern was macros I added in counttorture.h to support
> >> #ifndef KEEP_GCC_THREAD_LOCAL.
> >>
> >> But those macros are used solely in the header file, so the current
> >> definition might be good enough.
> > 
> > These ones?
> > 
> > #define _wait_all_threads() { \
> > 	while (READ_ONCE(finalthreadcount) < nthreadsexpected) \
> > 		poll(NULL, 0, 1);}
> > #define _count_unregister_thread(n) count_unregister_thread(n + 1)
> > #define final_wait_all_threads() { \
> > 	WRITE_ONCE(finalthreadcount, nthreadsexpected + 1); \
> > 	wait_all_threads();}
> > 
> > The _count_unregister_thread() is fine.
> 
> Well, don't we need to protect "n"?  That is,
> 
> #define _count_unregister_thread(n) count_unregister_thread((n) + 1)

Indeed we do!  Color me blind, and good catch!!!

> > The _wait_all_threads() and final_wait_all_threads() are fine given their
> > current usage.  One not-yet-needed way to future-proof them would be as
> > follows:
> > 
> > #define _wait_all_threads() \
> > do { \
> > 	while (READ_ONCE(finalthreadcount) < nthreadsexpected) \
> > 		poll(NULL, 0, 1); \
> > } while (0)
> > #define final_wait_all_threads() \
> > do { \
> > 	WRITE_ONCE(finalthreadcount, nthreadsexpected + 1); \
> > 	wait_all_threads(); \
> > } while (0)
> > 
> > Your next question is "why does this matter", to which I would point you
> > here: https://kernelnewbies.org/FAQ/DoWhile0
> 
> This is exactly what I've wanted to know! 
> 
> >                                              Mostly because I never
> > can remember all of the failure cases that led to the Linux-kernel
> > coding-style rules.  ;-)
> > 
> >> OTH, macros defined in api-gcc.h should be made as robust as possible.
> >> Hence your review of cmpxchg() is quite instructive to me.
> > 
> > Glad it helped!
> > 
> >>> We cannot do #1 because cmpxchg() is type-generic, and thus cannot be
> >>> implemented as a C function.  (C++ could use templates, but we are not
> >>> writing C++ here.)
> >>>
> >>> We cannot do #2 because cmpxchg() must return a value.
> >>
> >> These reasoning might not be obvious to those who are new to
> >> C preprocessor programming. Current style guide of kernel doesn't
> >> look good enough, partly because of its intended audiences.
> > 
> > To be fair, it doesn't get updated as often as it should.
> > 
> >>> Indentation is not perfect, but given the long names really cannot be
> >>> improved all that much, if at all.
> >>>
> >>> However, we do have a problem, namely the multiple uses of "o", which
> >>> would be very bad if "o" was an expression with side-effects.
> >>
> >> I didn't notice this point.
> > 
> > Neither did I earlier in this thread.  ;-)
> > 
> >>> How about the following?
> >>>
> >>> #define cmpxchg(ptr, o, n) \
> >>> ({ \
> >>> 	typeof(*ptr) _____old = (o); \
> >>> 	typeof(*ptr) _____actual = _____old; \
> >>> 	\
> >>> 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >>> 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
> >>> 			? _____old : _____old + 1; \
> >>> })
> >>>
> >>> This still might have problems with signed integer overflow, but I am
> >>> inclined to ignore that for the moment.
> >>
> >> Behavior of overflow of signed integer is undefined in C standard, right?
> > 
> > Exactly.  An alternative approach is to do as the Linux kernel does and
> > tell gcc to wrap signed integers on overflow, as the standard mandates
> > for unsigned integers.
> 
> So, GCC provides such an option. Let's see...
> 
> Ah, include/linux/overflow.h has a variety of helper macros.
> They are hard to grasp in a short while, though! 

Maybe check_add_overflow()?

> >                         My preference would be to avoid signed overflow.
> > 
> >>>                                         Because paying attention to it
> >>> results in something like this:
> >>>
> >>> #define cmpxchg(ptr, o, n) \
> >>> ({ \
> >>> 	typeof(*ptr) _____old = (o); \
> >>> 	typeof(*ptr) _____actual = _____old; \
> >>> 	\
> >>> 	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >>> 			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) \
> >>> 			? _____old \
> >>> 			: _____old > 0 ? _____old - 1; : _____old + 1; \
> >>> })
> >>>
> >>> Thoughts?  Most especially, any better ideas?
> >>
> >> Let me think this over.
> >>
> >> BTW, the purpose of using the name "_____old" and the like may not
> >> be obvious either.
> >> If we use "old" instead, naming collision can happen if "old" is given
> >> as an argument to this macro in its call sites. Am I guessing right?
> > 
> > You got it exactly right!  The "old" in the argument is intended to
> > refer to something in an outer scope, but it would instead refer to
> > the macro's local variable.
> 
> Which might not end up in compile error, and could be hard to track down.
> Whew!

It can be quite nasty!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2018-10-30  0:19 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-20 15:53 [Possible BUG] count_lim_atomic.c fails on POWER8 Akira Yokosawa
2018-10-20 16:36 ` Paul E. McKenney
2018-10-24 15:53   ` Junchang Wang
2018-10-24 22:05     ` Akira Yokosawa
2018-10-24 22:29       ` Akira Yokosawa
2018-10-25  2:18         ` Junchang Wang
2018-10-25  2:11       ` Junchang Wang
2018-10-25  9:45         ` Paul E. McKenney
2018-10-25 12:23           ` Akira Yokosawa
2018-10-25 14:09           ` Junchang Wang
2018-10-25 15:17             ` Akira Yokosawa
2018-10-25 22:04               ` Akira Yokosawa
2018-10-26  0:58                 ` Junchang Wang
2018-10-27 14:56                   ` Akira Yokosawa
     [not found]                     ` <20181028001723.GJ4170@linux.ibm.com>
2018-10-28 12:08                       ` Junchang Wang
2018-10-28 13:19                         ` Paul E. McKenney
2018-10-28 13:22                         ` Akira Yokosawa
2018-10-28 14:24                       ` Akira Yokosawa
2018-10-28 16:43                         ` Paul E. McKenney
2018-10-29 14:45                           ` Akira Yokosawa
2018-10-29 15:30                             ` Paul E. McKenney
2018-10-26  1:12                 ` Akira Yokosawa
2018-10-26 11:34                   ` Akira Yokosawa
2018-10-26 16:06                     ` Junchang Wang
2018-10-25 15:24             ` Paul E. McKenney

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.