* GCC built-in atomic operations and memory barriers
@ 2009-11-04 18:09 Toby Douglass
2009-11-04 19:05 ` Russell King - ARM Linux
0 siblings, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-04 18:09 UTC (permalink / raw)
To: linux-arm-kernel
Hi -
My first post.
I have written a (currently small) library of lock-free data structures
(www.liblfds.org).
I am porting to ARM.
The port is complete except for CAS.
I initially tried to use the GCC built-in CAS; in my test set, the third
test fails with a double free, because the head element of the freelist
being tested has come to point to itself.
So I figured I'd need to write my own CAS. I learned enough ARM
assembly and did enough Googling to put something together. I've done
the same already for x86 (I needed cmpxchg16b, so I couldn't use the GCC
built-ins) so I have a tiny bit of experience in it; and my attempt
fails in exactly the same way as the GCC built-in.
So I go to bed. Next day I think, ah, memory barriers! you don't need
to specify them on x86 for atomics, but I bet you do on ARM; and indeed,
you do.
I then discover in this mailing list an interesting thread, dated May
2009, which states that the GCC built-ins for ARM do not have memory
barriers. The kernel I'm using is 2.6.28, which was released Christmas
Day 2008.
So I think the lack of memory barriers may well be why the code is failing.
This leads me to want to use smp_mb(). However, from what I can see,
this macro is only available via the linux kernel headers; it's not
available in user-mode. Is this correct?
^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers
2009-11-04 18:09 GCC built-in atomic operations and memory barriers Toby Douglass
@ 2009-11-04 19:05 ` Russell King - ARM Linux
2009-11-04 20:12 ` Toby Douglass
2009-11-04 22:09 ` Gilles Chanteperdrix
0 siblings, 2 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-04 19:05 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Nov 04, 2009 at 07:09:37PM +0100, Toby Douglass wrote:
> This leads me to want to use smp_mb(). However, from what I can see,
> this macro is only available via the linux kernel headers; it's not
> available in user-mode. Is this correct?
Correct.
And please understand that just because something exists in kernel land
does not give userland a right to make use of it. (There are people
who believe it does.)
The kernel uses header files as a way to efficiently provide itself with
architecture specific optimized helpers, and these helpers may well only
work in kernel land. Eg, traditionally, atomic operations on ARM
required interrupts to be disabled across them. Userspace can not
disable interrupts, and so use of those functions in userland leads to
them becoming inherently non-atomic.
The kernel API is the syscall interface and associated data structures.
It does not include everything you can find in kernel headers.
^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers
2009-11-04 19:05 ` Russell King - ARM Linux
@ 2009-11-04 20:12 ` Toby Douglass
2009-11-04 21:03 ` Russell King - ARM Linux
2009-11-04 22:09 ` Gilles Chanteperdrix
1 sibling, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-04 20:12 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> On Wed, Nov 04, 2009 at 07:09:37PM +0100, Toby Douglass wrote:
>> This leads me to want to use smp_mb(). However, from what I can see,
>> this macro is only available via the linux kernel headers; it's not
>> available in user-mode. Is this correct?
>
> Correct.
Thanks. It's often hard on the net to track down a negative answer.
[snip]
While we're talking about the GCC atomics...
This appears to be the current code for CAS;
349 static inline unsigned long __cmpxchg(volatile void *ptr,
unsigned long old, unsigned long new, int size)
352 unsigned long oldval, res;
[snip]
382 do {
383 asm volatile("@ __cmpxchg4\n"
384 " ldrex %1, [%2]\n"
385 " mov %0, #0\n"
386 " teq %1, %3\n"
387 " strexeq %0, %4, [%2]\n"
388 : "=&r" (res), "=&r" (oldval)
389 : "r" (ptr), "Ir" (old), "r" (new)
390 : "memory", "cc");
391 } while (res);
The "mov %0, #0" - why is this inbetween the ldrex and strexeq? it
seems to me it could just as well happen before the ldrex, and doing so
would reduce the time between the ldrex and strexeq and so reduce the
chance of someone else modifying our target.
^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers
2009-11-04 20:12 ` Toby Douglass
@ 2009-11-04 21:03 ` Russell King - ARM Linux
2009-11-06 19:10 ` Toby Douglass
0 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-04 21:03 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Nov 04, 2009 at 09:12:10PM +0100, Toby Douglass wrote:
> 382 do {
> 383 asm volatile("@ __cmpxchg4\n"
> 384 " ldrex %1, [%2]\n"
> 385 " mov %0, #0\n"
> 386 " teq %1, %3\n"
> 387 " strexeq %0, %4, [%2]\n"
> 388 : "=&r" (res), "=&r" (oldval)
> 389 : "r" (ptr), "Ir" (old), "r" (new)
> 390 : "memory", "cc");
> 391 } while (res);
>
> The "mov %0, #0" - why is this inbetween the ldrex and strexeq? it
> seems to me it could just as well happen before the ldrex, and doing so
> would reduce the time between the ldrex and strexeq and so reduce the
> chance of someone else modifying our target.
Because it probably doesn't. Loads normally have a 'result delay' which
cause a pipeline stall when the result is used in the next instruction.
It makes sense to fill those stall cycles with useful work, rather than
stalling the execution pipeline.
^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers
2009-11-04 19:05 ` Russell King - ARM Linux
2009-11-04 20:12 ` Toby Douglass
@ 2009-11-04 22:09 ` Gilles Chanteperdrix
2009-11-06 19:17 ` Toby Douglass
2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass
1 sibling, 2 replies; 37+ messages in thread
From: Gilles Chanteperdrix @ 2009-11-04 22:09 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> The kernel API is the syscall interface and associated data structures.
> It does not include everything you can find in kernel headers.
Are the functions found in the vectors page part of the kernel API?
Because if that is so, then they provide a way to implement cmpxchg and
barriers in user-space.
--
Gilles.
^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers
2009-11-04 21:03 ` Russell King - ARM Linux
@ 2009-11-06 19:10 ` Toby Douglass
0 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-06 19:10 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> On Wed, Nov 04, 2009 at 09:12:10PM +0100, Toby Douglass wrote:
[snip]
>> The "mov %0, #0" - why is this inbetween the ldrex and strexeq? it
>> seems to me it could just as well happen before the ldrex, and doing so
>> would reduce the time between the ldrex and strexeq and so reduce the
>> chance of someone else modifying our target.
>
> Because it probably doesn't. Loads normally have a 'result delay' which
> cause a pipeline stall when the result is used in the next instruction.
> It makes sense to fill those stall cycles with useful work, rather than
> stalling the execution pipeline.
Thanks, Russell. I was concerned there was something functionally
important I'd missed with regard to the atomic op.
^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers
2009-11-04 22:09 ` Gilles Chanteperdrix
@ 2009-11-06 19:17 ` Toby Douglass
2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass
1 sibling, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-06 19:17 UTC (permalink / raw)
To: linux-arm-kernel
Gilles Chanteperdrix wrote:
> Russell King - ARM Linux wrote:
>> The kernel API is the syscall interface and associated data structures.
>> It does not include everything you can find in kernel headers.
>
> Are the functions found in the vectors page part of the kernel API?
> Because if that is so, then they provide a way to implement cmpxchg and
> barriers in user-space.
GCC (I had overlooked this) has a function, __sync_synchronize(), which
provides a full memory barrier. I do not know how it is implemented on
ARM, but the CAS I wrote certainly would not work (and did not work)
without a memory barrier and with this function inserted, does work.
The GCC atomic CAS for ARM appears to call the OS's compare-exchange
function, which certainly is broken, by lacking memory barriers, on the
Linux I am developing on.
Either way, I now have my own platform independent ARM __asm__ for LL/CS
CAS, with memory barriers, so I'm happy. Also, I got my OpenGL wrapper
DLL working last night :-)
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-04 22:09 ` Gilles Chanteperdrix
2009-11-06 19:17 ` Toby Douglass
@ 2009-11-21 15:21 ` Toby Douglass
2009-11-23 15:08 ` Russell King - ARM Linux
` (3 more replies)
1 sibling, 4 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-21 15:21 UTC (permalink / raw)
To: linux-arm-kernel
Hi all -
I may be *utterly* wrong, and I'm expecting that someone here is simply
going to look at what I'm about to say and explain to me what I've
misunderstood, but I suspect it may be that atomic compare-and-swap is
incorrectly implemented in the Linux kernel on ARM v6 and above (e.g.
using ldrex and strex).
I'm looking at this code, for 2.6.31, which I believe is the latest
released version;
http://lxr.linux.no/#linux+v2.6.31/arch/arm/include/asm/system.h
382 do {
383 asm volatile("@ __cmpxchg4\n"
384 " ldrex %1, [%2]\n"
385 " mov %0, #0\n"
386 " teq %1, %3\n"
387 " strexeq %0, %4, [%2]\n"
388 : "=&r" (res), "=&r" (oldval)
389 : "r" (ptr), "Ir" (old), "r" (new)
390 : "memory", "cc");
391 } while (res);
This code is for a four byte CAS; the issue I'm about to describe exists
similarly for the two byte and one byte CASs.
The issue is the ABA problem.
Imagine you have a populated stack. Two threads come to pop. Both read
the head pointer and the next pointer of the top element and both
simultaneously come to the point where the are about to CAS, comparing
the head pointer with the top element pointer and if they match,
replacing the head pointer with the next pointer of the top element - so
by this, I mean both threads have just got into the do-while loop above,
but have not yet executed ldrex.
Now, one thread pauses for a long time.
The other thread pops. Then a bunch of the other threads pop and push
and the stack is *totally* different. Then our initial thread which
popped *pushes* that original element back.
Finally, our second thread kicks into life. Remember that his exchange
pointer is now utterly incorrect, because the stack has completely
changed. Now, he's just inside the do-while loop. He checks the top
pointer - and lo and behold, it matches the head element. However,
because someone else has messed with the top pointer in the meantime,
the CAS fails - strex refuses to swap, it can see from the shared TLB
attribute that someone else has touched destination (the top pointer).
So we dodge ABA *there*.
The problem is *we then come round in the do-while loop again*. We have
*not* updated our exchange value. So THIS second time around, we
*repeat* our strex and we DO swap - and we just swapped in completely
the wrong next pointer, from way back before the stack was totally
changed by all the other threads popping and pushing.
So that's the problem.
A common way to solve ABA on contiguous double-word compare-and-swap
architectures (e.g. x86, x64 and Itanium) is to use a pointer-counter
pair. This permits the caller to ensure swaps fail, even if the
pointers all match, because the counters don't.
Load-linked/conditional-store architectures solve ABA by having the
store fail if the destination has been touched since the load was performed.
Currently, the code appears to violate this, by repeating the CAS
*anyway*. In fact, the appropriate behaviour would seem to be *not* to
loop, but rather, to issue the ldrex/strex *once*, and indicate to the
user if the store succeed or failed.
This requires a prototype change, because the return value is the
original value of the destination and so is unable to indicate, when
returning that value, if it is returned from a successful or
unsuccessful swap.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass
@ 2009-11-23 15:08 ` Russell King - ARM Linux
2009-11-23 19:10 ` Toby Douglass
2009-11-23 15:13 ` Catalin Marinas
` (2 subsequent siblings)
3 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-23 15:08 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, Nov 21, 2009 at 04:21:00PM +0100, Toby Douglass wrote:
> 382 do {
> 383 asm volatile("@ __cmpxchg4\n"
> 384 " ldrex %1, [%2]\n"
> 385 " mov %0, #0\n"
> 386 " teq %1, %3\n"
> 387 " strexeq %0, %4, [%2]\n"
> 388 : "=&r" (res), "=&r" (oldval)
> 389 : "r" (ptr), "Ir" (old), "r" (new)
> 390 : "memory", "cc");
> 391 } while (res);
>
> The problem is *we then come round in the do-while loop again*. We have
> *not* updated our exchange value. So THIS second time around, we
> *repeat* our strex and we DO swap - and we just swapped in completely
> the wrong next pointer, from way back before the stack was totally
> changed by all the other threads popping and pushing.
First time around the loop, lets say %3 = 1 *(u32 *)%2 = 1.
ldrex %1, [%2]
%1 = *(u32 *)%2 (= 1)
mov %0, #0
%0 = 0
teq %1, %3
%3 == %1? (yes)
strexeq %0, %4, [%2]
executed but because of the other access,
exclusivity fails. *(u32 *)%2 not written
and %0 = 1
So, res = 1, and we go around the loop again. Lets say that *(u32 *)%2 = 2
now.
ldrex %1, [%2]
%1 = *(u32 *)%2 (= 2)
mov %0, #0
%0 = 0
teq %1, %3
%3 == %1? (no)
strexeq %0, %4, [%2]
not executed at all, %0 and *(u32 *)%2 untouched
So, res = 0 and we do _not_ repeat the loop and return "cmpxchg" failure.
I haven't had time to read all your email properly (due to the need to
get on a conference call), but please tell me where the problem is above
(using a similar worked example).
Thanks.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass
2009-11-23 15:08 ` Russell King - ARM Linux
@ 2009-11-23 15:13 ` Catalin Marinas
2009-11-24 15:15 ` Toby Douglass
2009-11-24 15:33 ` Toby Douglass
2009-11-23 15:34 ` Catalin Marinas
2009-11-23 22:28 ` Jamie Lokier
3 siblings, 2 replies; 37+ messages in thread
From: Catalin Marinas @ 2009-11-23 15:13 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, 2009-11-21 at 15:21 +0000, Toby Douglass wrote:
> I may be *utterly* wrong, and I'm expecting that someone here is simply
> going to look at what I'm about to say and explain to me what I've
> misunderstood, but I suspect it may be that atomic compare-and-swap is
> incorrectly implemented in the Linux kernel on ARM v6 and above (e.g.
> using ldrex and strex).
Well, you have to define "correctly" or you may just need a different
function for what you want. In this implementation, CAS is expected to
succeed if the the old value matches the memory one. The loop is present
to always guarantee that it will succeed if the condition matches.
Note that the exclusive monitor state is cleared (and hence STREX fails)
a every context switch or a return from an exception, i.e. interrupt).
>
> The issue is the ABA problem.
[...]
> The problem is *we then come round in the do-while loop again*. We have
> *not* updated our exchange value. So THIS second time around, we
> *repeat* our strex and we DO swap - and we just swapped in completely
> the wrong next pointer, from way back before the stack was totally
> changed by all the other threads popping and pushing.
But the first thread (paused) one may may actually wait before the LDREX
(the old value was loaded with an LDR), so the LDREX/STREX pair would
succeed anyway. That's not really a solution unless you also use LDREX
for loading the old value to be passed to CAS. IOW, you need your own
implementation of what you are trying to achieve (and not modifying
CAS).
--
Catalin
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass
2009-11-23 15:08 ` Russell King - ARM Linux
2009-11-23 15:13 ` Catalin Marinas
@ 2009-11-23 15:34 ` Catalin Marinas
2009-11-23 16:40 ` Toby Douglass
2009-11-23 22:28 ` Jamie Lokier
3 siblings, 1 reply; 37+ messages in thread
From: Catalin Marinas @ 2009-11-23 15:34 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, 2009-11-21 at 15:21 +0000, Toby Douglass wrote:
> Load-linked/conditional-store architectures solve ABA by having the
> store fail if the destination has been touched since the load was performed.
So, for your problem, CAS wouldn't solve it on any architecture, not
just ARM if it is implemented with LL/SC.
Something like below may work (untested and haven't put much thought
into it). Basically you need atomicity between the moment you first get
the head pointer and the moment you update it. CAS doesn't ensure this
since the first load of the [head] pointer would be standard LDR (even
if it is LDREX, you get another LDREX which resets the link with the
STREX):
1: LDREX orig_head, [head]
LDR orig_next, [orig_head, #next]
STREX res, orig_next, [head]
cmp res, #0
bne 1b
--
Catalin
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 15:34 ` Catalin Marinas
@ 2009-11-23 16:40 ` Toby Douglass
0 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-23 16:40 UTC (permalink / raw)
To: linux-arm-kernel
Catalin, Russell - thankyou for your replies. I'm looking at them now;
some thought is required :-)
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 15:08 ` Russell King - ARM Linux
@ 2009-11-23 19:10 ` Toby Douglass
2009-11-23 20:06 ` Russell King - ARM Linux
0 siblings, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-23 19:10 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> First time around the loop, lets say %3 = 1 *(u32 *)%2 = 1.
>
> ldrex %1, [%2]
> %1 = *(u32 *)%2 (= 1)
> mov %0, #0
> %0 = 0
> teq %1, %3
> %3 == %1? (yes)
> strexeq %0, %4, [%2]
> executed but because of the other access,
> exclusivity fails. *(u32 *)%2 not written
> and %0 = 1
>
> So, res = 1, and we go around the loop again. Lets say that *(u32 *)%2 = 2
> now.
No - we're dealing with the ABA problem. We're assuming here that this
thread gets to retry with the same values.
> I haven't had time to read all your email properly (due to the need to
> get on a conference call), but please tell me where the problem is above
> (using a similar worked example).
So; we go around again, load %2, do the teq, which succeeds, then the
strexeq, which now succeeds since no-one else has touched %2.
This was the thrust of the original post; however, Catalin has raised
arguments against it which I have not yet digested, so what I'm writing
here, where it is simply an enlargement on the OP, has the same flaws;
it's only in response to your specific point. I'm not trying to assert
this *is* what happens, in spite of what Catalin has written.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 19:10 ` Toby Douglass
@ 2009-11-23 20:06 ` Russell King - ARM Linux
2009-11-23 20:34 ` Toby Douglass
0 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-23 20:06 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Nov 23, 2009 at 08:10:51PM +0100, Toby Douglass wrote:
> Russell King - ARM Linux wrote:
>> First time around the loop, lets say %3 = 1 *(u32 *)%2 = 1.
>>
>> ldrex %1, [%2]
>> %1 = *(u32 *)%2 (= 1)
>> mov %0, #0
>> %0 = 0
>> teq %1, %3
>> %3 == %1? (yes)
>> strexeq %0, %4, [%2]
>> executed but because of the other access,
>> exclusivity fails. *(u32 *)%2 not written
>> and %0 = 1
>>
>> So, res = 1, and we go around the loop again. Lets say that *(u32 *)%2 = 2
>> now.
>
> No - we're dealing with the ABA problem. We're assuming here that this
> thread gets to retry with the same values.
>
>> I haven't had time to read all your email properly (due to the need to
>> get on a conference call), but please tell me where the problem is above
>> (using a similar worked example).
>
> So; we go around again, load %2, do the teq, which succeeds, then the
> strexeq, which now succeeds since no-one else has touched %2.
>
> This was the thrust of the original post; however, Catalin has raised
> arguments against it which I have not yet digested, so what I'm writing
> here, where it is simply an enlargement on the OP, has the same flaws;
> it's only in response to your specific point. I'm not trying to assert
> this *is* what happens, in spite of what Catalin has written.
Well, I've thought it through quite a bit now, and I have an expansive
reply to your email. In summary, there is nothing wrong with the
existing code; your use of it is the problem.
I can post the expansive reply if you need the details.
In short, consider what happens if you consider a slightly different order
of operations, where you have calculated 'ptr', 'old' and 'new' for cmpxchg
but you haven't executed the first ldrex.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 20:06 ` Russell King - ARM Linux
@ 2009-11-23 20:34 ` Toby Douglass
0 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-23 20:34 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> On Mon, Nov 23, 2009 at 08:10:51PM +0100, Toby Douglass wrote:
>> This was the thrust of the original post; however, Catalin has raised
>> arguments against it which I have not yet digested, so what I'm writing
>> here, where it is simply an enlargement on the OP, has the same flaws;
>> it's only in response to your specific point. I'm not trying to assert
>> this *is* what happens, in spite of what Catalin has written.
>
> Well, I've thought it through quite a bit now, and I have an expansive
> reply to your email. In summary, there is nothing wrong with the
> existing code; your use of it is the problem.
>
> I can post the expansive reply if you need the details.
I would be very grateful if you could - although if it is extensive and
well-understood, you might want to spare the list and email me directly.
I've been working every hour there is for the last three weekends trying
to get CAS working properly on ARM under Linux; any fresh insight or
knowledge from others may give me what I need to work out what I've
mis-understood (before I go insane :-)
> In short, consider what happens if you consider a slightly different order
> of operations, where you have calculated 'ptr', 'old' and 'new' for cmpxchg
> but you haven't executed the first ldrex.
This is, I think, what Catalin said - I'm grappling with this at the
moment; I don't have a reply to it yet, positive or negative.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass
` (2 preceding siblings ...)
2009-11-23 15:34 ` Catalin Marinas
@ 2009-11-23 22:28 ` Jamie Lokier
2009-11-23 23:13 ` Russell King - ARM Linux
` (2 more replies)
3 siblings, 3 replies; 37+ messages in thread
From: Jamie Lokier @ 2009-11-23 22:28 UTC (permalink / raw)
To: linux-arm-kernel
Toby Douglass wrote:
> Load-linked/conditional-store architectures solve ABA by having the
> store fail if the destination has been touched since the load was performed.
>
> Currently, the code appears to violate this, by repeating the CAS
> *anyway*. In fact, the appropriate behaviour would seem to be *not* to
> loop, but rather, to issue the ldrex/strex *once*, and indicate to the
> user if the store succeed or failed.
I believe Catalin's explained why it does not work even doing
LDREX/STREX once, because the thread can pause before the LDREX. So
you must begin fetching pointers after the LDREX.
(At least I think so. I'm prepared to be shown to be wrong :-)
If you're writing code intended for other LL/SC architectures too, and
following Catalin's suggestion to put LDR between LDREX and STREX,
then you might have to check if the other architectures permit loads
between the LL and SC.
> This requires a prototype change, because the return value is the
> original value of the destination and so is unable to indicate, when
> returning that value, if it is returned from a successful or
> unsuccessful swap.
Nonetheless, such a prototype change might be an improvement anyway.
Some platforms provide compare_and_swap_bool() operations, which do as
you describe: Compare, conditionally store, and return bool to indicate
success. No loop.
That could be an improvement for some algorithms, because often if the
store doesn't happen, the *inputs* to compare_and_swap() need to be
recalculated anyway before another try is likely to succeed. What's
the point in having code which expands to two loops:
do {
old = get_something;
new = calc_something;
/* oldval = compare_and_swap(ptr, old, new); */
do {
__asm__("LL/SC" : (failed), (oldval) : (ptr), (old), (new));
} while (failed && oldval == old);
} while (oldval != old);
When it can often be a smaller loop, which probably executes a little
faster too in various cases:
do {
old = get_something;
new = calc_something;
/* oldval = compare_and_swap_bool(ptr, old, new); */
__asm__("LL/SC" : (failed), (oldval) : (ptr), (old), (new));
} while (failed);
-- Jamie
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 22:28 ` Jamie Lokier
@ 2009-11-23 23:13 ` Russell King - ARM Linux
2009-11-24 1:32 ` Jamie Lokier
2009-11-24 9:38 ` Toby Douglass
2009-11-24 15:59 ` Catalin Marinas
2009-11-24 16:34 ` Toby Douglass
2 siblings, 2 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-23 23:13 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Nov 23, 2009 at 10:28:30PM +0000, Jamie Lokier wrote:
> That could be an improvement for some algorithms, because often if the
> store doesn't happen, the *inputs* to compare_and_swap() need to be
> recalculated anyway before another try is likely to succeed. What's
> the point in having code which expands to two loops:
The point is that often its cheaper to retry the LL/SC than it is to
do the recalculation.
However, I don't think you've understood the original problem at all.
The issue is that for a particular 32-bit word, the behaviour required
is:
- if no change occurs between the _original_ read and the atomic update,
then go ahead with the update
- if any change has occured, do not update, but go back and recalculate
from the beginning.
This is because, even if the value at the pointer is restored back to
the numerically identical value it had at the start, the requirement is
for the store to fail.
No amount of loops (or lack of loops) solves that, even if you have a
proper single atomic 32-bit CAS instruction.
You need to be far more clever, and there's some hints on wikipedia
about possible solutions. They assume that you can CAS more bits than
your intended value though, which we don't have.
I believe there is a solution to the ABA problem using atomic operations
with one 32-bit word and a separate counter to identify updates, but it
requires a little more time than I have available this evening to fully
put together.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 23:13 ` Russell King - ARM Linux
@ 2009-11-24 1:32 ` Jamie Lokier
2009-11-24 11:19 ` Catalin Marinas
2009-11-24 9:38 ` Toby Douglass
1 sibling, 1 reply; 37+ messages in thread
From: Jamie Lokier @ 2009-11-24 1:32 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> On Mon, Nov 23, 2009 at 10:28:30PM +0000, Jamie Lokier wrote:
> > That could be an improvement for some algorithms, because often if the
> > store doesn't happen, the *inputs* to compare_and_swap() need to be
> > recalculated anyway before another try is likely to succeed. What's
> > the point in having code which expands to two loops:
>
> The point is that often its cheaper to retry the LL/SC than it is to
> do the recalculation.
>
> However, I don't think you've understood the original problem at all.
I think I have - I agreed with you and Catalin already that LL/SC does
not suffice. But do you mean that Catalin's suggestion to put the
LDREX before the LDR does not work either? (Maybe it needs a barrier
too).
It probably looks like I didn't understand because I've mixed two
quite different issues in the same mail (same in this one), because I
think the CAS_BOOL has merit. Having implemented both variants in
userspace, it's actually quite annoying to use CAS and have to
remember the input and compare it with the output sometimes.
For some algorithms, you're right that it can be cheaper to retry the
LL/SC. But for some algorithms you know the retry will never succeed
and is a waste of time and code space, e.g. ones which do CAS on a
counter which is always incremented. It makes the caller a bit more
long-winded too.
The final argument is: you can implement CAS in terms of CAS_BOOL, but
you can't do it the other way. Because everyone copied x86's CAS at
the begining, that's the one we got.
-- Jamie
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 23:13 ` Russell King - ARM Linux
2009-11-24 1:32 ` Jamie Lokier
@ 2009-11-24 9:38 ` Toby Douglass
1 sibling, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 9:38 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
[snip]
No comments on everything else and the other posts; still digesting them.
> I believe there is a solution to the ABA problem using atomic operations
> with one 32-bit word and a separate counter to identify updates, but it
> requires a little more time than I have available this evening to fully
> put together.
I've implemented the pointer-counter solution on x86/x64. Indeed, it
turns out you can use them on ARM, since there is a double-word ldrex.
However, and this is part of what I'm currently thinking about, I was
under the impression LL/SC meant *single word CAS* could work without a
counter.
I think you'd never implement pointer-counter in the kernel, as your
basic CAS, because plain CAS is all you need for certain things and the
user can easily implement pointer-counter on top of plain CAS.
OTOH, perhaps it is nice to offer a plain CAS API and an ABA-safe CAS API.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 1:32 ` Jamie Lokier
@ 2009-11-24 11:19 ` Catalin Marinas
2009-11-24 22:24 ` Toby Douglass
2009-11-24 22:34 ` Toby Douglass
0 siblings, 2 replies; 37+ messages in thread
From: Catalin Marinas @ 2009-11-24 11:19 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 2009-11-24 at 01:32 +0000, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > On Mon, Nov 23, 2009 at 10:28:30PM +0000, Jamie Lokier wrote:
> > > That could be an improvement for some algorithms, because often if the
> > > store doesn't happen, the *inputs* to compare_and_swap() need to be
> > > recalculated anyway before another try is likely to succeed. What's
> > > the point in having code which expands to two loops:
> >
> > The point is that often its cheaper to retry the LL/SC than it is to
> > do the recalculation.
> >
> > However, I don't think you've understood the original problem at all.
>
> I think I have - I agreed with you and Catalin already that LL/SC does
> not suffice. But do you mean that Catalin's suggestion to put the
> LDREX before the LDR does not work either? (Maybe it needs a barrier
> too).
It definitely needs a barrier after the LDREX and maybe one after STREX
but that depends on the semantics of such operation.
--
Catalin
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 15:13 ` Catalin Marinas
@ 2009-11-24 15:15 ` Toby Douglass
2009-11-24 15:36 ` Russell King - ARM Linux
2009-11-25 1:24 ` Jamie Lokier
2009-11-24 15:33 ` Toby Douglass
1 sibling, 2 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 15:15 UTC (permalink / raw)
To: linux-arm-kernel
Catalin Marinas wrote:
> On Sat, 2009-11-21 at 15:21 +0000, Toby Douglass wrote:
>> I may be *utterly* wrong, and I'm expecting that someone here is simply
>> going to look at what I'm about to say and explain to me what I've
>> misunderstood, but I suspect it may be that atomic compare-and-swap is
>> incorrectly implemented in the Linux kernel on ARM v6 and above (e.g.
>> using ldrex and strex).
>
> Well, you have to define "correctly" or you may just need a different
> function for what you want.
Yeeessss-s-s-s-s-s...
But you can have a situation where you can choose between two
implementations and although both could be 'correct', one offers more
functionality than the other - and so is best choice.
I'm coming at this from the POV of implementing lock-free data
structures. Essentially, "plain CAS" (as on SPARC, which has neither
LL/SC nor contiguous double word length CAS) makes implementation
problematic because you have no straightforward solution to the ABA problem.
Contiguous double word length CAS permits a pointer-counter pair, which
solves ABA. LL/SC solves ABA by failing the store if the target has
been modified since load (I didn't properly realise how this could be
used to solve ABA until later in this post; there is, I think, a way - I
describe it later).
Now, it is possible to implement CAS using LL/SC such that it behaves
exactly like plain CAS - a la SPARC. I believe this is what exists now.
However, I think it may also possible to implement using LL/SC such that
ABA is avoided. If a platform implements CAS using LL/SC *only* a plain
CAS, then it makes life exceedingly difficult for lock-free data
structures; and it does so by *not* implementing functionality which
exists in the CPU which solves ABA!
> In this implementation, CAS is expected to
> succeed if the the old value matches the memory one. The loop is present
> to always guarantee that it will succeed if the condition matches.
Yes. This is a "plain CAS" implementation.
Interestingly, I saw a day or two ago that there is a double-word
version of LDREX. The atomic-ptr project, which forms the basis of the
GCC garbage collector, relies on this; it does not in fact use LL/SC on
ARM, but rather uses double-word CAS using a pointer-counter pair.
> Note that the exclusive monitor state is cleared (and hence STREX fails)
> a every context switch or a return from an exception, i.e. interrupt).
Good. I expected this to be so and it is important to me that it is;
for by being so, it permits the use of these lock-free data structures
in interrupt handlers.
>> The issue is the ABA problem.
>> The problem is *we then come round in the do-while loop again*. We have
>> *not* updated our exchange value. So THIS second time around, we
>> *repeat* our strex and we DO swap - and we just swapped in completely
>> the wrong next pointer, from way back before the stack was totally
>> changed by all the other threads popping and pushing.
>
> But the first thread (paused) one may may actually wait before the LDREX
> (the old value was loaded with an LDR), so the LDREX/STREX pair would
> succeed anyway.
I can see my original description of the problem contains an error! I
described it first as "both threads pause before LDREX" and then later I
wrote as if both had in fact executed LDREX.
In fact, what you describe here in the paragraph above is what I
described in the OP as occurring in the second iteration of the while
loop (when in fact it can just as well occur in the first iteration, as
you have here described); namely, you're experiencing ABA - your
exchange pointer is wrong, but you don't realise it, because the
destination matches your compare.
So, now I'm puzzled. I was under the impression it was possible to use
LL/SC to solve ABA.
Ah - wait - perhaps I see how. The problem is that the exchange pointer
is loaded before we become able to see that the destination has changed.
In fact, what's needed here (in this stack example) is that you *load
the exchange pointer inside the ldrex/strex pair*.
> That's not really a solution unless you also use LDREX
> for loading the old value to be passed to CAS.
Man, you're just way ahead of me :-) the coffee at ARM must be *really*
good :-)
But I think you don't need to use LDREX; it's enough just to LDR inside
the LDREX/STREX pair used on destination. (This would mean in the CAS
prototype taking a pointer to the exchange value).
This would mean that if the destination was changed by someone else,
you'd never use the exchange value; but if it was not, you'd be using
the current value.
However, this assumes the caller has a linkage between the exchange
value and the destination; e.g. that if the destination is modified by
someone else, the exchange must now be wrong.
This I would expect is not always the case and such an implementation
would no longer be lowest common denominator.
But I wonder...
Imagine we had a CAS where the compare and exchange are passed in as
pointers and we load what they point to after the LDREX on destination.
What do we have now?
We have a CAS where the exchange and compare are guaranteed to be only
those which exist continuously and without modification between the
loading, compare and potential set of the destination.
It is perhaps offering stronger guarantees than necessary in all cases -
but it actually sounds like a *real* CAS; it says that when we enter the
CAS, we WILL have the values for compare and exchange which are correct
at that moment, rather than using historical values.
Is this wrong? or is it better than x86 style cmpxchg? can we do
everything with this that we could do with a CAS which swaps historical
values?
> IOW, you need your own
> implementation of what you are trying to achieve (and not modifying
> CAS).
On one hand, plain CAS is the basic functionality. It is all you need
for some things and indeed, I can imagine for some things, it is
orthogonal; it is *what you must have* and so it would need to exist.
However, for many things you want an ABA-safe CAS. I would go as so far
as to say is a *second* basic functionality, since there is I think more
use for this than for a non-ABA safe CAS.
Windows offers this simply by offering double-word CAS (and requiring
developers to use it as a pointer-counter pairing, which is reasonable,
since you might want to use double-word really as double-word, so it has
other uses).
ARM offers that as well, but I believe there is no double-word CAS
support in the kernel. That would be one route (and indeed it seems odd
that the current code supports in a switch one, two and four byte CAS,
but not eight - perhaps is it not available on all >= v6 CPUs?).
The other would be to use the LL/SC functionality as described above.
The user of the current CAS cannot implement this behaviour on top of
the CAS that currently exists; he would have to write his own assembly.
My feeling in this, for what it's worth, is that this seems an excessive
requirement for something which is necessary for lock-free data
structures; they are becoming important as the number of cores in SMP
systems continues to double.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 15:13 ` Catalin Marinas
2009-11-24 15:15 ` Toby Douglass
@ 2009-11-24 15:33 ` Toby Douglass
1 sibling, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 15:33 UTC (permalink / raw)
To: linux-arm-kernel
I wrote:
> We have a CAS where the exchange and compare are guaranteed to be only those which exist continuously and without modification between the loading, compare and potential set of the destination.
Ah, this isn't true. The compare and exchange values could be modified
after their LDR but before the STREX. Apologies - basic mistake :-/
So in fact what you have is a guarantee you will only use values for
compare and exchange which exist *after* you've loaded your destination.
I don't think this is useful for anything; it's no different to having
them loaded before the LDREX. What's more, I think I can do this now
anyway, in GCC; pass pointers in and in the assembly, use "*exchange"
rather than "exchange".
I understand now what Catalin means about using LDREX to load them as
well (which you can't do, I believe; you can't nest LDREX on one core).
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 15:15 ` Toby Douglass
@ 2009-11-24 15:36 ` Russell King - ARM Linux
2009-11-24 16:20 ` Toby Douglass
` (2 more replies)
2009-11-25 1:24 ` Jamie Lokier
1 sibling, 3 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-24 15:36 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Nov 24, 2009 at 04:15:37PM +0100, Toby Douglass wrote:
> So, now I'm puzzled. I was under the impression it was possible to use
> LL/SC to solve ABA.
Well, what you can do is:
LL
compute new value
SC
The problem for LL/SC architectures is that this creates a big region
where we hope that no one else performs another 'load locked' operation.
LL/SC based architectures doing this traditionally fall fowl of a problem.
Consider the following:
do {
LL
complex computation to new value
SC
} while (store failed)
and have that running on two CPUs trying to modify the same location. It
can hang both CPUs up in that loop for a _very_ long time. This is
primerily why we've ended up with a CAS implementation, and everything
is implemented as:
do {
load current
compute new value
do {
LL old
old == current?
SC new
} while SC failed
} while old != current
(I argued initially for a LL / SC interface but the arguments I was making
were shot down by Linus et.al. as completely unreasonable and technically
idiotic.)
Actually, I'm rather surprised that our existing LDREX/STREX implementations
don't suffer from this problem. Consider two CPUs operating in lock-step:
CPU0 CPU1
ldrex
add ldrex
strex (fails) add
repeat strex (fails)
ldrex repeat
add ldrex
strex (fails) add
repeat strex (fails)
... repeat
...
In this scenario, the loops _never_ terminate. The usual solution is to
add delays into the loop which depend on the CPU number, thus ensuring
that two CPUs execute a different number of instructions each time around
the loop.
> My feeling in this, for what it's worth, is that this seems an excessive
> requirement for something which is necessary for lock-free data
> structures; they are becoming important as the number of cores in SMP
> systems continues to double.
Hasn't the lock-free issue been solved by things like rcu?
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 22:28 ` Jamie Lokier
2009-11-23 23:13 ` Russell King - ARM Linux
@ 2009-11-24 15:59 ` Catalin Marinas
2009-11-24 16:34 ` Toby Douglass
2 siblings, 0 replies; 37+ messages in thread
From: Catalin Marinas @ 2009-11-24 15:59 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, 2009-11-23 at 22:28 +0000, Jamie Lokier wrote:
> Toby Douglass wrote:
> > Load-linked/conditional-store architectures solve ABA by having the
> > store fail if the destination has been touched since the load was performed.
> >
> > Currently, the code appears to violate this, by repeating the CAS
> > *anyway*. In fact, the appropriate behaviour would seem to be *not* to
> > loop, but rather, to issue the ldrex/strex *once*, and indicate to the
> > user if the store succeed or failed.
>
> I believe Catalin's explained why it does not work even doing
> LDREX/STREX once, because the thread can pause before the LDREX. So
> you must begin fetching pointers after the LDREX.
>
> (At least I think so. I'm prepared to be shown to be wrong :-)
>
> If you're writing code intended for other LL/SC architectures too, and
> following Catalin's suggestion to put LDR between LDREX and STREX,
> then you might have to check if the other architectures permit loads
> between the LL and SC.
That's a good point. ARM doesn't allow this either, though probably
current implementations don't have any problem with it. From the ARM ARM
A3.4.5:
An implementation might clear an exclusive monitor between the
LDREX and the STREX, without any application-related cause. For
example, this might happen because of cache evictions. Code
written for such an implementation must avoid having any
*explicit memory accesses* or cache maintenance operations
between the LDREX and STREX instructions
So on some future implementations it could live-lock.
--
Catalin
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 15:36 ` Russell King - ARM Linux
@ 2009-11-24 16:20 ` Toby Douglass
2009-11-24 16:27 ` Catalin Marinas
2009-11-24 17:14 ` Toby Douglass
2 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 16:20 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> On Tue, Nov 24, 2009 at 04:15:37PM +0100, Toby Douglass wrote:
>> So, now I'm puzzled. I was under the impression it was possible to use
>> LL/SC to solve ABA.
>
> Well, what you can do is:
>
> LL
> compute new value
> SC
>
> The problem for LL/SC architectures is that this creates a big region
> where we hope that no one else performs another 'load locked' operation.
Yes; live-lock.
> LL/SC based architectures doing this traditionally fall fowl of a problem.
> Consider the following:
>
> do {
> LL
> complex computation to new value
> SC
> } while (store failed)
>
> and have that running on two CPUs trying to modify the same location. It
> can hang both CPUs up in that loop for a _very_ long time.
Yes, but this complex computation; in all of the data structures which
can be implemented using a lock-free algorithm, the only work that is
done is loading an exchange value and a compare value.
> (I argued initially for a LL / SC interface but the arguments I was making
> were shot down by Linus et.al. as completely unreasonable and technically
> idiotic.)
Were they?
> Actually, I'm rather surprised that our existing LDREX/STREX implementations
> don't suffer from this problem. Consider two CPUs operating in lock-step:
>
> CPU0 CPU1
> ldrex
> add ldrex
> strex (fails) add
> repeat strex (fails)
> ldrex repeat
> add ldrex
> strex (fails) add
> repeat strex (fails)
> ... repeat
> ...
>
> In this scenario, the loops _never_ terminate. The usual solution is to
> add delays into the loop which depend on the CPU number, thus ensuring
> that two CPUs execute a different number of instructions each time around
> the loop.
Yes. Algorithmic back-off is the standard solution. For light use,
it's excessive. TBH, I doubt many are using this code, since the kernel
version of it didn't have memory barriers until recently and GCC's
__sync_synchronize() does nothing on ARM.
>> My feeling in this, for what it's worth, is that this seems an excessive
>> requirement for something which is necessary for lock-free data
>> structures; they are becoming important as the number of cores in SMP
>> systems continues to double.
>
> Hasn't the lock-free issue been solved by things like rcu?
I would say yes and no. Memory handling adds significant complexity to
a lock-free algorithm. For some algorithms, it's not necessary, for
others you can get by without it at a cost (using a freelist, for
example). It's a trade-off, rather than an outright solution.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 15:36 ` Russell King - ARM Linux
2009-11-24 16:20 ` Toby Douglass
@ 2009-11-24 16:27 ` Catalin Marinas
2009-11-24 17:14 ` Toby Douglass
2 siblings, 0 replies; 37+ messages in thread
From: Catalin Marinas @ 2009-11-24 16:27 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 2009-11-24 at 15:36 +0000, Russell King - ARM Linux wrote:
> On Tue, Nov 24, 2009 at 04:15:37PM +0100, Toby Douglass wrote:
> > So, now I'm puzzled. I was under the impression it was possible to use
> > LL/SC to solve ABA.
>
> Well, what you can do is:
>
> LL
> compute new value
> SC
>
> The problem for LL/SC architectures is that this creates a big region
> where we hope that no one else performs another 'load locked' operation.
As I replied to Jamie, the long computation above may introduce some
live-lock situation on some hardware implementations (I think in general
it needs to avoid cache line evictions).
> (I argued initially for a LL / SC interface but the arguments I was making
> were shot down by Linus et.al. as completely unreasonable and technically
> idiotic.)
Maybe because other architectures don't allow for many instructions
between LL and SC.
> Actually, I'm rather surprised that our existing LDREX/STREX implementations
> don't suffer from this problem. Consider two CPUs operating in lock-step:
>
> CPU0 CPU1
> ldrex
> add ldrex
> strex (fails) add
> repeat strex (fails)
> ldrex repeat
> add ldrex
> strex (fails) add
> repeat strex (fails)
> ... repeat
> ...
>
> In this scenario, the loops _never_ terminate. The usual solution is to
> add delays into the loop which depend on the CPU number, thus ensuring
> that two CPUs execute a different number of instructions each time around
> the loop.
I thought we already have a hardware solution or this. There was an
erratum for ARM11MPCore r0p0 where we had to add the delay in software
but it's no longer needed.
> > My feeling in this, for what it's worth, is that this seems an excessive
> > requirement for something which is necessary for lock-free data
> > structures; they are becoming important as the number of cores in SMP
> > systems continues to double.
>
> Hasn't the lock-free issue been solved by things like rcu?
RCU allows only one writer (and multiple readers). The completely
lock-free algorithm would allow multiple writers (multiple threads
popping from a stack in Toby's original problem).
--
Catalin
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-23 22:28 ` Jamie Lokier
2009-11-23 23:13 ` Russell King - ARM Linux
2009-11-24 15:59 ` Catalin Marinas
@ 2009-11-24 16:34 ` Toby Douglass
2 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 16:34 UTC (permalink / raw)
To: linux-arm-kernel
Jamie Lokier wrote:
> Toby Douglass wrote:
>> Load-linked/conditional-store architectures solve ABA by having the
>> store fail if the destination has been touched since the load was performed.
>>
>> Currently, the code appears to violate this, by repeating the CAS
>> *anyway*. In fact, the appropriate behaviour would seem to be *not* to
>> loop, but rather, to issue the ldrex/strex *once*, and indicate to the
>> user if the store succeed or failed.
>
> I believe Catalin's explained why it does not work even doing
> LDREX/STREX once, because the thread can pause before the LDREX. So
> you must begin fetching pointers after the LDREX.
Yes. I'm now getting around to digesting the posts after Catalin's and
I see here people have already pointed out the things I worked out =-)
> (At least I think so. I'm prepared to be shown to be wrong :-)
Well - it doesn't save you - you could LDREX, then LDR - and *then*
pause. Then the freelist completely changes, but the original element
is moved back to the top. Then you continue. You fail to swap, since
the destination has changed...ah, wait! it *does* save you. Of course,
you're dead if you loop again, so you need to exit at this point.
Now I'm wondering what I was thinking before, in my earlier email to the
list...!
> If you're writing code intended for other LL/SC architectures too, and
> following Catalin's suggestion to put LDR between LDREX and STREX,
> then you might have to check if the other architectures permit loads
> between the LL and SC.
As Catalin has pointed out, this in fact is not allowed on ARM. So it
doesn't save us (well, me :-) after all and in fact I look stuffed, as
far as LL/SC is concerned; I have to use double-word LDREX and a
pointer-counter pair.
So LL/SC on ARM is not an ABA solution. It merely is a RISC-style
implementation of what on CISC is a single instruction.
>> This requires a prototype change, because the return value is the
>> original value of the destination and so is unable to indicate, when
>> returning that value, if it is returned from a successful or
>> unsuccessful swap.
>
> Nonetheless, such a prototype change might be an improvement anyway.
>
> Some platforms provide compare_and_swap_bool() operations, which do as
> you describe: Compare, conditionally store, and return bool to indicate
> success. No loop.
Ah, well, I also like to receive the original value of the destination,
if the CAS had to load it, since I'm going to have to load it again
myself in most cases.
In the case where the CAS fails because the compare fails, the freshly
loaded destination value is useful.
In the case where the CAS fails because the destination is touched by
another thread, the freshly loaded destination is almost never useful
(unless it was modified by the other thread to be the same value).
> That could be an improvement for some algorithms, because often if the
> store doesn't happen, the *inputs* to compare_and_swap() need to be
> recalculated anyway before another try is likely to succeed.
Exactly!
In fact, this is what happens in *every* algorithm I'm implemented (a
grand total of two, mind you, stack and queue, but I know it also
happens for the singly linked list).
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 15:36 ` Russell King - ARM Linux
2009-11-24 16:20 ` Toby Douglass
2009-11-24 16:27 ` Catalin Marinas
@ 2009-11-24 17:14 ` Toby Douglass
2 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 17:14 UTC (permalink / raw)
To: linux-arm-kernel
I wrote:
> Russell King - ARM Linux wrote:
>> (I argued initially for a LL / SC interface but the arguments
>> I was making were shot down by Linus et.al. as completely
>> unreasonable and technically idiotic.)
> Were they?
My question is I think in hindsight misinterpretable.
What I mean is; you're a reasonable and intelligent bloke. You have
your views in this matter and Linus et al have theirs; they thought what
you said was crazy. Having heard them out, what did you think of what
they said? were there valid arguments in there?
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 11:19 ` Catalin Marinas
@ 2009-11-24 22:24 ` Toby Douglass
2009-11-25 11:11 ` Catalin Marinas
2009-11-24 22:34 ` Toby Douglass
1 sibling, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 22:24 UTC (permalink / raw)
To: linux-arm-kernel
Catalin Marinas wrote:
> On Tue, 2009-11-24 at 01:32 +0000, Jamie Lokier wrote:
>> Russell King - ARM Linux wrote:
>>> However, I don't think you've understood the original problem at all.
>> I think I have - I agreed with you and Catalin already that LL/SC does
>> not suffice. But do you mean that Catalin's suggestion to put the
>> LDREX before the LDR does not work either? (Maybe it needs a barrier
>> too).
> It definitely needs a barrier after the LDREX and maybe one after STREX
> but that depends on the semantics of such operation.
In the latest kernel, there is a memory barrier (the DMB instruction on
v7 and above) immediately before and immediately after the call to the
CAS function.
I thought about this a little. If the memory barrier is immediately
before and given the next instruction is the LDREX, *all* threads coming
to the LDREX *must* have preceeding them a DMB and so be up to date on
memory, regardless of pauses in thread execution.
This however is in conflict with Catalin's comments.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 11:19 ` Catalin Marinas
2009-11-24 22:24 ` Toby Douglass
@ 2009-11-24 22:34 ` Toby Douglass
2009-11-24 22:56 ` Russell King - ARM Linux
1 sibling, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 22:34 UTC (permalink / raw)
To: linux-arm-kernel
I wrote:
> I thought about this a little. If the memory barrier is immediately
> before and given the next instruction is the LDREX, *all* threads coming
> to the LDREX *must* have preceeding them a DMB and so be up to date on
> memory, regardless of pauses in thread execution.
Additionally; the DMB afterwards seemed to have no value. You could
perform the STREX and then your thread could pause indefinitely and were
you in a situation where you store was not immediately visible or
correctly ordered to another thread, that thread would then read the old
value.
A more common example would be that another thread is reading the
destination value for some reason, and reads after the STREX and before
the trailing DMB.
This implies you need a DMB as the instruction immediately preceding
every read and every write of destination. On x86/x64, all the atomic
ops (e.g. the LOCK prefix) act as full memory barriers.
Note that I'm not yet fully conversant with the issues in multi-core
threading, so I may be writing rubbish :-)
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 22:34 ` Toby Douglass
@ 2009-11-24 22:56 ` Russell King - ARM Linux
2009-11-25 0:34 ` Toby Douglass
0 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-24 22:56 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Nov 24, 2009 at 11:34:56PM +0100, Toby Douglass wrote:
> I wrote:
>> I thought about this a little. If the memory barrier is immediately
>> before and given the next instruction is the LDREX, *all* threads
>> coming to the LDREX *must* have preceeding them a DMB and so be up to
>> date on memory, regardless of pauses in thread execution.
>
> Additionally; the DMB afterwards seemed to have no value. You could
> perform the STREX and then your thread could pause indefinitely and were
> you in a situation where you store was not immediately visible or
> correctly ordered to another thread, that thread would then read the old
> value.
The two DMBs are there to prevent other loads and stores on the _local_
CPU being speculated inside the compare-and-swap operation, either by
the compiler or the CPU.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 22:56 ` Russell King - ARM Linux
@ 2009-11-25 0:34 ` Toby Douglass
0 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-25 0:34 UTC (permalink / raw)
To: linux-arm-kernel
Russell King - ARM Linux wrote:
> On Tue, Nov 24, 2009 at 11:34:56PM +0100, Toby Douglass wrote:
>> Additionally; the DMB afterwards seemed to have no value. You could
>> perform the STREX and then your thread could pause indefinitely and were
>> you in a situation where you store was not immediately visible or
>> correctly ordered to another thread, that thread would then read the old
>> value.
> The two DMBs are there to prevent other loads and stores on the _local_
> CPU being speculated inside the compare-and-swap operation, either by
> the compiler or the CPU.
Ahhh!
Forest, trees, etc...
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 15:15 ` Toby Douglass
2009-11-24 15:36 ` Russell King - ARM Linux
@ 2009-11-25 1:24 ` Jamie Lokier
2009-11-26 16:14 ` Toby Douglass
1 sibling, 1 reply; 37+ messages in thread
From: Jamie Lokier @ 2009-11-25 1:24 UTC (permalink / raw)
To: linux-arm-kernel
Toby Douglass wrote:
> Catalin Marinas wrote:
> Interestingly, I saw a day or two ago that there is a double-word
> version of LDREX. The atomic-ptr project, which forms the basis of the
> GCC garbage collector, relies on this; it does not in fact use LL/SC on
> ARM, but rather uses double-word CAS using a pointer-counter pair.
Note that pointer-counter is not really reliable.
If a thread is suspended for the time taken for exactly 2^32 operations
by other threads, it fails. (Assuming 32-bit pointer + 32-bit counter).
That's unlikely, but not impossible.
Consider a program with 100 threads and one CPU, each thread in a
tight loop doing atomic_op(&p) and some other things. It's quite
reasonable for 1 thread to be descheduled for the amount of time taken
for 2^32 iterations from all the other threads, if the pointer-counter
CAS operation is fast, which it increasingly is.
Remember, this is one CPU so it's in L1 cache, and modern CPUs execute
2^32 operations in a single second. CAS operations are not
single-cycle _yet_, but they continue to get faster.
With 100 threads running without sleeping, the kernel may well
deschedule one of them for several seconds as it cycles through them
all and gives them "non-interactive" timeslices.
It's also quite likely to wrap if someone used kill -STOP, and later
kill -CONT on a thread, or if they're debugging one of the threads.
Of course even if the counter wraps, you still have to be 1/2^32
unlucky to see the same value. But that's enough to make it unreliable.
-- Jamie
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-24 22:24 ` Toby Douglass
@ 2009-11-25 11:11 ` Catalin Marinas
2009-11-25 18:57 ` Toby Douglass
0 siblings, 1 reply; 37+ messages in thread
From: Catalin Marinas @ 2009-11-25 11:11 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 2009-11-24 at 22:24 +0000, Toby Douglass wrote:
> Catalin Marinas wrote:
> > On Tue, 2009-11-24 at 01:32 +0000, Jamie Lokier wrote:
> >> Russell King - ARM Linux wrote:
>
> >>> However, I don't think you've understood the original problem at all.
>
> >> I think I have - I agreed with you and Catalin already that LL/SC does
> >> not suffice. But do you mean that Catalin's suggestion to put the
> >> LDREX before the LDR does not work either? (Maybe it needs a barrier
> >> too).
>
> > It definitely needs a barrier after the LDREX and maybe one after STREX
> > but that depends on the semantics of such operation.
>
> In the latest kernel, there is a memory barrier (the DMB instruction on
> v7 and above) immediately before and immediately after the call to the
> CAS function.
>
> I thought about this a little. If the memory barrier is immediately
> before and given the next instruction is the LDREX, *all* threads coming
> to the LDREX *must* have preceeding them a DMB and so be up to date on
> memory, regardless of pauses in thread execution.
>
> This however is in conflict with Catalin's comments.
My comment was to a new implementation for the ABA issue using LDR for
the next pointer after LDREX and these would need to be separated by a
barrier. Looking at this a bit more, it isn't actually needed since the
subsequent LDR uses the pointer read by LDREX and thus creating an
address dependency ("the value returned by a read access is used to
compute the virtual address of a subsequent read or write access") which
implies strict ordering between LDREX and LDR (A3.8.2 in the ARM ARM).
But as I said, LDR between LDREX/STREX may no always work.
You may need barriers before and after you routine as we do in the CAS
case but it depends how you define your routine, you may ask the caller
of such routine to add the barriers explicitly.
--
Catalin
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-25 11:11 ` Catalin Marinas
@ 2009-11-25 18:57 ` Toby Douglass
0 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-25 18:57 UTC (permalink / raw)
To: linux-arm-kernel
So, I'm thinking now; where does this leave us?
We see that LDREX/SDREX cannot be used on their own in a single word CAS
to prevent ABA since we cannot rely on using LDR inside an LDREX/SDREX
pair. (You guys knew this; I'm just stating it here to lead in).
This means you must use a pointer-counter pair (with its potential for
failure) or memory handling; the same situation as we find on x86 and
x64, for example. That's okay - not perfect, but we're not deficient
compared to other common platforms.
I think however it leaves open one question; do we wish for the
LDREX/SDREX CAS to fail if destination has been modified, or to retry,
as it does now.
From a lock-free data structure point of view, failure on modification
is more efficient, since we will always need to recompute our exchange
value; going to the extra trouble of forcing the CAS (which will then
fail anyway, because the counter part of the pointer-counter pair will
not match) is a waste of time.
The behaviour we have now can be easily implemented (a loop) on top of
the non-retry CAS; but the non-retry CAS behaviour cannot be implemented
on top of what we have now.
However, the world and his dog expects x86/x64 style behaviour - we can
do better, but it may not be what users expect.
Ah - there is one other issue - it seems reasonable that the CAS that
exists now should be extended to support 8 byte swaps, since it already
supports 1, 2 and 4 byte swaps.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-25 1:24 ` Jamie Lokier
@ 2009-11-26 16:14 ` Toby Douglass
2009-11-27 1:37 ` Jamie Lokier
0 siblings, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-26 16:14 UTC (permalink / raw)
To: linux-arm-kernel
Jamie Lokier wrote:
> Toby Douglass wrote:
>> Interestingly, I saw a day or two ago that there is a double-word
>> version of LDREX. The atomic-ptr project, which forms the basis of the
>> GCC garbage collector, relies on this; it does not in fact use LL/SC on
>> ARM, but rather uses double-word CAS using a pointer-counter pair.
> Note that pointer-counter is not really reliable.
[snip case]
> Of course even if the counter wraps, you still have to be 1/2^32
> unlucky to see the same value. But that's enough to make it unreliable.
I concur, but it's not a problem on 64-bit.
^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken
2009-11-26 16:14 ` Toby Douglass
@ 2009-11-27 1:37 ` Jamie Lokier
0 siblings, 0 replies; 37+ messages in thread
From: Jamie Lokier @ 2009-11-27 1:37 UTC (permalink / raw)
To: linux-arm-kernel
Toby Douglass wrote:
> Jamie Lokier wrote:
> >Note that pointer-counter is not really reliable.
>
> [snip case]
>
> >Of course even if the counter wraps, you still have to be 1/2^32
> >unlucky to see the same value. But that's enough to make it unreliable.
>
> I concur, but it's not a problem on 64-bit.
Agree, about 64-bit. (Until we have much faster CPUs :-)
It's occurred to me that it's possible to make pointer-counter safe.
Simply, when about to wrap, send a signal to every thread. The signal
handler should check if the thread it has interrupted is inside a
pointer-counter critical section, and if yes, force the interrupted
code to synchronise with all the other threads in another way, such as
using a mutex.
-- Jamie
^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2009-11-27 1:37 UTC | newest]
Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-04 18:09 GCC built-in atomic operations and memory barriers Toby Douglass
2009-11-04 19:05 ` Russell King - ARM Linux
2009-11-04 20:12 ` Toby Douglass
2009-11-04 21:03 ` Russell King - ARM Linux
2009-11-06 19:10 ` Toby Douglass
2009-11-04 22:09 ` Gilles Chanteperdrix
2009-11-06 19:17 ` Toby Douglass
2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass
2009-11-23 15:08 ` Russell King - ARM Linux
2009-11-23 19:10 ` Toby Douglass
2009-11-23 20:06 ` Russell King - ARM Linux
2009-11-23 20:34 ` Toby Douglass
2009-11-23 15:13 ` Catalin Marinas
2009-11-24 15:15 ` Toby Douglass
2009-11-24 15:36 ` Russell King - ARM Linux
2009-11-24 16:20 ` Toby Douglass
2009-11-24 16:27 ` Catalin Marinas
2009-11-24 17:14 ` Toby Douglass
2009-11-25 1:24 ` Jamie Lokier
2009-11-26 16:14 ` Toby Douglass
2009-11-27 1:37 ` Jamie Lokier
2009-11-24 15:33 ` Toby Douglass
2009-11-23 15:34 ` Catalin Marinas
2009-11-23 16:40 ` Toby Douglass
2009-11-23 22:28 ` Jamie Lokier
2009-11-23 23:13 ` Russell King - ARM Linux
2009-11-24 1:32 ` Jamie Lokier
2009-11-24 11:19 ` Catalin Marinas
2009-11-24 22:24 ` Toby Douglass
2009-11-25 11:11 ` Catalin Marinas
2009-11-25 18:57 ` Toby Douglass
2009-11-24 22:34 ` Toby Douglass
2009-11-24 22:56 ` Russell King - ARM Linux
2009-11-25 0:34 ` Toby Douglass
2009-11-24 9:38 ` Toby Douglass
2009-11-24 15:59 ` Catalin Marinas
2009-11-24 16:34 ` Toby Douglass
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.