single copy atomicity for double load/stores on 32-bit systems

* single copy atomicity for double load/stores on 32-bit systems
@ 2019-05-30 18:22 Vineet Gupta
  2019-05-30 18:53 ` Paul E. McKenney
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Vineet Gupta @ 2019-05-30 18:22 UTC (permalink / raw)
  To: Peter Zijlstra, Will Deacon, Paul E. McKenney; +Cc: arcml, lkml, linux-arch

Hi Peter,

Had an interesting lunch time discussion with our hardware architects pertinent to
"minimal guarantees expected of a CPU" section of memory-barriers.txt

|  (*) These guarantees apply only to properly aligned and sized scalar
|     variables.  "Properly sized" currently means variables that are
|     the same size as "char", "short", "int" and "long".  "Properly
|     aligned" means the natural alignment, thus no constraints for
|     "char", two-byte alignment for "short", four-byte alignment for
|     "int", and either four-byte or eight-byte alignment for "long",
|     on 32-bit and 64-bit systems, respectively.

I'm not sure how to interpret "natural alignment" for the case of double
load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)

I presume (and this is the question) that the LKMM doesn't expect such 8-byte
loads/stores to be atomic unless they are 8-byte aligned.

ARMv7 arch ref manual seems to confirm this. Quoting

| LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
| VSTM, and VSTR instructions are executed as a sequence of word-aligned word
| accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
| subsequence of two or more word accesses from the sequence might not exhibit
| single-copy atomicity

While it seems reasonable from a hardware POV not to implement such atomicity by
default, it puts an additional burden on application writers. They could be
happily using a lockless algorithm with just a shared flag between 2 threads,
w/o any explicit synchronization, and then upgrade to a new compiler that
aggressively "packs" structs, rendering a long long 32-bit aligned (vs. 64-bit
before) and causing the code to suddenly stop working. Is the onus on them to
declare such memory as C11 atomic or some such?
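
For reference, a minimal sketch of what "declare such memory as C11 atomic"
could look like in user-space code. The names shared, publish() and snapshot()
are hypothetical, and nothing here is kernel API. With an _Atomic object the
implementation has to make each load and store indivisible, and it is free to
give the type stricter alignment than plain long long or to fall back to a
lock where the hardware cannot do it directly.

```c
#include <stdatomic.h>

/* Hypothetical shared record: a plain flag next to a 64-bit payload. */
struct shared {
	int flag;
	_Atomic long long value;	/* every access to this member is atomic */
};

/* Static storage, so the object starts out zero-initialized. */
static struct shared demo_state;

/* Writer side: store the whole 64-bit value in one indivisible step. */
static void publish(struct shared *s, long long v)
{
	atomic_store_explicit(&s->value, v, memory_order_release);
}

/* Reader side: never observes half of an old value and half of a new one. */
static long long snapshot(struct shared *s)
{
	return atomic_load_explicit(&s->value, memory_order_acquire);
}
```

Whether the accesses compile to a single instruction can be probed with
atomic_is_lock_free(&s->value); the answer depends on the target.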

Thx,
-Vineet


* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-05-30 18:22 single copy atomicity for double load/stores on 32-bit systems Vineet Gupta
@ 2019-05-30 18:53 ` Paul E. McKenney
  2019-05-30 19:16   ` Vineet Gupta
  2019-05-31  8:25   ` Peter Zijlstra
  2019-05-31  8:21 ` Peter Zijlstra
  2019-05-31  9:41 ` David Laight
  2 siblings, 2 replies; 20+ messages in thread
From: Paul E. McKenney @ 2019-05-30 18:53 UTC (permalink / raw)
  To: Vineet Gupta; +Cc: Peter Zijlstra, Will Deacon, arcml, lkml, linux-arch

On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
> Hi Peter,
> 
> Had an interesting lunch time discussion with our hardware architects pertinent to
> "minimal guarantees expected of a CPU" section of memory-barriers.txt
> 
> 
> |  (*) These guarantees apply only to properly aligned and sized scalar
> |     variables.  "Properly sized" currently means variables that are
> |     the same size as "char", "short", "int" and "long".  "Properly
> |     aligned" means the natural alignment, thus no constraints for
> |     "char", two-byte alignment for "short", four-byte alignment for
> |     "int", and either four-byte or eight-byte alignment for "long",
> |     on 32-bit and 64-bit systems, respectively.
> 
> 
> I'm not sure how to interpret "natural alignment" for the case of double
> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> 
> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
> be atomic unless 8-byte aligned

I would not expect 8-byte accesses to be atomic on 32-bit systems unless
some special instruction was in use.  But that usually means special
intrinsics or assembly code.

> ARMv7 arch ref manual seems to confirm this. Quoting
> 
> | LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
> | VSTM, and VSTR instructions are executed as a sequence of word-aligned word
> | accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
> | subsequence of two or more word accesses from the sequence might not exhibit
> | single-copy atomicity
> 
> While it seems reasonable from hardware pov to not implement such atomicity by
> default it seems there's an additional burden on application writers. They could
> be happily using a lockless algorithm with just a shared flag between 2 threads
> w/o need for any explicit synchronization. But upgrade to a new compiler which
> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
> causing the code to suddenly stop working. Is the onus on them to declare such
> memory as c11 atomic or some such.

There are also GCC extensions that allow specifying the alignment of
structure fields.
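
A minimal sketch of such an extension, assuming GCC or Clang. The struct name
shared_state and its members are made up for illustration. The aligned
attribute on the field forces 8-byte alignment of the member (and therefore
of the enclosing struct) even on an ABI that would otherwise allow 4 bytes,
and the static assertions turn a silent layout change into a build failure.

```c
#include <stddef.h>

/* Hypothetical structure shared between threads. */
struct shared_state {
	int flag;
	/* GCC/Clang extension: request 8-byte alignment for this field. */
	long long counter __attribute__((aligned(8)));
};

/* Fail the build if the layout ever stops being naturally aligned. */
_Static_assert(offsetof(struct shared_state, counter) % 8 == 0,
	       "counter must sit on an 8-byte boundary within the struct");
_Static_assert(_Alignof(struct shared_state) >= 8,
	       "the struct itself must be 8-byte aligned");
```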

								Thanx, Paul



* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-05-30 18:53 ` Paul E. McKenney
@ 2019-05-30 19:16   ` Vineet Gupta
  2019-05-31  8:23     ` Peter Zijlstra
  2019-05-31  8:25   ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Vineet Gupta @ 2019-05-30 19:16 UTC (permalink / raw)
  To: paulmck; +Cc: Peter Zijlstra, Will Deacon, arcml, lkml, linux-arch

On 5/30/19 11:55 AM, Paul E. McKenney wrote:
>
>> I'm not sure how to interpret "natural alignment" for the case of double
>> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
>> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
>>
>> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
>> be atomic unless 8-byte aligned
> I would not expect 8-byte accesses to be atomic on 32-bit systems unless
> some special instruction was in use.  But that usually means special
> intrinsics or assembly code.

Thx for confirming.

In cases where we *do* expect the atomicity, there is some existing type
checking, but it isn't watertight.
e.g.

#define __smp_load_acquire(p)                        \
({                                    \
    typeof(*p) ___p1 = READ_ONCE(*p);                \
    compiletime_assert_atomic_type(*p);                \
    __smp_mb();                            \
    ___p1;                                \
})

#define compiletime_assert_atomic_type(t)                \
    compiletime_assert(__native_word(t),                \
        "Need native word sized stores/loads for atomicity.")

#define __native_word(t) \
    (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
     sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))


So it won't catch the use of a 4-byte-aligned long long, which gcc compiles to a
single double-load instruction.
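
To illustrate the gap, a minimal user-space sketch, assuming GCC or Clang.
native_word() mirrors the size-only test quoted above, while
naturally_aligned_type() and ll_align4 are hypothetical names that do not
exist in the kernel. The typedef stands in for an ABI that gives long long
only 4-byte alignment. Both macros take a type name.

```c
/* Size-only test, same shape as the __native_word() macro quoted above. */
#define native_word(t) \
	(sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
	 sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))

/* Hypothetical stricter test: alignment must be at least the size. */
#define naturally_aligned_type(t) (_Alignof(t) >= sizeof(t))

/*
 * GCC/Clang extension: on a typedef, the aligned attribute may lower the
 * alignment, which mimics an ABI that aligns long long to 4 bytes.
 */
typedef long long ll_align4 __attribute__((aligned(4)));

/* The size is unchanged, so any sizeof-based test treats both types alike. */
_Static_assert(sizeof(ll_align4) == sizeof(long long),
	       "alignment attribute must not change the size");
```

Because sizeof() is identical for both types, native_word() gives the same
answer for ll_align4 as for long long, whereas the alignment-aware test
rejects the under-aligned type at compile time with no runtime cost.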

Thx,
-Vineet


* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-05-30 18:22 single copy atomicity for double load/stores on 32-bit systems Vineet Gupta
  2019-05-30 18:53 ` Paul E. McKenney
@ 2019-05-31  8:21 ` Peter Zijlstra
  2019-06-03 18:08   ` Vineet Gupta
                     ` (2 more replies)
  2019-05-31  9:41 ` David Laight
  2 siblings, 3 replies; 20+ messages in thread
From: Peter Zijlstra @ 2019-05-31  8:21 UTC (permalink / raw)
  To: Vineet Gupta; +Cc: Will Deacon, Paul E. McKenney, arcml, lkml, linux-arch

On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
> Hi Peter,
> 
> Had an interesting lunch time discussion with our hardware architects pertinent to
> "minimal guarantees expected of a CPU" section of memory-barriers.txt
> 
> 
> |  (*) These guarantees apply only to properly aligned and sized scalar
> |     variables.  "Properly sized" currently means variables that are
> |     the same size as "char", "short", "int" and "long".  "Properly
> |     aligned" means the natural alignment, thus no constraints for
> |     "char", two-byte alignment for "short", four-byte alignment for
> |     "int", and either four-byte or eight-byte alignment for "long",
> |     on 32-bit and 64-bit systems, respectively.
> 
> 
> I'm not sure how to interpret "natural alignment" for the case of double
> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)

Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))

For any u64 type, that would give 8-byte alignment; the problem otherwise
being that your data can span two cache lines, pages, etc.
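
The expression above can be wrapped up as a small sketch. The helper names
addr_is_naturally_aligned() and is_naturally_aligned() are made up here and
are not existing kernel helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* True when addr is a multiple of the access size. */
static int addr_is_naturally_aligned(uintptr_t addr, size_t size)
{
	return addr % size == 0;
}

/* Same test for a typed pointer: !((uintptr_t)ptr % sizeof(*ptr)). */
#define is_naturally_aligned(ptr) \
	addr_is_naturally_aligned((uintptr_t)(ptr), sizeof(*(ptr)))
```

For an 8-byte object, address 0x1000 passes and 0x1004 fails, while 0x1004
is fine for a 4-byte object.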

> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
> be atomic unless 8-byte aligned
> 
> ARMv7 arch ref manual seems to confirm this. Quoting
> 
> | LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
> | VSTM, and VSTR instructions are executed as a sequence of word-aligned word
> | accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
> | subsequence of two or more word accesses from the sequence might not exhibit
> | single-copy atomicity
> 
> While it seems reasonable from hardware pov to not implement such atomicity by
> default it seems there's an additional burden on application writers. They could
> be happily using a lockless algorithm with just a shared flag between 2 threads
> w/o need for any explicit synchronization.

If you're that careless with lockless code, you deserve all the pain you
get.

> But upgrade to a new compiler which
> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
> causing the code to suddenly stop working. Is the onus on them to declare such
> memory as c11 atomic or some such.

When a programmer wants guarantees they already need to know wth they're
doing.

And I'll stand by my earlier conviction that any architecture that has a
native u64 (be it a 64bit arch or a 32bit with double-width
instructions) but has an ABI that allows u32 alignment on them is daft.



* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-05-30 19:16   ` Vineet Gupta
@ 2019-05-31  8:23     ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2019-05-31  8:23 UTC (permalink / raw)
  To: Vineet Gupta; +Cc: paulmck, Will Deacon, arcml, lkml, linux-arch

On Thu, May 30, 2019 at 07:16:36PM +0000, Vineet Gupta wrote:
> On 5/30/19 11:55 AM, Paul E. McKenney wrote:
> >
> >> I'm not sure how to interpret "natural alignment" for the case of double
> >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> >>
> >> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
> >> be atomic unless 8-byte aligned
> > I would not expect 8-byte accesses to be atomic on 32-bit systems unless
> > some special instruction was in use.  But that usually means special
> > intrinsics or assembly code.
> 
> Thx for confirming.
> 
> In cases where we *do* expect the atomicity, it seems there's some existing type
> checking but isn't water tight.
> e.g.
> 
> #define __smp_load_acquire(p)                        \
> ({                                    \
>     typeof(*p) ___p1 = READ_ONCE(*p);                \
>     compiletime_assert_atomic_type(*p);                \
>     __smp_mb();                            \
>     ___p1;                                \
> })
> 
> #define compiletime_assert_atomic_type(t)                \
>     compiletime_assert(__native_word(t),                \
>         "Need native word sized stores/loads for atomicity.")
> 
> #define __native_word(t) \
>     (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
>      sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
> 
> 
> So it won't catch the usage of 4 byte aligned long long which gcc targets to
> single double load instruction.

Yes, we didn't do those because that would result in runtime overhead.

We assume natural alignment for any type the hardware can do.


* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-05-30 18:53 ` Paul E. McKenney
  2019-05-30 19:16   ` Vineet Gupta
@ 2019-05-31  8:25   ` Peter Zijlstra
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2019-05-31  8:25 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Vineet Gupta, Will Deacon, arcml, lkml, linux-arch

On Thu, May 30, 2019 at 11:53:58AM -0700, Paul E. McKenney wrote:
> On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
> > Hi Peter,
> > 
> > Had an interesting lunch time discussion with our hardware architects pertinent to
> > "minimal guarantees expected of a CPU" section of memory-barriers.txt
> > 
> > 
> > |  (*) These guarantees apply only to properly aligned and sized scalar
> > |     variables.  "Properly sized" currently means variables that are
> > |     the same size as "char", "short", "int" and "long".  "Properly
> > |     aligned" means the natural alignment, thus no constraints for
> > |     "char", two-byte alignment for "short", four-byte alignment for
> > |     "int", and either four-byte or eight-byte alignment for "long",
> > |     on 32-bit and 64-bit systems, respectively.
> > 
> > 
> > I'm not sure how to interpret "natural alignment" for the case of double
> > load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> > alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> > 
> > I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
> > be atomic unless 8-byte aligned
> 
> I would not expect 8-byte accesses to be atomic on 32-bit systems unless
> some special instruction was in use.  But that usually means special
> intrinsics or assembly code.

If the GCC of said platform defaults to the double-word instructions for
long long, then I would very much expect natural alignment on it too.

If the feature is only available through inline asm or intrinsics, then
we can be a little more lenient perhaps.


* RE: single copy atomicity for double load/stores on 32-bit systems
  2019-05-30 18:22 single copy atomicity for double load/stores on 32-bit systems Vineet Gupta
  2019-05-30 18:53 ` Paul E. McKenney
  2019-05-31  8:21 ` Peter Zijlstra
@ 2019-05-31  9:41 ` David Laight
  2019-05-31 11:44   ` Paul E. McKenney
  2019-06-03 18:44   ` Vineet Gupta
  2 siblings, 2 replies; 20+ messages in thread
From: David Laight @ 2019-05-31  9:41 UTC (permalink / raw)
  To: 'Vineet Gupta', Peter Zijlstra, Will Deacon, Paul E. McKenney
  Cc: arcml, lkml, linux-arch

From: Vineet Gupta
> Sent: 30 May 2019 19:23
...
> While it seems reasonable from hardware pov to not implement such atomicity by
> default it seems there's an additional burden on application writers. They could
> be happily using a lockless algorithm with just a shared flag between 2 threads
> w/o need for any explicit synchronization. But upgrade to a new compiler which
> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
> causing the code to suddenly stop working. Is the onus on them to declare such
> memory as c11 atomic or some such.

A 'new' compiler can't suddenly change the alignment rules for structure elements.
The alignment rules will be part of the ABI.

More likely is that the structure itself is unexpectedly allocated on
an 8n+4 boundary due to code changes elsewhere.
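
A minimal sketch of how that can happen, assuming GCC or Clang. All names
are hypothetical, and the typedef only stands in for an ABI that aligns
64-bit integers to 4 bytes. Dropping one unrelated 32-bit member from the
enclosing structure moves the 64-bit value from an 8n offset to 8n+4, with
no change to the alignment rules themselves.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 32-bit ABI where 64-bit integers need only 4-byte alignment. */
typedef uint64_t u64_abi4 __attribute__((aligned(4)));

struct counter {
	u64_abi4 value;
};

/* Original layout: two 32-bit members happen to pad the counter to offset 8. */
struct before {
	uint32_t a;
	uint32_t b;
	struct counter c;
};

/* After an unrelated cleanup removes 'b', the counter lands at offset 4. */
struct after {
	uint32_t a;
	struct counter c;
};
```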

It is also worth noting that for complete portability only writes to
'full words' can be assumed atomic.
Some old Alphas did RMW cycles for byte writes.
(Although I suspect Linux doesn't support those any more.)

Even x86 can catch you out.
The bit operations will do wider RMW cycles than you expect.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-05-31  9:41 ` David Laight
@ 2019-05-31 11:44   ` Paul E. McKenney
  2019-06-03 18:44   ` Vineet Gupta
  1 sibling, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2019-05-31 11:44 UTC (permalink / raw)
  To: David Laight
  Cc: 'Vineet Gupta',
	Peter Zijlstra, Will Deacon, arcml, lkml, linux-arch

On Fri, May 31, 2019 at 09:41:17AM +0000, David Laight wrote:
> From: Vineet Gupta
> > Sent: 30 May 2019 19:23
> ...
> > While it seems reasonable from hardware pov to not implement such atomicity by
> > default it seems there's an additional burden on application writers. They could
> > be happily using a lockless algorithm with just a shared flag between 2 threads
> > w/o need for any explicit synchronization. But upgrade to a new compiler which
> > aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
> > causing the code to suddenly stop working. Is the onus on them to declare such
> > memory as c11 atomic or some such.
> 
> A 'new' compiler can't suddenly change the alignment rules for structure elements.
> The alignment rules will be part of the ABI.
> 
> More likely is that the structure itself is unexpectedly allocated on
> an 8n+4 boundary due to code changes elsewhere.
> 
> It is also worth noting that for complete portability only writes to
> 'full words' can be assumed atomic.
> Some old Alpha's did RMW cycles for byte writes.
> (Although I suspect Linux doesn't support those any more.)

Any C11 or later compiler needs to generate the atomic RMW cycles if
needed in cases like this.  To see this, consider the following code:

	spinlock_t l1;
	spinlock_t l2;
	struct foo {
		char c1; // Protected by l1
		char c2; // Protected by l2
	} *fp;

	...

	spin_lock(&l1);
	fp->c1 = 42;
	do_something_protected_by_l1();
	spin_unlock(&l1);

	...

	spin_lock(&l2);
	fp->c2 = 206;
	do_something_protected_by_l2();
	spin_unlock(&l2);

A compiler that failed to generate atomic RMW code sequences for those
stores to ->c1 and ->c2 would be generating a data race in the object
code when there was no such race in the source code.  Kudos to Hans Boehm
for having browbeat compiler writers into accepting this restriction,
which was not particularly popular -- they wanted to be able to use
vector units and such.  ;-)
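
To make the hazard concrete, here is a single-threaded sketch that replays
one bad interleaving by hand. It models a hypothetical compiler that
implements the store to ->c1 as a non-atomic 16-bit read-modify-write. The
helper names are invented for this illustration, and the value 99 is
arbitrary.

```c
#include <stdint.h>
#include <string.h>

struct foo {
	char c1;	/* protected by l1 in the example above */
	char c2;	/* protected by l2 in the example above */
};

_Static_assert(sizeof(struct foo) == sizeof(uint16_t),
	       "the sketch assumes both bytes share one 16-bit container");

/* Step 1 of the wide store to c1: read the whole 16-bit container. */
static uint16_t wide_read(const struct foo *fp)
{
	uint16_t word;

	memcpy(&word, fp, sizeof(word));
	return word;
}

/* Step 2: patch c1 into the stale copy and write the container back. */
static void wide_write_c1(struct foo *fp, uint16_t stale, char val)
{
	struct foo tmp;

	memcpy(&tmp, &stale, sizeof(tmp));
	tmp.c1 = val;
	memcpy(fp, &tmp, sizeof(tmp));
}

/* Replays the interleaving and returns the final value of c2. */
static int replay_lost_update(void)
{
	struct foo f = { 0, 0 };
	uint16_t stale = wide_read(&f);	/* CPU 0, holding l1, starts its store */

	f.c2 = 99;			/* CPU 1, holding l2, stores to c2 */
	wide_write_c1(&f, stale, 42);	/* CPU 0 finishes and clobbers c2 */
	return f.c2;
}
```

Both stores were made under the correct lock, yet the store to c2 is lost:
the write-back of the stale container restores the old zero.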

> Even x86 can catch you out.
> The bit operations will do wider RMW cycles than you expect.

But does the compiler automatically generate these?

							Thanx, Paul

> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)



* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-05-31  8:21 ` Peter Zijlstra
@ 2019-06-03 18:08   ` Vineet Gupta
  2019-06-03 20:13     ` Paul E. McKenney
  2019-06-03 18:43   ` Vineet Gupta
  2019-07-01 20:05   ` Vineet Gupta
  2 siblings, 1 reply; 20+ messages in thread
From: Vineet Gupta @ 2019-06-03 18:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Will Deacon, Paul E. McKenney, arcml, lkml, linux-arch

On 5/31/19 1:21 AM, Peter Zijlstra wrote:
>> I'm not sure how to interpret "natural alignment" for the case of double
>> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
>> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
>
> For any u64 type, that would give 8 byte alignment. the problem
> otherwise being that your data spans two lines/pages etc..

Sure, but as Paul said, if the software doesn't expect them to be atomic by
default, they could span 2 hardware lines to keep the implementation simpler/sane.


* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-05-31  8:21 ` Peter Zijlstra
  2019-06-03 18:08   ` Vineet Gupta
@ 2019-06-03 18:43   ` Vineet Gupta
  2019-07-01 20:05   ` Vineet Gupta
  2 siblings, 0 replies; 20+ messages in thread
From: Vineet Gupta @ 2019-06-03 18:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Will Deacon, Paul E. McKenney, arcml, lkml, linux-arch

On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> And I'll stand by my earlier conviction that any architecture that has a
> native u64 (be it a 64bit arch or a 32bit with double-width
> instructions) but has an ABI that allows u32 alignment on them is daft.

Why? For 64-bit data on 32-bit systems, hardware doesn't claim to provide any
single-copy atomicity, and software doesn't expect it either.



* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-05-31  9:41 ` David Laight
  2019-05-31 11:44   ` Paul E. McKenney
@ 2019-06-03 18:44   ` Vineet Gupta
  1 sibling, 0 replies; 20+ messages in thread
From: Vineet Gupta @ 2019-06-03 18:44 UTC (permalink / raw)
  To: David Laight, Peter Zijlstra, Will Deacon, Paul E. McKenney
  Cc: linux-arch, arcml, lkml

On 5/31/19 2:41 AM, David Laight wrote:
>> While it seems reasonable from hardware pov to not implement such atomicity by
>> default it seems there's an additional burden on application writers. They could
>> be happily using a lockless algorithm with just a shared flag between 2 threads
>> w/o need for any explicit synchronization. But upgrade to a new compiler which
>> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
>> causing the code to suddenly stop working. Is the onus on them to declare such
>> memory as c11 atomic or some such.
> A 'new' compiler can't suddenly change the alignment rules for structure elements.
> The alignment rules will be part of the ABI.
> 
> More likely is that the structure itself is unexpectedly allocated on
> an 8n+4 boundary due to code changes elsewhere.

Indeed, that's what I meant: the layout changed, as is typical with a new compiler.


* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-06-03 18:08   ` Vineet Gupta
@ 2019-06-03 20:13     ` Paul E. McKenney
  2019-06-03 21:59       ` Vineet Gupta
  2019-06-04  7:41       ` Geert Uytterhoeven
  0 siblings, 2 replies; 20+ messages in thread
From: Paul E. McKenney @ 2019-06-03 20:13 UTC (permalink / raw)
  To: Vineet Gupta; +Cc: Peter Zijlstra, Will Deacon, arcml, lkml, linux-arch

On Mon, Jun 03, 2019 at 06:08:35PM +0000, Vineet Gupta wrote:
> On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> >> I'm not sure how to interpret "natural alignment" for the case of double
> >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> > Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> >
> > For any u64 type, that would give 8 byte alignment. the problem
> > otherwise being that your data spans two lines/pages etc..
> 
> Sure, but as Paul said, if the software doesn't expect them to be atomic by
> default, they could span 2 hardware lines to keep the implementation simpler/sane.

I could imagine 8-byte types being only four-byte aligned on 32-bit systems,
but it would be quite a surprise on 64-bit systems.

							Thanx, Paul



* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-06-03 20:13     ` Paul E. McKenney
@ 2019-06-03 21:59       ` Vineet Gupta
  2019-06-04  7:41       ` Geert Uytterhoeven
  1 sibling, 0 replies; 20+ messages in thread
From: Vineet Gupta @ 2019-06-03 21:59 UTC (permalink / raw)
  To: paulmck; +Cc: Peter Zijlstra, Will Deacon, arcml, lkml, linux-arch

On 6/3/19 1:13 PM, Paul E. McKenney wrote:
> On Mon, Jun 03, 2019 at 06:08:35PM +0000, Vineet Gupta wrote:
>> On 5/31/19 1:21 AM, Peter Zijlstra wrote:
>>>> I'm not sure how to interpret "natural alignment" for the case of double
>>>> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
>>>> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
>>> Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
>>>
>>> For any u64 type, that would give 8 byte alignment. the problem
>>> otherwise being that your data spans two lines/pages etc..
>> Sure, but as Paul said, if the software doesn't expect them to be atomic by
>> default, they could span 2 hardware lines to keep the implementation simpler/sane.
> I could imagine 8-byte types being only four-byte aligned on 32-bit systems,
> but it would be quite a surprise on 64-bit systems.

Totally agree !

Thx,
-Vineet


* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-06-03 20:13     ` Paul E. McKenney
  2019-06-03 21:59       ` Vineet Gupta
@ 2019-06-04  7:41       ` Geert Uytterhoeven
  2019-06-06  9:43         ` Paul E. McKenney
  1 sibling, 1 reply; 20+ messages in thread
From: Geert Uytterhoeven @ 2019-06-04  7:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Vineet Gupta, Peter Zijlstra, Will Deacon, arcml, lkml, linux-arch

Hi Paul,

On Mon, Jun 3, 2019 at 10:14 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> On Mon, Jun 03, 2019 at 06:08:35PM +0000, Vineet Gupta wrote:
> > On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> > >> I'm not sure how to interpret "natural alignment" for the case of double
> > >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> > >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> > > Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> > >
> > > For any u64 type, that would give 8 byte alignment. the problem
> > > otherwise being that your data spans two lines/pages etc..
> >
> > Sure, but as Paul said, if the software doesn't expect them to be atomic by
> > default, they could span 2 hardware lines to keep the implementation simpler/sane.
>
> I could imagine 8-byte types being only four-byte aligned on 32-bit systems,
> but it would be quite a surprise on 64-bit systems.

Or two-byte aligned?

M68k started with a 16-bit data bus, and alignment rules were retained
when gaining a wider data bus.

BTW, do any platforms have issues with atomicity of 4-byte types on
16-bit data buses? I believe some embedded ARM or PowerPC do have
such buses.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-06-04  7:41       ` Geert Uytterhoeven
@ 2019-06-06  9:43         ` Paul E. McKenney
  2019-06-06  9:53           ` Geert Uytterhoeven
  2019-06-06 16:34           ` David Laight
  0 siblings, 2 replies; 20+ messages in thread
From: Paul E. McKenney @ 2019-06-06  9:43 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Vineet Gupta, Peter Zijlstra, Will Deacon, arcml, lkml, linux-arch

On Tue, Jun 04, 2019 at 09:41:04AM +0200, Geert Uytterhoeven wrote:
> Hi Paul,
> 
> On Mon, Jun 3, 2019 at 10:14 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> > On Mon, Jun 03, 2019 at 06:08:35PM +0000, Vineet Gupta wrote:
> > > On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> > > >> I'm not sure how to interpret "natural alignment" for the case of double
> > > >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> > > >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> > > > Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> > > >
> > > > For any u64 type, that would give 8 byte alignment. the problem
> > > > otherwise being that your data spans two lines/pages etc..
> > >
> > > Sure, but as Paul said, if the software doesn't expect them to be atomic by
> > > default, they could span 2 hardware lines to keep the implementation simpler/sane.
> >
> > I could imagine 8-byte types being only four-byte aligned on 32-bit systems,
> > but it would be quite a surprise on 64-bit systems.
> 
> Or two-byte aligned?
> 
> M68k started with a 16-bit data bus, and alignment rules were retained
> when gaining a wider data bus.
> 
> BTW, do any platforms have issues with atomicity of 4-byte types on
> 16-bit data buses? I believe some embedded ARM or PowerPC do have
> such buses.

But m68k is !SMP-only, correct?  If so, the only issues would be
interactions with interrupt handlers and the like, and doesn't current
m68k hardware use exact interrupts?  Or is it still possible to interrupt
an m68k in the middle of an instruction like it was in the bad old days?

							Thanx, Paul

> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> -- 
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds



* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-06-06  9:43         ` Paul E. McKenney
@ 2019-06-06  9:53           ` Geert Uytterhoeven
  2019-06-06 16:34           ` David Laight
  1 sibling, 0 replies; 20+ messages in thread
From: Geert Uytterhoeven @ 2019-06-06  9:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Vineet Gupta, Peter Zijlstra, Will Deacon, arcml, lkml, linux-arch

Hi Paul,

On Thu, Jun 6, 2019 at 11:43 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> On Tue, Jun 04, 2019 at 09:41:04AM +0200, Geert Uytterhoeven wrote:
> > On Mon, Jun 3, 2019 at 10:14 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> > > On Mon, Jun 03, 2019 at 06:08:35PM +0000, Vineet Gupta wrote:
> > > > On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> > > > >> I'm not sure how to interpret "natural alignment" for the case of double
> > > > >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> > > > >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> > > > > Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> > > > >
> > > > > For any u64 type, that would give 8 byte alignment. the problem
> > > > > otherwise being that your data spans two lines/pages etc..
> > > >
> > > > Sure, but as Paul said, if the software doesn't expect them to be atomic by
> > > > default, they could span 2 hardware lines to keep the implementation simpler/sane.
> > >
> > > I could imagine 8-byte types being only four-byte aligned on 32-bit systems,
> > > but it would be quite a surprise on 64-bit systems.
> >
> > Or two-byte aligned?
> >
> > M68k started with a 16-bit data bus, and alignment rules were retained
> > when gaining a wider data bus.
> >
> > BTW, do any platforms have issues with atomicity of 4-byte types on
> > 16-bit data buses? I believe some embedded ARM or PowerPC do have
> > such buses.
>
> But m68k is !SMP-only, correct?  If so, the only issues would be

M68k support in Linux is uniprocessor-only.

> interactions with interrupt handlers and the like, and doesn't current
> m68k hardware use exact interrupts?  Or is it still possible to interrupt
> an m68k in the middle of an instruction like it was in the bad old days?

TBH, I don't know.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* RE: single copy atomicity for double load/stores on 32-bit systems
  2019-06-06  9:43         ` Paul E. McKenney
  2019-06-06  9:53           ` Geert Uytterhoeven
@ 2019-06-06 16:34           ` David Laight
  2019-06-06 21:17             ` Paul E. McKenney
  1 sibling, 1 reply; 20+ messages in thread
From: David Laight @ 2019-06-06 16:34 UTC (permalink / raw)
  To: 'paulmck@linux.ibm.com', Geert Uytterhoeven
  Cc: Vineet Gupta, Peter Zijlstra, Will Deacon, arcml, lkml, linux-arch

From: Paul E. McKenney
> Sent: 06 June 2019 10:44
...
> But m68k is !SMP-only, correct?  If so, the only issues would be
> interactions with interrupt handlers and the like, and doesn't current
> m68k hardware use exact interrupts?  Or is it still possible to interrupt
> an m68k in the middle of an instruction like it was in the bad old days?

Hardware interrupts were always on instruction boundaries, the
mid-instruction interrupts would only happen for page faults (etc).

There were SMP m68k systems (but I can't remember one).
It was important to continue from a mid-instruction trap on the
same cpu - unless you could guarantee that all the cpus had
exactly the same version of the microcode.

In any case you could probably use the 'cas2' instruction
for an atomic 64-bit write.
OTOH setting that up was such a PITA it was always easier
to disable interrupts.

	David



* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-06-06 16:34           ` David Laight
@ 2019-06-06 21:17             ` Paul E. McKenney
  0 siblings, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2019-06-06 21:17 UTC (permalink / raw)
  To: David Laight
  Cc: Geert Uytterhoeven, Vineet Gupta, Peter Zijlstra, Will Deacon,
	arcml, lkml, linux-arch

On Thu, Jun 06, 2019 at 04:34:52PM +0000, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 06 June 2019 10:44
> ...
> > But m68k is !SMP-only, correct?  If so, the only issues would be
> > interactions with interrupt handlers and the like, and doesn't current
> > m68k hardware use exact interrupts?  Or is it still possible to interrupt
> > an m68k in the middle of an instruction like it was in the bad old days?
> 
> Hardware interrupts were always on instruction boundaries, the
> mid-instruction interrupts would only happen for page faults (etc).

OK, !SMP should be fine, then.

> There were SMP m68k systems (but I can't remember one).
> It was important to continue from a mid-instruction trap on the
> same cpu - unless you could guarantee that all the cpus had
> exactly the same version of the microcode.

Yuck!  ;-)

> In any case you could probably use the 'cas2' instruction
> for an atomic 64-bit write.
> OTOH setting that up was such a PITA it was always easier
> to disable interrupts.

Unless I am forgetting something, given that m68k is a 32-bit system,
we should be OK without an atomic 64-bit write.

							Thanx, Paul
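
For readers who do need a tear-free 64-bit value on a machine where only
32-bit accesses are single-copy atomic, here is a minimal sketch of one
common alternative. It is neither of the approaches David mentions (a
double compare-and-swap instruction or disabling interrupts); it swaps in
a sequence counter, similar in spirit to the kernel's u64_stats_sync
helpers. The names seq_u64, seq_u64_write, seq_u64_read and
seq_u64_roundtrip are hypothetical, and the sketch assumes a single
writer and C11 atomics.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical type: a 64-bit value kept as two 32-bit words plus a
 * sequence counter that is odd while an update is in flight. */
struct seq_u64 {
	atomic_uint seq;
	_Atomic uint32_t lo;
	_Atomic uint32_t hi;
};

/* Single writer assumed: bump seq to odd, write both halves, bump to even. */
static void seq_u64_write(struct seq_u64 *v, uint64_t val)
{
	unsigned int s = atomic_load_explicit(&v->seq, memory_order_relaxed);

	atomic_store_explicit(&v->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&v->lo, (uint32_t)val, memory_order_relaxed);
	atomic_store_explicit(&v->hi, (uint32_t)(val >> 32), memory_order_relaxed);
	atomic_store_explicit(&v->seq, s + 2, memory_order_release);
}

/* Readers retry until they see the same even sequence number before and
 * after reading both halves, so a torn pair is never returned. */
static uint64_t seq_u64_read(struct seq_u64 *v)
{
	unsigned int s1, s2;
	uint32_t lo, hi;

	do {
		s1 = atomic_load_explicit(&v->seq, memory_order_acquire);
		lo = atomic_load_explicit(&v->lo, memory_order_relaxed);
		hi = atomic_load_explicit(&v->hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&v->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);

	return ((uint64_t)hi << 32) | lo;
}

/* Statically allocated, hence zero-initialized, demo object. */
static struct seq_u64 demo_val;

/* Writes a value and reads it back through the retry loop. */
static uint64_t seq_u64_roundtrip(uint64_t val)
{
	seq_u64_write(&demo_val, val);
	return seq_u64_read(&demo_val);
}
```

The cost is that readers may spin while an update is in progress, which
is why this pattern suits rarely written, frequently read values.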



* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-05-31  8:21 ` Peter Zijlstra
  2019-06-03 18:08   ` Vineet Gupta
  2019-06-03 18:43   ` Vineet Gupta
@ 2019-07-01 20:05   ` Vineet Gupta
  2019-07-02 10:46     ` Will Deacon
  2 siblings, 1 reply; 20+ messages in thread
From: Vineet Gupta @ 2019-07-01 20:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Will Deacon, Paul E. McKenney, arcml, lkml, linux-arch

On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
>> Hi Peter,
>>
>> Had an interesting lunch time discussion with our hardware architects pertinent to
>> "minimal guarantees expected of a CPU" section of memory-barriers.txt
>>
>>
>> |  (*) These guarantees apply only to properly aligned and sized scalar
>> |     variables.  "Properly sized" currently means variables that are
>> |     the same size as "char", "short", "int" and "long".  "Properly
>> |     aligned" means the natural alignment, thus no constraints for
>> |     "char", two-byte alignment for "short", four-byte alignment for
>> |     "int", and either four-byte or eight-byte alignment for "long",
>> |     on 32-bit and 64-bit systems, respectively.
>>
>>
>> I'm not sure how to interpret "natural alignment" for the case of double
>> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
>> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> 
> Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> 
> For any u64 type, that would give 8 byte alignment. the problem
> otherwise being that your data spans two lines/pages etc..
> 
>> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
>> be atomic unless 8-byte aligned
>>
>> ARMv7 arch ref manual seems to confirm this. Quoting
>>
>> | LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
>> | VSTM, and VSTR instructions are executed as a sequence of word-aligned word
>> | accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
>> | subsequence of two or more word accesses from the sequence might not exhibit
>> | single-copy atomicity
>>
>> While it seems reasonable form hardware pov to not implement such atomicity by
>> default it seems there's an additional burden on application writers. They could
>> be happily using a lockless algorithm with just a shared flag between 2 threads
>> w/o need for any explicit synchronization.
> 
> If you're that careless with lockless code, you deserve all the pain you
> get.
> 
>> But upgrade to a new compiler which
>> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
>> causing the code to suddenly stop working. Is the onus on them to declare such
>> memory as c11 atomic or some such.
> 
> When a programmer wants guarantees they already need to know wth they're
> doing.
> 
> And I'll stand by my earlier conviction that any architecture that has a
> native u64 (be it a 64bit arch or a 32bit with double-width
> instructions) but has an ABI that allows u32 alignment on them is daft.

So I agree with Paul's assertion that it is strange for an 8-byte type to be
4-byte aligned on a 64-bit system, but is it totally broken even if the ISA of
said 64-bit arch allows LD/ST to be augmented with acq/rel respectively?

Say the ISA guarantees single-copy atomicity for aligned cases (i.e. for 8-byte
data only if it is naturally aligned), and lacking that the programmer needs to
use the proper acq/release.

In my earlier example of lockless code, we do assume that the programmer will
use a release in the update of the flag.
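
To make the alignment question concrete, here is a small sketch of
Peter's natural-alignment test applied to two layouts of the same
record. The struct names, the helper functions and the use of the
GCC/Clang `packed`/`aligned` attributes are assumptions made for
illustration only; they are not taken from any ABI discussed in the
thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Peter's definition from the thread, as a macro over a typed pointer. */
#define IS_NATURALLY_ALIGNED(ptr) (!((uintptr_t)(ptr) % sizeof(*(ptr))))

/* Same test on a raw address, so no misaligned pointer is ever formed. */
static int addr_is_natural(uintptr_t addr, size_t size)
{
	return !(addr % size);
}

/* Hypothetical "aggressively packed" layout: the struct starts on an
 * 8-byte boundary, but the u64 member lands at offset 4. */
struct packed_rec {
	uint32_t flag;
	uint64_t payload;
} __attribute__((packed, aligned(8)));

/* Same members with the u64 explicitly forced to 8-byte alignment. */
struct padded_rec {
	uint32_t flag;
	_Alignas(8) uint64_t payload;
};

static struct packed_rec packed_obj;
static struct padded_rec padded_obj;

/* Returns 0: base is 8-byte aligned, so base + 4 is not. */
static int packed_payload_is_natural(void)
{
	uintptr_t addr = (uintptr_t)&packed_obj +
			 offsetof(struct packed_rec, payload);

	return addr_is_natural(addr, sizeof(uint64_t));
}

/* Returns 1: the member sits on an 8-byte boundary. */
static int padded_payload_is_natural(void)
{
	return IS_NATURALLY_ALIGNED(&padded_obj.payload);
}
```

Under the memory-barriers.txt wording quoted above, only the second
layout is one where a double-width access could even be a candidate for
single-copy atomicity.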


* Re: single copy atomicity for double load/stores on 32-bit systems
  2019-07-01 20:05   ` Vineet Gupta
@ 2019-07-02 10:46     ` Will Deacon
  0 siblings, 0 replies; 20+ messages in thread
From: Will Deacon @ 2019-07-02 10:46 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: Peter Zijlstra, Will Deacon, Paul E. McKenney, arcml, lkml, linux-arch

On Mon, Jul 01, 2019 at 08:05:51PM +0000, Vineet Gupta wrote:
> On 5/31/19 1:21 AM, Peter Zijlstra wrote:
> > On Thu, May 30, 2019 at 11:22:42AM -0700, Vineet Gupta wrote:
> >> Had an interesting lunch time discussion with our hardware architects pertinent to
> >> "minimal guarantees expected of a CPU" section of memory-barriers.txt
> >>
> >>
> >> |  (*) These guarantees apply only to properly aligned and sized scalar
> >> |     variables.  "Properly sized" currently means variables that are
> >> |     the same size as "char", "short", "int" and "long".  "Properly
> >> |     aligned" means the natural alignment, thus no constraints for
> >> |     "char", two-byte alignment for "short", four-byte alignment for
> >> |     "int", and either four-byte or eight-byte alignment for "long",
> >> |     on 32-bit and 64-bit systems, respectively.
> >>
> >>
> >> I'm not sure how to interpret "natural alignment" for the case of double
> >> load/stores on 32-bit systems where the hardware and ABI allow for 4 byte
> >> alignment (ARCv2 LDD/STD, ARM LDRD/STRD ....)
> > 
> > Natural alignment: !((uintptr_t)ptr % sizeof(*ptr))
> > 
> > For any u64 type, that would give 8 byte alignment. the problem
> > otherwise being that your data spans two lines/pages etc..
> > 
> >> I presume (and the question) that lkmm doesn't expect such 8 byte load/stores to
> >> be atomic unless 8-byte aligned
> >>
> >> ARMv7 arch ref manual seems to confirm this. Quoting
> >>
> >> | LDM, LDC, LDC2, LDRD, STM, STC, STC2, STRD, PUSH, POP, RFE, SRS, VLDM, VLDR,
> >> | VSTM, and VSTR instructions are executed as a sequence of word-aligned word
> >> | accesses. Each 32-bit word access is guaranteed to be single-copy atomic. A
> >> | subsequence of two or more word accesses from the sequence might not exhibit
> >> | single-copy atomicity
> >>
> >> While it seems reasonable form hardware pov to not implement such atomicity by
> >> default it seems there's an additional burden on application writers. They could
> >> be happily using a lockless algorithm with just a shared flag between 2 threads
> >> w/o need for any explicit synchronization.
> > 
> > If you're that careless with lockless code, you deserve all the pain you
> > get.
> > 
> >> But upgrade to a new compiler which
> >> aggressively "packs" struct rendering long long 32-bit aligned (vs. 64-bit before)
> >> causing the code to suddenly stop working. Is the onus on them to declare such
> >> memory as c11 atomic or some such.
> > 
> > When a programmer wants guarantees they already need to know wth they're
> > doing.
> > 
> > And I'll stand by my earlier conviction that any architecture that has a
> > native u64 (be it a 64bit arch or a 32bit with double-width
> > instructions) but has an ABI that allows u32 alignment on them is daft.
> 
> So I agree with Paul's assertion that it is strange for an 8-byte type to be
> 4-byte aligned on a 64-bit system, but is it totally broken even if the ISA of
> said 64-bit arch allows LD/ST to be augmented with acq/rel respectively?
> 
> Say the ISA guarantees single-copy atomicity for aligned cases (i.e. for 8-byte
> data only if it is naturally aligned), and lacking that the programmer needs to
> use the proper acq/release.

Apologies if I'm missing some context here, but it's not clear to me why the
use of acquire/release instructions has anything to do with single-copy
atomicity of unaligned accesses. The ordering they provide doesn't
necessarily prevent tearing, although a CPU architecture could obviously
provide that guarantee if it wanted to. Generally though, I wouldn't expect
the two to go hand-in-hand like you're suggesting.

Will
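
Will's distinction between ordering and tearing can be illustrated with
a deterministic model. The sketch below assumes a 64-bit value that the
hardware accesses as two single-copy-atomic 32-bit words, as the ARMv7
text quoted above describes for LDRD/STRD. The type split_u64 and all
function names are hypothetical, and the single-threaded replay stands
in for one interleaving that a second CPU could legally observe; it is
not a model of any particular ISA.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit value accessed as two 32-bit words. */
struct split_u64 {
	_Atomic uint32_t lo;
	_Atomic uint32_t hi;
};

/* First word access of the modelled 64-bit store. */
static void store_lo(struct split_u64 *v, uint64_t val)
{
	atomic_store_explicit(&v->lo, (uint32_t)val, memory_order_relaxed);
}

/* Second word access, tagged release: it orders earlier accesses before
 * itself but does not fuse the two word stores into one. */
static void store_hi_release(struct split_u64 *v, uint64_t val)
{
	atomic_store_explicit(&v->hi, (uint32_t)(val >> 32),
			      memory_order_release);
}

/* Modelled 64-bit load, with acquire on the first word access. */
static uint64_t load_acquire(struct split_u64 *v)
{
	uint64_t hi = atomic_load_explicit(&v->hi, memory_order_acquire);
	uint64_t lo = atomic_load_explicit(&v->lo, memory_order_relaxed);

	return (hi << 32) | lo;
}

/* Replays one interleaving: the reader runs between the two halves of
 * the writer's store and sees a mix of old and new words. */
static uint64_t torn_read_demo(void)
{
	struct split_u64 v;
	uint64_t seen;

	atomic_init(&v.lo, 0);
	atomic_init(&v.hi, 0);

	store_lo(&v, 0x1111111122222222ULL);
	seen = load_acquire(&v);            /* reader slips in here */
	store_hi_release(&v, 0x1111111122222222ULL);

	return seen;
}
```

In this replay the reader gets the new low word and the old high word,
even though the accesses carry acquire and release ordering.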


Thread overview: 20+ messages
2019-05-30 18:22 single copy atomicity for double load/stores on 32-bit systems Vineet Gupta
2019-05-30 18:53 ` Paul E. McKenney
2019-05-30 19:16   ` Vineet Gupta
2019-05-31  8:23     ` Peter Zijlstra
2019-05-31  8:25   ` Peter Zijlstra
2019-05-31  8:21 ` Peter Zijlstra
2019-06-03 18:08   ` Vineet Gupta
2019-06-03 20:13     ` Paul E. McKenney
2019-06-03 21:59       ` Vineet Gupta
2019-06-04  7:41       ` Geert Uytterhoeven
2019-06-06  9:43         ` Paul E. McKenney
2019-06-06  9:53           ` Geert Uytterhoeven
2019-06-06 16:34           ` David Laight
2019-06-06 21:17             ` Paul E. McKenney
2019-06-03 18:43   ` Vineet Gupta
2019-07-01 20:05   ` Vineet Gupta
2019-07-02 10:46     ` Will Deacon
2019-05-31  9:41 ` David Laight
2019-05-31 11:44   ` Paul E. McKenney
2019-06-03 18:44   ` Vineet Gupta
