* bit fields && data tearing
@ 2014-07-12 18:13 ` Oleg Nesterov
From: Oleg Nesterov @ 2014-07-12 18:13 UTC (permalink / raw)
  To: Jakub Jelinek, Paul E. McKenney, Richard Henderson
  Cc: Benjamin Herrenschmidt, Miroslav Franc, Paul Mackerras,
	linuxppc-dev, linux-kernel

Hello,

I am not sure I should ask here, but since Documentation/memory-barriers.txt
mentions load/store tearing, perhaps my question is not completely off-topic...

I am fighting with a mysterious RHEL bug; it can be reproduced on ppc and s390
but not on x86. I finally seem to understand the problem, and I even wrote a
stupid kernel module to verify it, see it below at the end.

It triggers the problem immediately: kt_2() sees the wrong value in freeze_stop.
(If I turn ->freeze_stop into a "long", the problem goes away.)

So the question is: is this a gcc bug, or is the code below buggy?

If it is buggy, then perhaps memory-barriers.txt should mention that you need to
be careful with bit fields; even ACCESS_ONCE() obviously can't help.

Or does this just disclose my ignorance, and you need at least aligned(long)
after a bit field to be thread-safe? I thought the compiler should take care of
this and add the necessary alignment if (say) the CPU can't update a single
byte/uint.

gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC). Asm:

	0000000000000000 <.kt_2>:
	   0:	7c 08 02 a6 	mflr    r0
	   4:	fb 81 ff e0 	std     r28,-32(r1)
	   8:	fb a1 ff e8 	std     r29,-24(r1)
	   c:	fb c1 ff f0 	std     r30,-16(r1)
	  10:	fb e1 ff f8 	std     r31,-8(r1)
	  14:	eb c2 00 00 	ld      r30,0(r2)
	  18:	f8 01 00 10 	std     r0,16(r1)
	  1c:	f8 21 ff 71 	stdu    r1,-144(r1)
	  20:	7c 7d 1b 78 	mr      r29,r3
	  24:	3b e0 00 00 	li      r31,0
	  28:	78 3c 04 64 	rldicr  r28,r1,0,49
	  2c:	3b 9c 00 80 	addi    r28,r28,128
	  30:	48 00 00 2c 	b       5c <.kt_2+0x5c>
	  34:	60 00 00 00 	nop
	  38:	60 00 00 00 	nop
	  3c:	60 00 00 00 	nop
	  40:	93 fd 00 04 	stw     r31,4(r29)
	  44:	e8 9d 00 06 	lwa     r4,4(r29)
	  48:	7f 84 f8 00 	cmpw    cr7,r4,r31
	  4c:	40 de 00 4c 	bne-    cr7,98 <.kt_2+0x98>
	  50:	e8 1c 00 00 	ld      r0,0(r28)
	  54:	78 09 f7 e3 	rldicl. r9,r0,62,63
	  58:	40 c2 00 54 	bne-    ac <.kt_2+0xac>
	  5c:	48 00 00 01 	bl      5c <.kt_2+0x5c>
	  60:	60 00 00 00 	nop
	  64:	3b ff 00 01 	addi    r31,r31,1
	  68:	2f a3 00 00 	cmpdi   cr7,r3,0
	  6c:	7f ff 07 b4 	extsw   r31,r31
	  70:	41 9e ff d0 	beq+    cr7,40 <.kt_2+0x40>
	  74:	38 21 00 90 	addi    r1,r1,144
	  78:	38 60 00 00 	li      r3,0
	  7c:	e8 01 00 10 	ld      r0,16(r1)
	  80:	eb 81 ff e0 	ld      r28,-32(r1)
	  84:	eb a1 ff e8 	ld      r29,-24(r1)
	  88:	eb c1 ff f0 	ld      r30,-16(r1)
	  8c:	eb e1 ff f8 	ld      r31,-8(r1)
	  90:	7c 08 03 a6 	mtlr    r0
	  94:	4e 80 00 20 	blr
	  98:	e8 7e 80 28 	ld      r3,-32728(r30)
	  9c:	7f e5 fb 78 	mr      r5,r31
	  a0:	48 00 00 01 	bl      a0 <.kt_2+0xa0>
	  a4:	60 00 00 00 	nop
	  a8:	4b ff ff a8 	b       50 <.kt_2+0x50>
	  ac:	48 00 00 01 	bl      ac <.kt_2+0xac>
	  b0:	60 00 00 00 	nop
	  b4:	4b ff ff a8 	b       5c <.kt_2+0x5c>
	  b8:	60 00 00 00 	nop
	  bc:	60 00 00 00 	nop

	00000000000000c0 <.kt_1>:
	  c0:	7c 08 02 a6 	mflr    r0
	  c4:	fb 81 ff e0 	std     r28,-32(r1)
	  c8:	fb a1 ff e8 	std     r29,-24(r1)
	  cc:	fb c1 ff f0 	std     r30,-16(r1)
	  d0:	fb e1 ff f8 	std     r31,-8(r1)
	  d4:	eb c2 00 00 	ld      r30,0(r2)
	  d8:	f8 01 00 10 	std     r0,16(r1)
	  dc:	f8 21 ff 71 	stdu    r1,-144(r1)
	  e0:	7c 7d 1b 78 	mr      r29,r3
	  e4:	3b e0 00 00 	li      r31,0
	  e8:	78 3c 04 64 	rldicr  r28,r1,0,49
	  ec:	3b 9c 00 80 	addi    r28,r28,128
	  f0:	48 00 00 38 	b       128 <.kt_1+0x68>
	  f4:	60 00 00 00 	nop
	  f8:	60 00 00 00 	nop
	  fc:	60 00 00 00 	nop
	 100:	e8 1d 00 00 	ld      r0,0(r29)
	 104:	79 20 e8 0e 	rldimi  r0,r9,61,0
	 108:	f8 1d 00 00 	std     r0,0(r29)
	 10c:	80 1d 00 00 	lwz     r0,0(r29)
	 110:	54 00 1f 7e 	rlwinm  r0,r0,3,29,31
	 114:	7f 80 f8 00 	cmpw    cr7,r0,r31
	 118:	40 de 00 6c 	bne-    cr7,184 <.kt_1+0xc4>
	 11c:	e8 1c 00 00 	ld      r0,0(r28)
	 120:	78 09 f7 e3 	rldicl. r9,r0,62,63
	 124:	40 c2 00 70 	bne-    194 <.kt_1+0xd4>
	 128:	48 00 00 01 	bl      128 <.kt_1+0x68>
	 12c:	60 00 00 00 	nop
	 130:	3b ff 00 01 	addi    r31,r31,1
	 134:	2f a3 00 00 	cmpdi   cr7,r3,0
	 138:	7f ff 07 b4 	extsw   r31,r31
	 13c:	2f 1f 00 07 	cmpwi   cr6,r31,7
	 140:	7b e9 07 60 	clrldi  r9,r31,61
	 144:	40 9e 00 1c 	bne-    cr7,160 <.kt_1+0xa0>
	 148:	40 9a ff b8 	bne+    cr6,100 <.kt_1+0x40>
	 14c:	39 20 00 00 	li      r9,0
	 150:	3b e0 00 00 	li      r31,0
	 154:	4b ff ff ac 	b       100 <.kt_1+0x40>
	 158:	60 00 00 00 	nop
	 15c:	60 00 00 00 	nop
	 160:	38 21 00 90 	addi    r1,r1,144
	 164:	38 60 00 00 	li      r3,0
	 168:	e8 01 00 10 	ld      r0,16(r1)
	 16c:	eb 81 ff e0 	ld      r28,-32(r1)
	 170:	eb a1 ff e8 	ld      r29,-24(r1)
	 174:	eb c1 ff f0 	ld      r30,-16(r1)
	 178:	eb e1 ff f8 	ld      r31,-8(r1)
	 17c:	7c 08 03 a6 	mtlr    r0
	 180:	4e 80 00 20 	blr
	 184:	e8 7e 80 30 	ld      r3,-32720(r30)
	 188:	48 00 00 01 	bl      188 <.kt_1+0xc8>
	 18c:	60 00 00 00 	nop
	 190:	4b ff ff 8c 	b       11c <.kt_1+0x5c>
	 194:	48 00 00 01 	bl      194 <.kt_1+0xd4>
	 198:	60 00 00 00 	nop
	 19c:	4b ff ff 8c 	b       128 <.kt_1+0x68>

Unfortunately it tells me nothing; I do not know ppc.

Oleg.

--------------------------------------------------------------------------------
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kthread.h>

struct utrace {
	unsigned int resume:3;
	int freeze_stop;
};

static int kt_1(void *arg)
{
	struct utrace *u = arg;
	int r = 0;

	while (!kthread_should_stop()) {
		if (++r == 7)
			r = 0;

		u->resume = r;
		barrier();
		if (u->resume != r)
			printk(KERN_CRIT "BUG! bitfield\n");

		if (need_resched())
			schedule();
	}

	return 0;
}

static int kt_2(void *arg)
{
	struct utrace *u = arg;
	int f = 0;

	while (!kthread_should_stop()) {
		u->freeze_stop = ++f;
		barrier();
		if (u->freeze_stop != f)
			printk(KERN_CRIT "BUG! freeze_stop %d != %d\n", u->freeze_stop, f);
		
		if (need_resched())
			schedule();
	}

	return 0;
}

static struct task_struct *t_1, *t_2;

static struct utrace utrace;

static int __init mod_init(void)
{
	WARN_ON(IS_ERR(t_1 = kthread_run(kt_1, &utrace, "kt_1")));
	WARN_ON(IS_ERR(t_2 = kthread_run(kt_2, &utrace, "kt_2")));

	return 0;
}

static void __exit mod_exit(void)
{
	kthread_stop(t_1);
	kthread_stop(t_2);
}

MODULE_LICENSE("GPL");
module_init(mod_init);
module_exit(mod_exit);



* Re: bit fields && data tearing
  2014-07-12 18:13 ` Oleg Nesterov
@ 2014-07-12 20:51   ` Oleg Nesterov
From: Oleg Nesterov @ 2014-07-12 20:51 UTC (permalink / raw)
  To: Jakub Jelinek, Paul E. McKenney, Richard Henderson
  Cc: Benjamin Herrenschmidt, Miroslav Franc, Paul Mackerras,
	linuxppc-dev, linux-kernel, Dan

OK, it looks like this is a compiler bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

Thanks to Dan, who informed me privately.

On 07/12, Oleg Nesterov wrote:
>
> Hello,
>
> I am not sure I should ask here, but since Documentation/memory-barriers.txt
> mentions load/store tearing, perhaps my question is not completely off-topic...
>
> I am fighting with a mysterious RHEL bug; it can be reproduced on ppc and s390
> but not on x86. I finally seem to understand the problem, and I even wrote a
> stupid kernel module to verify it, see it below at the end.
>
> It triggers the problem immediately: kt_2() sees the wrong value in freeze_stop.
> (If I turn ->freeze_stop into a "long", the problem goes away.)
>
> So the question is: is this a gcc bug, or is the code below buggy?
>
> If it is buggy, then perhaps memory-barriers.txt should mention that you need to
> be careful with bit fields; even ACCESS_ONCE() obviously can't help.
>
> Or does this just disclose my ignorance, and you need at least aligned(long)
> after a bit field to be thread-safe? I thought the compiler should take care of
> this and add the necessary alignment if (say) the CPU can't update a single
> byte/uint.
>
> gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC). Asm:
>
> 	0000000000000000 <.kt_2>:
> 	   0:	7c 08 02 a6 	mflr    r0
> 	   4:	fb 81 ff e0 	std     r28,-32(r1)
> 	   8:	fb a1 ff e8 	std     r29,-24(r1)
> 	   c:	fb c1 ff f0 	std     r30,-16(r1)
> 	  10:	fb e1 ff f8 	std     r31,-8(r1)
> 	  14:	eb c2 00 00 	ld      r30,0(r2)
> 	  18:	f8 01 00 10 	std     r0,16(r1)
> 	  1c:	f8 21 ff 71 	stdu    r1,-144(r1)
> 	  20:	7c 7d 1b 78 	mr      r29,r3
> 	  24:	3b e0 00 00 	li      r31,0
> 	  28:	78 3c 04 64 	rldicr  r28,r1,0,49
> 	  2c:	3b 9c 00 80 	addi    r28,r28,128
> 	  30:	48 00 00 2c 	b       5c <.kt_2+0x5c>
> 	  34:	60 00 00 00 	nop
> 	  38:	60 00 00 00 	nop
> 	  3c:	60 00 00 00 	nop
> 	  40:	93 fd 00 04 	stw     r31,4(r29)
> 	  44:	e8 9d 00 06 	lwa     r4,4(r29)
> 	  48:	7f 84 f8 00 	cmpw    cr7,r4,r31
> 	  4c:	40 de 00 4c 	bne-    cr7,98 <.kt_2+0x98>
> 	  50:	e8 1c 00 00 	ld      r0,0(r28)
> 	  54:	78 09 f7 e3 	rldicl. r9,r0,62,63
> 	  58:	40 c2 00 54 	bne-    ac <.kt_2+0xac>
> 	  5c:	48 00 00 01 	bl      5c <.kt_2+0x5c>
> 	  60:	60 00 00 00 	nop
> 	  64:	3b ff 00 01 	addi    r31,r31,1
> 	  68:	2f a3 00 00 	cmpdi   cr7,r3,0
> 	  6c:	7f ff 07 b4 	extsw   r31,r31
> 	  70:	41 9e ff d0 	beq+    cr7,40 <.kt_2+0x40>
> 	  74:	38 21 00 90 	addi    r1,r1,144
> 	  78:	38 60 00 00 	li      r3,0
> 	  7c:	e8 01 00 10 	ld      r0,16(r1)
> 	  80:	eb 81 ff e0 	ld      r28,-32(r1)
> 	  84:	eb a1 ff e8 	ld      r29,-24(r1)
> 	  88:	eb c1 ff f0 	ld      r30,-16(r1)
> 	  8c:	eb e1 ff f8 	ld      r31,-8(r1)
> 	  90:	7c 08 03 a6 	mtlr    r0
> 	  94:	4e 80 00 20 	blr
> 	  98:	e8 7e 80 28 	ld      r3,-32728(r30)
> 	  9c:	7f e5 fb 78 	mr      r5,r31
> 	  a0:	48 00 00 01 	bl      a0 <.kt_2+0xa0>
> 	  a4:	60 00 00 00 	nop
> 	  a8:	4b ff ff a8 	b       50 <.kt_2+0x50>
> 	  ac:	48 00 00 01 	bl      ac <.kt_2+0xac>
> 	  b0:	60 00 00 00 	nop
> 	  b4:	4b ff ff a8 	b       5c <.kt_2+0x5c>
> 	  b8:	60 00 00 00 	nop
> 	  bc:	60 00 00 00 	nop
>
> 	00000000000000c0 <.kt_1>:
> 	  c0:	7c 08 02 a6 	mflr    r0
> 	  c4:	fb 81 ff e0 	std     r28,-32(r1)
> 	  c8:	fb a1 ff e8 	std     r29,-24(r1)
> 	  cc:	fb c1 ff f0 	std     r30,-16(r1)
> 	  d0:	fb e1 ff f8 	std     r31,-8(r1)
> 	  d4:	eb c2 00 00 	ld      r30,0(r2)
> 	  d8:	f8 01 00 10 	std     r0,16(r1)
> 	  dc:	f8 21 ff 71 	stdu    r1,-144(r1)
> 	  e0:	7c 7d 1b 78 	mr      r29,r3
> 	  e4:	3b e0 00 00 	li      r31,0
> 	  e8:	78 3c 04 64 	rldicr  r28,r1,0,49
> 	  ec:	3b 9c 00 80 	addi    r28,r28,128
> 	  f0:	48 00 00 38 	b       128 <.kt_1+0x68>
> 	  f4:	60 00 00 00 	nop
> 	  f8:	60 00 00 00 	nop
> 	  fc:	60 00 00 00 	nop
> 	 100:	e8 1d 00 00 	ld      r0,0(r29)
> 	 104:	79 20 e8 0e 	rldimi  r0,r9,61,0
> 	 108:	f8 1d 00 00 	std     r0,0(r29)
> 	 10c:	80 1d 00 00 	lwz     r0,0(r29)
> 	 110:	54 00 1f 7e 	rlwinm  r0,r0,3,29,31
> 	 114:	7f 80 f8 00 	cmpw    cr7,r0,r31
> 	 118:	40 de 00 6c 	bne-    cr7,184 <.kt_1+0xc4>
> 	 11c:	e8 1c 00 00 	ld      r0,0(r28)
> 	 120:	78 09 f7 e3 	rldicl. r9,r0,62,63
> 	 124:	40 c2 00 70 	bne-    194 <.kt_1+0xd4>
> 	 128:	48 00 00 01 	bl      128 <.kt_1+0x68>
> 	 12c:	60 00 00 00 	nop
> 	 130:	3b ff 00 01 	addi    r31,r31,1
> 	 134:	2f a3 00 00 	cmpdi   cr7,r3,0
> 	 138:	7f ff 07 b4 	extsw   r31,r31
> 	 13c:	2f 1f 00 07 	cmpwi   cr6,r31,7
> 	 140:	7b e9 07 60 	clrldi  r9,r31,61
> 	 144:	40 9e 00 1c 	bne-    cr7,160 <.kt_1+0xa0>
> 	 148:	40 9a ff b8 	bne+    cr6,100 <.kt_1+0x40>
> 	 14c:	39 20 00 00 	li      r9,0
> 	 150:	3b e0 00 00 	li      r31,0
> 	 154:	4b ff ff ac 	b       100 <.kt_1+0x40>
> 	 158:	60 00 00 00 	nop
> 	 15c:	60 00 00 00 	nop
> 	 160:	38 21 00 90 	addi    r1,r1,144
> 	 164:	38 60 00 00 	li      r3,0
> 	 168:	e8 01 00 10 	ld      r0,16(r1)
> 	 16c:	eb 81 ff e0 	ld      r28,-32(r1)
> 	 170:	eb a1 ff e8 	ld      r29,-24(r1)
> 	 174:	eb c1 ff f0 	ld      r30,-16(r1)
> 	 178:	eb e1 ff f8 	ld      r31,-8(r1)
> 	 17c:	7c 08 03 a6 	mtlr    r0
> 	 180:	4e 80 00 20 	blr
> 	 184:	e8 7e 80 30 	ld      r3,-32720(r30)
> 	 188:	48 00 00 01 	bl      188 <.kt_1+0xc8>
> 	 18c:	60 00 00 00 	nop
> 	 190:	4b ff ff 8c 	b       11c <.kt_1+0x5c>
> 	 194:	48 00 00 01 	bl      194 <.kt_1+0xd4>
> 	 198:	60 00 00 00 	nop
> 	 19c:	4b ff ff 8c 	b       128 <.kt_1+0x68>
>
> Unfortunately it tells me nothing; I do not know ppc.
>
> Oleg.
>
> --------------------------------------------------------------------------------
> #include <linux/module.h>
> #include <linux/kernel.h>
> #include <linux/kthread.h>
>
> struct utrace {
> 	unsigned int resume:3;
> 	int freeze_stop;
> };
>
> static int kt_1(void *arg)
> {
> 	struct utrace *u = arg;
> 	int r = 0;
>
> 	while (!kthread_should_stop()) {
> 		if (++r == 7)
> 			r = 0;
>
> 		u->resume = r;
> 		barrier();
> 		if (u->resume != r)
> 			printk(KERN_CRIT "BUG! bitfield\n");
>
> 		if (need_resched())
> 			schedule();
> 	}
>
> 	return 0;
> }
>
> static int kt_2(void *arg)
> {
> 	struct utrace *u = arg;
> 	int f = 0;
>
> 	while (!kthread_should_stop()) {
> 		u->freeze_stop = ++f;
> 		barrier();
> 		if (u->freeze_stop != f)
> 			printk(KERN_CRIT "BUG! freeze_stop %d != %d\n", u->freeze_stop, f);
>
> 		if (need_resched())
> 			schedule();
> 	}
>
> 	return 0;
> }
>
> static struct task_struct *t_1, *t_2;
>
> static struct utrace utrace;
>
> static int __init mod_init(void)
> {
> 	WARN_ON(IS_ERR(t_1 = kthread_run(kt_1, &utrace, "kt_1")));
> 	WARN_ON(IS_ERR(t_2 = kthread_run(kt_2, &utrace, "kt_2")));
>
> 	return 0;
> }
>
> static void __exit mod_exit(void)
> {
> 	kthread_stop(t_1);
> 	kthread_stop(t_2);
> }
>
> MODULE_LICENSE("GPL");
> module_init(mod_init);
> module_exit(mod_exit);



* Re: bit fields && data tearing
  2014-07-12 20:51   ` Oleg Nesterov
@ 2014-07-12 23:34     ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2014-07-12 23:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jakub Jelinek, Paul E. McKenney, Richard Henderson,
	Miroslav Franc, Paul Mackerras, linuxppc-dev, linux-kernel, Dan

On Sat, 2014-07-12 at 22:51 +0200, Oleg Nesterov wrote:
> OK, it looks like this is a compiler bug:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080
>
> Thanks to Dan, who informed me privately.

So yes, there is this compiler bug, which means a bitfield access can
cause a r-m-w access to a neighbouring field. But in general I would be
wary of bitfields anyway, since accessing them isn't going to be atomic;
it's too easy to get things wrong, and in most cases the benefit is yet
to be demonstrated.

In your example, I don't see the point of the bitfield.

Cheers,
Ben.

> On 07/12, Oleg Nesterov wrote:
> >
> > Hello,
> >
> > I am not sure I should ask here, but since Documentation/memory-barriers.txt
> > mentions load/store tearing, perhaps my question is not completely off-topic...
> >
> > I am fighting with a mysterious RHEL bug; it can be reproduced on ppc and s390
> > but not on x86. I finally seem to understand the problem, and I even wrote a
> > stupid kernel module to verify it, see it below at the end.
> >
> > It triggers the problem immediately: kt_2() sees the wrong value in freeze_stop.
> > (If I turn ->freeze_stop into a "long", the problem goes away.)
> >
> > So the question is: is this a gcc bug, or is the code below buggy?
> >
> > If it is buggy, then perhaps memory-barriers.txt should mention that you need to
> > be careful with bit fields; even ACCESS_ONCE() obviously can't help.
> >
> > Or does this just disclose my ignorance, and you need at least aligned(long)
> > after a bit field to be thread-safe? I thought the compiler should take care of
> > this and add the necessary alignment if (say) the CPU can't update a single
> > byte/uint.
> >
> > gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC). Asm:
> >
> > 	0000000000000000 <.kt_2>:
> > 	   0:	7c 08 02 a6 	mflr    r0
> > 	   4:	fb 81 ff e0 	std     r28,-32(r1)
> > 	   8:	fb a1 ff e8 	std     r29,-24(r1)
> > 	   c:	fb c1 ff f0 	std     r30,-16(r1)
> > 	  10:	fb e1 ff f8 	std     r31,-8(r1)
> > 	  14:	eb c2 00 00 	ld      r30,0(r2)
> > 	  18:	f8 01 00 10 	std     r0,16(r1)
> > 	  1c:	f8 21 ff 71 	stdu    r1,-144(r1)
> > 	  20:	7c 7d 1b 78 	mr      r29,r3
> > 	  24:	3b e0 00 00 	li      r31,0
> > 	  28:	78 3c 04 64 	rldicr  r28,r1,0,49
> > 	  2c:	3b 9c 00 80 	addi    r28,r28,128
> > 	  30:	48 00 00 2c 	b       5c <.kt_2+0x5c>
> > 	  34:	60 00 00 00 	nop
> > 	  38:	60 00 00 00 	nop
> > 	  3c:	60 00 00 00 	nop
> > 	  40:	93 fd 00 04 	stw     r31,4(r29)
> > 	  44:	e8 9d 00 06 	lwa     r4,4(r29)
> > 	  48:	7f 84 f8 00 	cmpw    cr7,r4,r31
> > 	  4c:	40 de 00 4c 	bne-    cr7,98 <.kt_2+0x98>
> > 	  50:	e8 1c 00 00 	ld      r0,0(r28)
> > 	  54:	78 09 f7 e3 	rldicl. r9,r0,62,63
> > 	  58:	40 c2 00 54 	bne-    ac <.kt_2+0xac>
> > 	  5c:	48 00 00 01 	bl      5c <.kt_2+0x5c>
> > 	  60:	60 00 00 00 	nop
> > 	  64:	3b ff 00 01 	addi    r31,r31,1
> > 	  68:	2f a3 00 00 	cmpdi   cr7,r3,0
> > 	  6c:	7f ff 07 b4 	extsw   r31,r31
> > 	  70:	41 9e ff d0 	beq+    cr7,40 <.kt_2+0x40>
> > 	  74:	38 21 00 90 	addi    r1,r1,144
> > 	  78:	38 60 00 00 	li      r3,0
> > 	  7c:	e8 01 00 10 	ld      r0,16(r1)
> > 	  80:	eb 81 ff e0 	ld      r28,-32(r1)
> > 	  84:	eb a1 ff e8 	ld      r29,-24(r1)
> > 	  88:	eb c1 ff f0 	ld      r30,-16(r1)
> > 	  8c:	eb e1 ff f8 	ld      r31,-8(r1)
> > 	  90:	7c 08 03 a6 	mtlr    r0
> > 	  94:	4e 80 00 20 	blr
> > 	  98:	e8 7e 80 28 	ld      r3,-32728(r30)
> > 	  9c:	7f e5 fb 78 	mr      r5,r31
> > 	  a0:	48 00 00 01 	bl      a0 <.kt_2+0xa0>
> > 	  a4:	60 00 00 00 	nop
> > 	  a8:	4b ff ff a8 	b       50 <.kt_2+0x50>
> > 	  ac:	48 00 00 01 	bl      ac <.kt_2+0xac>
> > 	  b0:	60 00 00 00 	nop
> > 	  b4:	4b ff ff a8 	b       5c <.kt_2+0x5c>
> > 	  b8:	60 00 00 00 	nop
> > 	  bc:	60 00 00 00 	nop
> >
> > 	00000000000000c0 <.kt_1>:
> > 	  c0:	7c 08 02 a6 	mflr    r0
> > 	  c4:	fb 81 ff e0 	std     r28,-32(r1)
> > 	  c8:	fb a1 ff e8 	std     r29,-24(r1)
> > 	  cc:	fb c1 ff f0 	std     r30,-16(r1)
> > 	  d0:	fb e1 ff f8 	std     r31,-8(r1)
> > 	  d4:	eb c2 00 00 	ld      r30,0(r2)
> > 	  d8:	f8 01 00 10 	std     r0,16(r1)
> > 	  dc:	f8 21 ff 71 	stdu    r1,-144(r1)
> > 	  e0:	7c 7d 1b 78 	mr      r29,r3
> > 	  e4:	3b e0 00 00 	li      r31,0
> > 	  e8:	78 3c 04 64 	rldicr  r28,r1,0,49
> > 	  ec:	3b 9c 00 80 	addi    r28,r28,128
> > 	  f0:	48 00 00 38 	b       128 <.kt_1+0x68>
> > 	  f4:	60 00 00 00 	nop
> > 	  f8:	60 00 00 00 	nop
> > 	  fc:	60 00 00 00 	nop
> > 	 100:	e8 1d 00 00 	ld      r0,0(r29)
> > 	 104:	79 20 e8 0e 	rldimi  r0,r9,61,0
> > 	 108:	f8 1d 00 00 	std     r0,0(r29)
> > 	 10c:	80 1d 00 00 	lwz     r0,0(r29)
> > 	 110:	54 00 1f 7e 	rlwinm  r0,r0,3,29,31
> > 	 114:	7f 80 f8 00 	cmpw    cr7,r0,r31
> > 	 118:	40 de 00 6c 	bne-    cr7,184 <.kt_1+0xc4>
> > 	 11c:	e8 1c 00 00 	ld      r0,0(r28)
> > 	 120:	78 09 f7 e3 	rldicl. r9,r0,62,63
> > 	 124:	40 c2 00 70 	bne-    194 <.kt_1+0xd4>
> > 	 128:	48 00 00 01 	bl      128 <.kt_1+0x68>
> > 	 12c:	60 00 00 00 	nop
> > 	 130:	3b ff 00 01 	addi    r31,r31,1
> > 	 134:	2f a3 00 00 	cmpdi   cr7,r3,0
> > 	 138:	7f ff 07 b4 	extsw   r31,r31
> > 	 13c:	2f 1f 00 07 	cmpwi   cr6,r31,7
> > 	 140:	7b e9 07 60 	clrldi  r9,r31,61
> > 	 144:	40 9e 00 1c 	bne-    cr7,160 <.kt_1+0xa0>
> > 	 148:	40 9a ff b8 	bne+    cr6,100 <.kt_1+0x40>
> > 	 14c:	39 20 00 00 	li      r9,0
> > 	 150:	3b e0 00 00 	li      r31,0
> > 	 154:	4b ff ff ac 	b       100 <.kt_1+0x40>
> > 	 158:	60 00 00 00 	nop
> > 	 15c:	60 00 00 00 	nop
> > 	 160:	38 21 00 90 	addi    r1,r1,144
> > 	 164:	38 60 00 00 	li      r3,0
> > 	 168:	e8 01 00 10 	ld      r0,16(r1)
> > 	 16c:	eb 81 ff e0 	ld      r28,-32(r1)
> > 	 170:	eb a1 ff e8 	ld      r29,-24(r1)
> > 	 174:	eb c1 ff f0 	ld      r30,-16(r1)
> > 	 178:	eb e1 ff f8 	ld      r31,-8(r1)
> > 	 17c:	7c 08 03 a6 	mtlr    r0
> > 	 180:	4e 80 00 20 	blr
> > 	 184:	e8 7e 80 30 	ld      r3,-32720(r30)
> > 	 188:	48 00 00 01 	bl      188 <.kt_1+0xc8>
> > 	 18c:	60 00 00 00 	nop
> > 	 190:	4b ff ff 8c 	b       11c <.kt_1+0x5c>
> > 	 194:	48 00 00 01 	bl      194 <.kt_1+0xd4>
> > 	 198:	60 00 00 00 	nop
> > 	 19c:	4b ff ff 8c 	b       128 <.kt_1+0x68>
> >
> > Unfortunately it tells me nothing; I do not know ppc.
> >
> > Oleg.
> >
> > --------------------------------------------------------------------------------
> > #include <linux/module.h>
> > #include <linux/kernel.h>
> > #include <linux/kthread.h>
> >
> > struct utrace {
> > 	unsigned int resume:3;
> > 	int freeze_stop;
> > };
> >
> > static int kt_1(void *arg)
> > {
> > 	struct utrace *u = arg;
> > 	int r = 0;
> >
> > 	while (!kthread_should_stop()) {
> > 		if (++r == 7)
> > 			r = 0;
> >
> > 		u->resume = r;
> > 		barrier();
> > 		if (u->resume != r)
> > 			printk(KERN_CRIT "BUG! bitfield\n");
> >
> > 		if (need_resched())
> > 			schedule();
> > 	}
> >
> > 	return 0;
> > }
> >
> > static int kt_2(void *arg)
> > {
> > 	struct utrace *u = arg;
> > 	int f = 0;
> >
> > 	while (!kthread_should_stop()) {
> > 		u->freeze_stop = ++f;
> > 		barrier();
> > 		if (u->freeze_stop != f)
> > 			printk(KERN_CRIT "BUG! freeze_stop %d != %d\n", u->freeze_stop, f);
> >
> > 		if (need_resched())
> > 			schedule();
> > 	}
> >
> > 	return 0;
> > }
> >
> > static struct task_struct *t_1, *t_2;
> >
> > static struct utrace utrace;
> >
> > static int __init mod_init(void)
> > {
> > 	WARN_ON(IS_ERR(t_1 = kthread_run(kt_1, &utrace, "kt_1")));
> > 	WARN_ON(IS_ERR(t_2 = kthread_run(kt_2, &utrace, "kt_2")));
> >
> > 	return 0;
> > }
> >
> > static void __exit mod_exit(void)
> > {
> > 	kthread_stop(t_1);
> > 	kthread_stop(t_2);
> > }
> >
> > MODULE_LICENSE("GPL");
> > module_init(mod_init);
> > module_exit(mod_exit);



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-07-12 23:34     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 311+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-12 23:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jakub Jelinek, linux-kernel, Paul Mackerras, Dan,
	Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson

On Sat, 2014-07-12 at 22:51 +0200, Oleg Nesterov wrote:
> OK, looks like this is a compiler bug,
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080
> 
> Thanks to Dan who informed me privately.

So yes, there is this compiler bug, which means a bitfield
access can cause a r-m-w access to a neighbouring field. But
in general, I would be wary of bitfields anyway, since accessing
them is never going to be atomic... it's too easy to get things
wrong, and in most cases the benefit is yet to be demonstrated.

In your example, I don't see the point of the bitfield.

Cheers,
Ben.

> On 07/12, Oleg Nesterov wrote:
> >
> > Hello,
> >
> > I am not sure I should ask here, but since Documentation/memory-barriers.txt
> > mentions load/store tearing, perhaps my question is not completely off-topic...
> >
> > I am fighting a mysterious RHEL bug; it can be reproduced on ppc and s390
> > but not on x86. I finally seem to understand the problem, and I even wrote
> > the stupid kernel module below (at the end) to make sure.
> >
> > It triggers the problem immediately: kt_2() sees the wrong value in freeze_stop.
> > (If I turn ->freeze_stop into "long", the problem goes away.)
> >
> > So the question is: is this a gcc bug, or is the code below buggy?
> >
> > If it is buggy, then memory-barriers.txt should probably mention that you
> > have to be careful with bit fields; even ACCESS_ONCE() obviously can't help.
> >
> > Or does this just disclose my ignorance, and you need at least aligned(long)
> > after a bit field to be thread-safe? I thought the compiler should take care
> > and add the necessary alignment if (say) the CPU can't update a single byte/uint.
> >
> > gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC). Asm:
> >
> > 	0000000000000000 <.kt_2>:
> > 	   0:	7c 08 02 a6 	mflr    r0
> > 	   4:	fb 81 ff e0 	std     r28,-32(r1)
> > 	   8:	fb a1 ff e8 	std     r29,-24(r1)
> > 	   c:	fb c1 ff f0 	std     r30,-16(r1)
> > 	  10:	fb e1 ff f8 	std     r31,-8(r1)
> > 	  14:	eb c2 00 00 	ld      r30,0(r2)
> > 	  18:	f8 01 00 10 	std     r0,16(r1)
> > 	  1c:	f8 21 ff 71 	stdu    r1,-144(r1)
> > 	  20:	7c 7d 1b 78 	mr      r29,r3
> > 	  24:	3b e0 00 00 	li      r31,0
> > 	  28:	78 3c 04 64 	rldicr  r28,r1,0,49
> > 	  2c:	3b 9c 00 80 	addi    r28,r28,128
> > 	  30:	48 00 00 2c 	b       5c <.kt_2+0x5c>
> > 	  34:	60 00 00 00 	nop
> > 	  38:	60 00 00 00 	nop
> > 	  3c:	60 00 00 00 	nop
> > 	  40:	93 fd 00 04 	stw     r31,4(r29)
> > 	  44:	e8 9d 00 06 	lwa     r4,4(r29)
> > 	  48:	7f 84 f8 00 	cmpw    cr7,r4,r31
> > 	  4c:	40 de 00 4c 	bne-    cr7,98 <.kt_2+0x98>
> > 	  50:	e8 1c 00 00 	ld      r0,0(r28)
> > 	  54:	78 09 f7 e3 	rldicl. r9,r0,62,63
> > 	  58:	40 c2 00 54 	bne-    ac <.kt_2+0xac>
> > 	  5c:	48 00 00 01 	bl      5c <.kt_2+0x5c>
> > 	  60:	60 00 00 00 	nop
> > 	  64:	3b ff 00 01 	addi    r31,r31,1
> > 	  68:	2f a3 00 00 	cmpdi   cr7,r3,0
> > 	  6c:	7f ff 07 b4 	extsw   r31,r31
> > 	  70:	41 9e ff d0 	beq+    cr7,40 <.kt_2+0x40>
> > 	  74:	38 21 00 90 	addi    r1,r1,144
> > 	  78:	38 60 00 00 	li      r3,0
> > 	  7c:	e8 01 00 10 	ld      r0,16(r1)
> > 	  80:	eb 81 ff e0 	ld      r28,-32(r1)
> > 	  84:	eb a1 ff e8 	ld      r29,-24(r1)
> > 	  88:	eb c1 ff f0 	ld      r30,-16(r1)
> > 	  8c:	eb e1 ff f8 	ld      r31,-8(r1)
> > 	  90:	7c 08 03 a6 	mtlr    r0
> > 	  94:	4e 80 00 20 	blr
> > 	  98:	e8 7e 80 28 	ld      r3,-32728(r30)
> > 	  9c:	7f e5 fb 78 	mr      r5,r31
> > 	  a0:	48 00 00 01 	bl      a0 <.kt_2+0xa0>
> > 	  a4:	60 00 00 00 	nop
> > 	  a8:	4b ff ff a8 	b       50 <.kt_2+0x50>
> > 	  ac:	48 00 00 01 	bl      ac <.kt_2+0xac>
> > 	  b0:	60 00 00 00 	nop
> > 	  b4:	4b ff ff a8 	b       5c <.kt_2+0x5c>
> > 	  b8:	60 00 00 00 	nop
> > 	  bc:	60 00 00 00 	nop
> >
> > 	00000000000000c0 <.kt_1>:
> > 	  c0:	7c 08 02 a6 	mflr    r0
> > 	  c4:	fb 81 ff e0 	std     r28,-32(r1)
> > 	  c8:	fb a1 ff e8 	std     r29,-24(r1)
> > 	  cc:	fb c1 ff f0 	std     r30,-16(r1)
> > 	  d0:	fb e1 ff f8 	std     r31,-8(r1)
> > 	  d4:	eb c2 00 00 	ld      r30,0(r2)
> > 	  d8:	f8 01 00 10 	std     r0,16(r1)
> > 	  dc:	f8 21 ff 71 	stdu    r1,-144(r1)
> > 	  e0:	7c 7d 1b 78 	mr      r29,r3
> > 	  e4:	3b e0 00 00 	li      r31,0
> > 	  e8:	78 3c 04 64 	rldicr  r28,r1,0,49
> > 	  ec:	3b 9c 00 80 	addi    r28,r28,128
> > 	  f0:	48 00 00 38 	b       128 <.kt_1+0x68>
> > 	  f4:	60 00 00 00 	nop
> > 	  f8:	60 00 00 00 	nop
> > 	  fc:	60 00 00 00 	nop
> > 	 100:	e8 1d 00 00 	ld      r0,0(r29)
> > 	 104:	79 20 e8 0e 	rldimi  r0,r9,61,0
> > 	 108:	f8 1d 00 00 	std     r0,0(r29)
> > 	 10c:	80 1d 00 00 	lwz     r0,0(r29)
> > 	 110:	54 00 1f 7e 	rlwinm  r0,r0,3,29,31
> > 	 114:	7f 80 f8 00 	cmpw    cr7,r0,r31
> > 	 118:	40 de 00 6c 	bne-    cr7,184 <.kt_1+0xc4>
> > 	 11c:	e8 1c 00 00 	ld      r0,0(r28)
> > 	 120:	78 09 f7 e3 	rldicl. r9,r0,62,63
> > 	 124:	40 c2 00 70 	bne-    194 <.kt_1+0xd4>
> > 	 128:	48 00 00 01 	bl      128 <.kt_1+0x68>
> > 	 12c:	60 00 00 00 	nop
> > 	 130:	3b ff 00 01 	addi    r31,r31,1
> > 	 134:	2f a3 00 00 	cmpdi   cr7,r3,0
> > 	 138:	7f ff 07 b4 	extsw   r31,r31
> > 	 13c:	2f 1f 00 07 	cmpwi   cr6,r31,7
> > 	 140:	7b e9 07 60 	clrldi  r9,r31,61
> > 	 144:	40 9e 00 1c 	bne-    cr7,160 <.kt_1+0xa0>
> > 	 148:	40 9a ff b8 	bne+    cr6,100 <.kt_1+0x40>
> > 	 14c:	39 20 00 00 	li      r9,0
> > 	 150:	3b e0 00 00 	li      r31,0
> > 	 154:	4b ff ff ac 	b       100 <.kt_1+0x40>
> > 	 158:	60 00 00 00 	nop
> > 	 15c:	60 00 00 00 	nop
> > 	 160:	38 21 00 90 	addi    r1,r1,144
> > 	 164:	38 60 00 00 	li      r3,0
> > 	 168:	e8 01 00 10 	ld      r0,16(r1)
> > 	 16c:	eb 81 ff e0 	ld      r28,-32(r1)
> > 	 170:	eb a1 ff e8 	ld      r29,-24(r1)
> > 	 174:	eb c1 ff f0 	ld      r30,-16(r1)
> > 	 178:	eb e1 ff f8 	ld      r31,-8(r1)
> > 	 17c:	7c 08 03 a6 	mtlr    r0
> > 	 180:	4e 80 00 20 	blr
> > 	 184:	e8 7e 80 30 	ld      r3,-32720(r30)
> > 	 188:	48 00 00 01 	bl      188 <.kt_1+0xc8>
> > 	 18c:	60 00 00 00 	nop
> > 	 190:	4b ff ff 8c 	b       11c <.kt_1+0x5c>
> > 	 194:	48 00 00 01 	bl      194 <.kt_1+0xd4>
> > 	 198:	60 00 00 00 	nop
> > 	 19c:	4b ff ff 8c 	b       128 <.kt_1+0x68>
> >
> > Unfortunately it tells me nothing; I do not know ppc.
> >
> > Oleg.
> >
> > --------------------------------------------------------------------------------
> > #include <linux/module.h>
> > #include <linux/kernel.h>
> > #include <linux/kthread.h>
> >
> > struct utrace {
> > 	unsigned int resume:3;
> > 	int freeze_stop;
> > };
> >
> > static int kt_1(void *arg)
> > {
> > 	struct utrace *u = arg;
> > 	int r = 0;
> >
> > 	while (!kthread_should_stop()) {
> > 		if (++r == 7)
> > 			r = 0;
> >
> > 		u->resume = r;
> > 		barrier();
> > 		if (u->resume != r)
> > 			printk(KERN_CRIT "BUG! bitfield\n");
> >
> > 		if (need_resched())
> > 			schedule();
> > 	}
> >
> > 	return 0;
> > }
> >
> > static int kt_2(void *arg)
> > {
> > 	struct utrace *u = arg;
> > 	int f = 0;
> >
> > 	while (!kthread_should_stop()) {
> > 		u->freeze_stop = ++f;
> > 		barrier();
> > 		if (u->freeze_stop != f)
> > 			printk(KERN_CRIT "BUG! freeze_stop %d != %d\n", u->freeze_stop, f);
> >
> > 		if (need_resched())
> > 			schedule();
> > 	}
> >
> > 	return 0;
> > }
> >
> > static struct task_struct *t_1, *t_2;
> >
> > static struct utrace utrace;
> >
> > static int __init mod_init(void)
> > {
> > 	WARN_ON(IS_ERR(t_1 = kthread_run(kt_1, &utrace, "kt_1")));
> > 	WARN_ON(IS_ERR(t_2 = kthread_run(kt_2, &utrace, "kt_2")));
> >
> > 	return 0;
> > }
> >
> > static void __exit mod_exit(void)
> > {
> > 	kthread_stop(t_1);
> > 	kthread_stop(t_2);
> > }
> >
> > MODULE_LICENSE("GPL");
> > module_init(mod_init);
> > module_exit(mod_exit);


* Re: bit fields && data tearing
  2014-07-12 23:34     ` Benjamin Herrenschmidt
@ 2014-07-13 12:29       ` Oleg Nesterov
  -1 siblings, 0 replies; 311+ messages in thread
From: Oleg Nesterov @ 2014-07-13 12:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jakub Jelinek, Paul E. McKenney, Richard Henderson,
	Miroslav Franc, Paul Mackerras, linuxppc-dev, linux-kernel, Dan

On 07/13, Benjamin Herrenschmidt wrote:
>
> On Sat, 2014-07-12 at 22:51 +0200, Oleg Nesterov wrote:
> > OK, looks like this is a compiler bug,
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080
> >
> > Thanks to Dan who informed me privately.
>
> > So yes, there is this compiler bug which means a bitfield
> > access can cause a r-m-w access to a neighbouring field

Thanks. So I can forward this all back to bugzilla.

> but
> in general, I would be wary of bitfields anyway since accessing
> them isn't going to be atomic anyway... it's too easy to get things
> wrong and in most cases the benefit is yet to be demonstrated.

Sure, bit fields should be used with care. But the same arguments apply
to bitmasks: they are often used without "atomic" set/clear_bit.

> In your example, I don't see the point of the bitfield.

This is just a test-case. The real code has more adjacent bit fields; only
the tracee can modify them, and only the debugger can change ->freeze_stop.

Thanks,

Oleg.




* Re: bit fields && data tearing
  2014-07-12 23:34     ` Benjamin Herrenschmidt
@ 2014-07-13 13:15       ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-07-13 13:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Oleg Nesterov
  Cc: Jakub Jelinek, Paul E. McKenney, Richard Henderson,
	Miroslav Franc, Paul Mackerras, linuxppc-dev, linux-kernel, Dan

On 07/12/2014 07:34 PM, Benjamin Herrenschmidt wrote:
> On Sat, 2014-07-12 at 22:51 +0200, Oleg Nesterov wrote:
>> OK, looks like this is a compiler bug,
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080
>>
>> Thanks to Dan who informed me privately.
>
> So yes, there is this compiler bug which means a bitfield
> access can cause a r-m-w access to a neighbouring field but
> in general, I would be wary of bitfields anyway since accessing
> them isn't going to be atomic anyway... it's too easy to get things
> wrong and in most cases the benefit is yet to be demonstrated.

I'm not sure I understand your point here, Ben.

Suppose that two different spinlocks are used independently to
protect r-m-w access to adjacent data. In Oleg's example,
suppose spinlock 1 is used for access to the bitfield and
spinlock 2 is used for access to freeze_stop.

What would prevent an accidental write to freeze_stop from the
kt_1 thread?

Regards,
Peter Hurley

> In your example, I don't see the point of the bitfield.
>
> Cheers,
> Ben.
>
>> On 07/12, Oleg Nesterov wrote:
>>>
>>> Hello,
>>>
>>> I am not sure I should ask here, but since Documentation/memory-barriers.txt
>>> mentions load/store tearing, perhaps my question is not completely off-topic...
>>>
>>> I am fighting a mysterious RHEL bug; it can be reproduced on ppc and s390
>>> but not on x86. I finally seem to understand the problem, and I even wrote
>>> the stupid kernel module below (at the end) to make sure.
>>>
>>> It triggers the problem immediately: kt_2() sees the wrong value in freeze_stop.
>>> (If I turn ->freeze_stop into "long", the problem goes away.)
>>>
>>> So the question is: is this a gcc bug, or is the code below buggy?
>>>
>>> If it is buggy, then memory-barriers.txt should probably mention that you
>>> have to be careful with bit fields; even ACCESS_ONCE() obviously can't help.
>>>
>>> Or does this just disclose my ignorance, and you need at least aligned(long)
>>> after a bit field to be thread-safe? I thought the compiler should take care
>>> and add the necessary alignment if (say) the CPU can't update a single byte/uint.
>>>
>>> gcc version 4.4.7 20120313 (Red Hat 4.4.7-9) (GCC). Asm:
>>>
>>> 	0000000000000000 <.kt_2>:
>>> 	   0:	7c 08 02 a6 	mflr    r0
>>> 	   4:	fb 81 ff e0 	std     r28,-32(r1)
>>> 	   8:	fb a1 ff e8 	std     r29,-24(r1)
>>> 	   c:	fb c1 ff f0 	std     r30,-16(r1)
>>> 	  10:	fb e1 ff f8 	std     r31,-8(r1)
>>> 	  14:	eb c2 00 00 	ld      r30,0(r2)
>>> 	  18:	f8 01 00 10 	std     r0,16(r1)
>>> 	  1c:	f8 21 ff 71 	stdu    r1,-144(r1)
>>> 	  20:	7c 7d 1b 78 	mr      r29,r3
>>> 	  24:	3b e0 00 00 	li      r31,0
>>> 	  28:	78 3c 04 64 	rldicr  r28,r1,0,49
>>> 	  2c:	3b 9c 00 80 	addi    r28,r28,128
>>> 	  30:	48 00 00 2c 	b       5c <.kt_2+0x5c>
>>> 	  34:	60 00 00 00 	nop
>>> 	  38:	60 00 00 00 	nop
>>> 	  3c:	60 00 00 00 	nop
>>> 	  40:	93 fd 00 04 	stw     r31,4(r29)
>>> 	  44:	e8 9d 00 06 	lwa     r4,4(r29)
>>> 	  48:	7f 84 f8 00 	cmpw    cr7,r4,r31
>>> 	  4c:	40 de 00 4c 	bne-    cr7,98 <.kt_2+0x98>
>>> 	  50:	e8 1c 00 00 	ld      r0,0(r28)
>>> 	  54:	78 09 f7 e3 	rldicl. r9,r0,62,63
>>> 	  58:	40 c2 00 54 	bne-    ac <.kt_2+0xac>
>>> 	  5c:	48 00 00 01 	bl      5c <.kt_2+0x5c>
>>> 	  60:	60 00 00 00 	nop
>>> 	  64:	3b ff 00 01 	addi    r31,r31,1
>>> 	  68:	2f a3 00 00 	cmpdi   cr7,r3,0
>>> 	  6c:	7f ff 07 b4 	extsw   r31,r31
>>> 	  70:	41 9e ff d0 	beq+    cr7,40 <.kt_2+0x40>
>>> 	  74:	38 21 00 90 	addi    r1,r1,144
>>> 	  78:	38 60 00 00 	li      r3,0
>>> 	  7c:	e8 01 00 10 	ld      r0,16(r1)
>>> 	  80:	eb 81 ff e0 	ld      r28,-32(r1)
>>> 	  84:	eb a1 ff e8 	ld      r29,-24(r1)
>>> 	  88:	eb c1 ff f0 	ld      r30,-16(r1)
>>> 	  8c:	eb e1 ff f8 	ld      r31,-8(r1)
>>> 	  90:	7c 08 03 a6 	mtlr    r0
>>> 	  94:	4e 80 00 20 	blr
>>> 	  98:	e8 7e 80 28 	ld      r3,-32728(r30)
>>> 	  9c:	7f e5 fb 78 	mr      r5,r31
>>> 	  a0:	48 00 00 01 	bl      a0 <.kt_2+0xa0>
>>> 	  a4:	60 00 00 00 	nop
>>> 	  a8:	4b ff ff a8 	b       50 <.kt_2+0x50>
>>> 	  ac:	48 00 00 01 	bl      ac <.kt_2+0xac>
>>> 	  b0:	60 00 00 00 	nop
>>> 	  b4:	4b ff ff a8 	b       5c <.kt_2+0x5c>
>>> 	  b8:	60 00 00 00 	nop
>>> 	  bc:	60 00 00 00 	nop
>>>
>>> 	00000000000000c0 <.kt_1>:
>>> 	  c0:	7c 08 02 a6 	mflr    r0
>>> 	  c4:	fb 81 ff e0 	std     r28,-32(r1)
>>> 	  c8:	fb a1 ff e8 	std     r29,-24(r1)
>>> 	  cc:	fb c1 ff f0 	std     r30,-16(r1)
>>> 	  d0:	fb e1 ff f8 	std     r31,-8(r1)
>>> 	  d4:	eb c2 00 00 	ld      r30,0(r2)
>>> 	  d8:	f8 01 00 10 	std     r0,16(r1)
>>> 	  dc:	f8 21 ff 71 	stdu    r1,-144(r1)
>>> 	  e0:	7c 7d 1b 78 	mr      r29,r3
>>> 	  e4:	3b e0 00 00 	li      r31,0
>>> 	  e8:	78 3c 04 64 	rldicr  r28,r1,0,49
>>> 	  ec:	3b 9c 00 80 	addi    r28,r28,128
>>> 	  f0:	48 00 00 38 	b       128 <.kt_1+0x68>
>>> 	  f4:	60 00 00 00 	nop
>>> 	  f8:	60 00 00 00 	nop
>>> 	  fc:	60 00 00 00 	nop
>>> 	 100:	e8 1d 00 00 	ld      r0,0(r29)
>>> 	 104:	79 20 e8 0e 	rldimi  r0,r9,61,0
>>> 	 108:	f8 1d 00 00 	std     r0,0(r29)
>>> 	 10c:	80 1d 00 00 	lwz     r0,0(r29)
>>> 	 110:	54 00 1f 7e 	rlwinm  r0,r0,3,29,31
>>> 	 114:	7f 80 f8 00 	cmpw    cr7,r0,r31
>>> 	 118:	40 de 00 6c 	bne-    cr7,184 <.kt_1+0xc4>
>>> 	 11c:	e8 1c 00 00 	ld      r0,0(r28)
>>> 	 120:	78 09 f7 e3 	rldicl. r9,r0,62,63
>>> 	 124:	40 c2 00 70 	bne-    194 <.kt_1+0xd4>
>>> 	 128:	48 00 00 01 	bl      128 <.kt_1+0x68>
>>> 	 12c:	60 00 00 00 	nop
>>> 	 130:	3b ff 00 01 	addi    r31,r31,1
>>> 	 134:	2f a3 00 00 	cmpdi   cr7,r3,0
>>> 	 138:	7f ff 07 b4 	extsw   r31,r31
>>> 	 13c:	2f 1f 00 07 	cmpwi   cr6,r31,7
>>> 	 140:	7b e9 07 60 	clrldi  r9,r31,61
>>> 	 144:	40 9e 00 1c 	bne-    cr7,160 <.kt_1+0xa0>
>>> 	 148:	40 9a ff b8 	bne+    cr6,100 <.kt_1+0x40>
>>> 	 14c:	39 20 00 00 	li      r9,0
>>> 	 150:	3b e0 00 00 	li      r31,0
>>> 	 154:	4b ff ff ac 	b       100 <.kt_1+0x40>
>>> 	 158:	60 00 00 00 	nop
>>> 	 15c:	60 00 00 00 	nop
>>> 	 160:	38 21 00 90 	addi    r1,r1,144
>>> 	 164:	38 60 00 00 	li      r3,0
>>> 	 168:	e8 01 00 10 	ld      r0,16(r1)
>>> 	 16c:	eb 81 ff e0 	ld      r28,-32(r1)
>>> 	 170:	eb a1 ff e8 	ld      r29,-24(r1)
>>> 	 174:	eb c1 ff f0 	ld      r30,-16(r1)
>>> 	 178:	eb e1 ff f8 	ld      r31,-8(r1)
>>> 	 17c:	7c 08 03 a6 	mtlr    r0
>>> 	 180:	4e 80 00 20 	blr
>>> 	 184:	e8 7e 80 30 	ld      r3,-32720(r30)
>>> 	 188:	48 00 00 01 	bl      188 <.kt_1+0xc8>
>>> 	 18c:	60 00 00 00 	nop
>>> 	 190:	4b ff ff 8c 	b       11c <.kt_1+0x5c>
>>> 	 194:	48 00 00 01 	bl      194 <.kt_1+0xd4>
>>> 	 198:	60 00 00 00 	nop
>>> 	 19c:	4b ff ff 8c 	b       128 <.kt_1+0x68>
>>>
>>> Unfortunately it tells me nothing; I do not know ppc.
>>>
>>> Oleg.
>>>
>>> --------------------------------------------------------------------------------
>>> #include <linux/module.h>
>>> #include <linux/kernel.h>
>>> #include <linux/kthread.h>
>>>
>>> struct utrace {
>>> 	unsigned int resume:3;
>>> 	int freeze_stop;
>>> };
>>>
>>> static int kt_1(void *arg)
>>> {
>>> 	struct utrace *u = arg;
>>> 	int r = 0;
>>>
>>> 	while (!kthread_should_stop()) {
>>> 		if (++r == 7)
>>> 			r = 0;
>>>
>>> 		u->resume = r;
>>> 		barrier();
>>> 		if (u->resume != r)
>>> 			printk(KERN_CRIT "BUG! bitfield\n");
>>>
>>> 		if (need_resched())
>>> 			schedule();
>>> 	}
>>>
>>> 	return 0;
>>> }
>>>
>>> static int kt_2(void *arg)
>>> {
>>> 	struct utrace *u = arg;
>>> 	int f = 0;
>>>
>>> 	while (!kthread_should_stop()) {
>>> 		u->freeze_stop = ++f;
>>> 		barrier();
>>> 		if (u->freeze_stop != f)
>>> 			printk(KERN_CRIT "BUG! freeze_stop %d != %d\n", u->freeze_stop, f);
>>>
>>> 		if (need_resched())
>>> 			schedule();
>>> 	}
>>>
>>> 	return 0;
>>> }
>>>
>>> static struct task_struct *t_1, *t_2;
>>>
>>> static struct utrace utrace;
>>>
>>> static int __init mod_init(void)
>>> {
>>> 	WARN_ON(IS_ERR(t_1 = kthread_run(kt_1, &utrace, "kt_1")));
>>> 	WARN_ON(IS_ERR(t_2 = kthread_run(kt_2, &utrace, "kt_2")));
>>>
>>> 	return 0;
>>> }
>>>
>>> static void __exit mod_exit(void)
>>> {
>>> 	kthread_stop(t_1);
>>> 	kthread_stop(t_2);
>>> }
>>>
>>> MODULE_LICENSE("GPL");
>>> module_init(mod_init);
>>> module_exit(mod_exit);
>
>
>




* Re: bit fields && data tearing
  2014-07-13 13:15       ` Peter Hurley
@ 2014-07-13 22:25         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 311+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-13 22:25 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Oleg Nesterov, Jakub Jelinek, Paul E. McKenney,
	Richard Henderson, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, Dan

On Sun, 2014-07-13 at 09:15 -0400, Peter Hurley wrote:
> 
> I'm not sure I understand your point here, Ben.
> 
> Suppose that two different spinlocks are used independently to
> protect r-m-w access to adjacent data. In Oleg's example,
> suppose spinlock 1 is used for access to the bitfield and
> spinlock 2 is used for access to freeze_stop.
> 
> What would prevent an accidental write to freeze_stop from the
> kt_1 thread?

My point was to be wary of bitfields in general because access
to them is always R-M-W, never atomic, and that seems to escape
people regularly :-) (Among other problems such as endianness etc...)

As for Oleg's example, it *should* have worked because the bitfield and
the adjacent freeze_stop should have been accessed using load/stores
that don't actually overlap, but the compiler bug causes the bitfield
access to not properly use the basic type of the bitfield, but escalate
to a full 64-bit R-M-W instead, thus incorrectly R-M-W'ing the field
next door.

Cheers,
Ben.





^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-07-13 22:25         ` Benjamin Herrenschmidt
@ 2014-07-15 13:54           ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-07-15 13:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Oleg Nesterov
  Cc: Jakub Jelinek, Paul E. McKenney, Richard Henderson,
	Miroslav Franc, Paul Mackerras, linuxppc-dev, linux-kernel, Dan

On 07/13/2014 06:25 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2014-07-13 at 09:15 -0400, Peter Hurley wrote:
>>
>> I'm not sure I understand your point here, Ben.
>>
>> Suppose that two different spinlocks are used independently to
>> protect r-m-w access to adjacent data. In Oleg's example,
>> suppose spinlock 1 is used for access to the bitfield and
>> spinlock 2 is used for access to freeze_stop.
>>
>> What would prevent an accidental write to freeze_stop from the
>> kt_1 thread?
>
> My point was to be wary of bitfields in general because access
> to them is always R-M-W, never atomic, and that seems to escape
> people regularly :-) (Among other problems such as endianness etc...)
>
> As for Oleg's example, it *should* have worked because the bitfield and
> the adjacent freeze_stop should have been accessed using load/stores
> that don't actually overlap, but the compiler bug causes the bitfield
> access to not properly use the basic type of the bitfield, but escalate
> to a full 64-bit R-M-W instead, thus incorrectly R-M-W'ing the field
> next door.

Yeah, ok, so just a generic heads-up about non-atomicity of bitfields,
and not something specific to Oleg's example. Thanks.

Jonathan Corbet wrote a LWN article about this back in 2012:
http://lwn.net/Articles/478657/

I guess it's fixed in gcc 4.8, but too bad there's not a workaround for
earlier compilers (akin to -fstrict_volatile_bitfields without requiring
the volatile keyword).

Regards,
Peter Hurley




^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-07-15 13:54           ` Peter Hurley
@ 2014-07-15 15:02             ` Richard Henderson
  -1 siblings, 0 replies; 311+ messages in thread
From: Richard Henderson @ 2014-07-15 15:02 UTC (permalink / raw)
  To: Peter Hurley, Benjamin Herrenschmidt, Oleg Nesterov
  Cc: Jakub Jelinek, Paul E. McKenney, Miroslav Franc, Paul Mackerras,
	linuxppc-dev, linux-kernel, Dan

On 07/15/2014 06:54 AM, Peter Hurley wrote:
> 
> Jonathan Corbet wrote a LWN article about this back in 2012:
> http://lwn.net/Articles/478657/
> 
> I guess it's fixed in gcc 4.8, but too bad there's not a workaround for
> earlier compilers (akin to -fstrict_volatile_bitfields without requiring
> the volatile keyword)

From the gcc PR, it looks like the patch was backported to 4.7.
But we didn't fix it in versions earlier than that.


r~

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-07-12 18:13 ` Oleg Nesterov
  (?)
@ 2014-09-03 22:51   ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-03 22:51 UTC (permalink / raw)
  To: Jakub Jelinek, Paul E. McKenney, Richard Henderson
  Cc: Oleg Nesterov, Benjamin Herrenschmidt, Miroslav Franc,
	Paul Mackerras, linuxppc-dev, linux-kernel, linux-arch,
	Tony Luck, linux-ia64

[ +cc linux-arch, Tony Luck, 

On 07/12/2014 02:13 PM, Oleg Nesterov wrote:
> Hello,
> 
> I am not sure I should ask here, but since Documentation/memory-barriers.txt
> mentions load/store tearing perhaps my question is not completely off-topic...
> 
> I am fighting with mysterious RHEL bug, it can be reproduced on ppc and s390
> but not on x86. Finally I seem to understand the problem, and I even wrote the
> stupid kernel module to ensure, see it below at the end.
> 
> It triggers the problem immediately, kt_2() sees the wrong value in freeze_stop.
> (If I turn ->freeze_stop into "long", the problem goes away).
> 
> So the question is: is this gcc bug or the code below is buggy?
> 
> If it is buggy, then probably memory-barriers.txt could mention that you should
> be careful with bit fields, even ACCESS_ONCE() obviously can't help.
> 
> Or this just discloses my ignorance and you need at least aligned(long) after a
> bit field to be thread-safe ? I thought that compiler should take care and add
> the necessary alignment if (say) CPU can't update a single byte/uint.

Apologies for hijacking this thread but I need to extend this discussion
somewhat regarding what a compiler might do with adjacent fields in a structure.

The tty subsystem defines a large aggregate structure, struct tty_struct.
Importantly, several different locks apply to different fields within that
structure; ie., a specific spinlock will be claimed before updating or accessing
certain fields while a different spinlock will be claimed before updating or
accessing certain _adjacent_ fields.

What is necessary and sufficient to prevent accidental false-sharing?
The patch below was flagged as insufficient on ia64, and possibly ARM.

Regards,
Peter Hurley

--- >% ---
Subject: [PATCH 21/26] tty: Convert tty_struct bitfield to bools

The stopped, hw_stopped, flow_stopped and packet bits are smp-unsafe
and interrupt-unsafe. For example,

CPU 0                         | CPU 1
                              |
tty->flow_stopped = 1         | tty->hw_stopped = 0

One of these updates will be corrupted, as the bitwise operation
on the bitfield is non-atomic.

Ensure each flag has a separate memory location, so concurrent
updates do not corrupt orthogonal states.

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
---
 include/linux/tty.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/tty.h b/include/linux/tty.h
index 1c3316a..7cf61cb 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -261,7 +261,10 @@ struct tty_struct {
 	unsigned long flags;
 	int count;
 	struct winsize winsize;		/* winsize_mutex */
-	unsigned char stopped:1, hw_stopped:1, flow_stopped:1, packet:1;
+	bool stopped;
+	bool hw_stopped;
+	bool flow_stopped;
+	bool packet;
 	unsigned char ctrl_status;	/* ctrl_lock */
 	unsigned int receive_room;	/* Bytes free for queue */
 	int flow_change;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-03 22:51   ` Peter Hurley
  (?)
  (?)
@ 2014-09-03 23:11     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 311+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-03 23:11 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Jakub Jelinek, Paul E. McKenney, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Wed, 2014-09-03 at 18:51 -0400, Peter Hurley wrote:

> Apologies for hijacking this thread but I need to extend this discussion
> somewhat regarding what a compiler might do with adjacent fields in a structure.
> 
> The tty subsystem defines a large aggregate structure, struct tty_struct.
> Importantly, several different locks apply to different fields within that
> structure; ie., a specific spinlock will be claimed before updating or accessing
> certain fields while a different spinlock will be claimed before updating or
> accessing certain _adjacent_ fields.
> 
> What is necessary and sufficient to prevent accidental false-sharing?
> The patch below was flagged as insufficient on ia64, and possibly ARM.

We expect native aligned scalar types to be accessed atomically (the
read/modify/write of a larger quantity that gcc does on some bitfield
cases has been flagged as a gcc bug, but shouldn't happen on normal
scalar types).

I am not 100% certain about "bool" here; I assume it's treated as a normal
scalar and thus accessed atomically, but if unsure, you can always use int.

Another option is to use the atomic bitops and make these bits in a
bitmask but that is probably unnecessary if you have locks already.

Cheers,
Ben.


> Regards,
> Peter Hurley
> 
> --- >% ---
> Subject: [PATCH 21/26] tty: Convert tty_struct bitfield to bools
> 
> The stopped, hw_stopped, flow_stopped and packet bits are smp-unsafe
> and interrupt-unsafe. For example,
> 
> CPU 0                         | CPU 1
>                               |
> tty->flow_stopped = 1         | tty->hw_stopped = 0
> 
> One of these updates will be corrupted, as the bitwise operation
> on the bitfield is non-atomic.
> 
> Ensure each flag has a separate memory location, so concurrent
> updates do not corrupt orthogonal states.
> 
> Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
> ---
>  include/linux/tty.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/tty.h b/include/linux/tty.h
> index 1c3316a..7cf61cb 100644
> --- a/include/linux/tty.h
> +++ b/include/linux/tty.h
> @@ -261,7 +261,10 @@ struct tty_struct {
>  	unsigned long flags;
>  	int count;
>  	struct winsize winsize;		/* winsize_mutex */
> -	unsigned char stopped:1, hw_stopped:1, flow_stopped:1, packet:1;
> +	bool stopped;
> +	bool hw_stopped;
> +	bool flow_stopped;
> +	bool packet;
>  	unsigned char ctrl_status;	/* ctrl_lock */
>  	unsigned int receive_room;	/* Bytes free for queue */
>  	int flow_change;



^ permalink raw reply	[flat|nested] 311+ messages in thread

* RE: bit fields && data tearing
  2014-09-03 23:11     ` Benjamin Herrenschmidt
  (?)
  (?)
@ 2014-09-04  8:43       ` David Laight
  -1 siblings, 0 replies; 311+ messages in thread
From: David Laight @ 2014-09-04  8:43 UTC (permalink / raw)
  To: 'Benjamin Herrenschmidt', Peter Hurley
  Cc: Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Oleg Nesterov,
	linux-kernel, Paul Mackerras, Paul E. McKenney, linuxppc-dev,
	Miroslav Franc, Richard Henderson

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1599 bytes --]

From:  Benjamin Herrenschmidt
> On Wed, 2014-09-03 at 18:51 -0400, Peter Hurley wrote:
> 
> > Apologies for hijacking this thread but I need to extend this discussion
> > somewhat regarding what a compiler might do with adjacent fields in a structure.
> >
> > The tty subsystem defines a large aggregate structure, struct tty_struct.
> > Importantly, several different locks apply to different fields within that
> > structure; ie., a specific spinlock will be claimed before updating or accessing
> > certain fields while a different spinlock will be claimed before updating or
> > accessing certain _adjacent_ fields.
> >
> > What is necessary and sufficient to prevent accidental false-sharing?
> > The patch below was flagged as insufficient on ia64, and possibly ARM.
> 
> We expect native aligned scalar types to be accessed atomically (the
> read/modify/write of a larger quantity that gcc does on some bitfield
> cases has been flagged as a gcc bug, but shouldn't happen on normal
> scalar types).

That isn't true on all architectures for items smaller than a machine word.
At least one has to do rmw for byte accesses.

	David

> I am not 100% certain of "bool" here, I assume it's treated as a normal
> scalar and thus atomic but if unsure, you can always use int.
> 
> Another option is to use the atomic bitops and make these bits in a
> bitmask but that is probably unnecessary if you have locks already.
> 
> Cheers,
> Ben.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-03 23:11     ` Benjamin Herrenschmidt
  (?)
  (?)
@ 2014-09-04  8:57       ` Mikael Pettersson
  -1 siblings, 0 replies; 311+ messages in thread
From: Mikael Pettersson @ 2014-09-04  8:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Peter Hurley, Jakub Jelinek, Paul E. McKenney, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

Benjamin Herrenschmidt writes:
 > On Wed, 2014-09-03 at 18:51 -0400, Peter Hurley wrote:
 > 
 > > Apologies for hijacking this thread but I need to extend this discussion
 > > somewhat regarding what a compiler might do with adjacent fields in a structure.
 > > 
 > > The tty subsystem defines a large aggregate structure, struct tty_struct.
 > > Importantly, several different locks apply to different fields within that
 > > structure; ie., a specific spinlock will be claimed before updating or accessing
 > > certain fields while a different spinlock will be claimed before updating or
 > > accessing certain _adjacent_ fields.
 > > 
 > > What is necessary and sufficient to prevent accidental false-sharing?
 > > The patch below was flagged as insufficient on ia64, and possibly ARM.
 > 
 > We expect native aligned scalar types to be accessed atomically (the
 > read/modify/write of a larger quantity that gcc does on some bitfield
 > cases has been flagged as a gcc bug, but shouldn't happen on normal
 > scalar types).
 > 
 > I am not 100% certain of "bool" here, I assume it's treated as a normal
 > scalar and thus atomic but if unsure, you can always use int.

Please use an aligned int or long.  Some machines cannot do atomic
accesses on sub-int/long quantities, so 'bool' may cause unexpected
rmw cycles on adjacent fields.

/Mikael

 > 
 > Another option is to use the atomic bitops and make these bits in a
 > bitmask but that is probably unnecessary if you have locks already.
 > 
 > Cheers,
 > Ben.
 > 
 > 
 > > Regards,
 > > Peter Hurley
 > > 
 > > --- >% ---
 > > Subject: [PATCH 21/26] tty: Convert tty_struct bitfield to bools
 > > 
 > > The stopped, hw_stopped, flow_stopped and packet bits are smp-unsafe
 > > and interrupt-unsafe. For example,
 > > 
 > > CPU 0                         | CPU 1
 > >                               |
 > > tty->flow_stopped = 1         | tty->hw_stopped = 0
 > > 
 > > One of these updates will be corrupted, as the bitwise operation
 > > on the bitfield is non-atomic.
 > > 
 > > Ensure each flag has a separate memory location, so concurrent
 > > updates do not corrupt orthogonal states.
 > > 
 > > Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
 > > ---
 > >  include/linux/tty.h | 5 ++++-
 > >  1 file changed, 4 insertions(+), 1 deletion(-)
 > > 
 > > diff --git a/include/linux/tty.h b/include/linux/tty.h
 > > index 1c3316a..7cf61cb 100644
 > > --- a/include/linux/tty.h
 > > +++ b/include/linux/tty.h
 > > @@ -261,7 +261,10 @@ struct tty_struct {
 > >  	unsigned long flags;
 > >  	int count;
 > >  	struct winsize winsize;		/* winsize_mutex */
 > > -	unsigned char stopped:1, hw_stopped:1, flow_stopped:1, packet:1;
 > > +	bool stopped;
 > > +	bool hw_stopped;
 > > +	bool flow_stopped;
 > > +	bool packet;
 > >  	unsigned char ctrl_status;	/* ctrl_lock */
 > >  	unsigned int receive_room;	/* Bytes free for queue */
 > >  	int flow_change;
 > 
 > 
 > --
 > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
 > the body of a message to majordomo@vger.kernel.org
 > More majordomo info at  http://vger.kernel.org/majordomo-info.html
 > Please read the FAQ at  http://www.tux.org/lkml/

-- 

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-04  8:57       ` Mikael Pettersson
  (?)
@ 2014-09-04  9:09         ` Jakub Jelinek
  -1 siblings, 0 replies; 311+ messages in thread
From: Jakub Jelinek @ 2014-09-04  9:09 UTC (permalink / raw)
  To: Mikael Pettersson
  Cc: Benjamin Herrenschmidt, Peter Hurley, Paul E. McKenney,
	Richard Henderson, Oleg Nesterov, Miroslav Franc, Paul Mackerras,
	linuxppc-dev, linux-kernel, linux-arch, Tony Luck, linux-ia64

On Thu, Sep 04, 2014 at 10:57:40AM +0200, Mikael Pettersson wrote:
> Benjamin Herrenschmidt writes:
>  > On Wed, 2014-09-03 at 18:51 -0400, Peter Hurley wrote:
>  > 
>  > > Apologies for hijacking this thread but I need to extend this discussion
>  > > somewhat regarding what a compiler might do with adjacent fields in a structure.
>  > > 
>  > > The tty subsystem defines a large aggregate structure, struct tty_struct.
>  > > Importantly, several different locks apply to different fields within that
>  > > structure; ie., a specific spinlock will be claimed before updating or accessing
>  > > certain fields while a different spinlock will be claimed before updating or
>  > > accessing certain _adjacent_ fields.
>  > > 
>  > > What is necessary and sufficient to prevent accidental false-sharing?
>  > > The patch below was flagged as insufficient on ia64, and possibly ARM.
>  > 
>  > We expect native aligned scalar types to be accessed atomically (the
>  > read/modify/write of a larger quantity that gcc does on some bitfield
>  > cases has been flagged as a gcc bug, but shouldn't happen on normal
>  > scalar types).
>  > 
>  > I am not 100% certain of "bool" here, I assume it's treated as a normal
>  > scalar and thus atomic but if unsure, you can always use int.
> 
> Please use an aligned int or long.  Some machines cannot do atomic
> accesses on sub-int/long quantities, so 'bool' may cause unexpected
> rmw cycles on adjacent fields.

Yeah, at least pre-EV56 Alpha performs rmw cycles on char/short accesses
and thus those are not atomic.

	Jakub

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-04  8:43       ` David Laight
  (?)
@ 2014-09-04  9:52         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 311+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-04  9:52 UTC (permalink / raw)
  To: David Laight
  Cc: Peter Hurley, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, Paul E. McKenney,
	linuxppc-dev, Miroslav Franc, Richard Henderson

On Thu, 2014-09-04 at 08:43 +0000, David Laight wrote:
> From:  Benjamin Herrenschmidt
> > On Wed, 2014-09-03 at 18:51 -0400, Peter Hurley wrote:
> > 
> > > Apologies for hijacking this thread but I need to extend this discussion
> > > somewhat regarding what a compiler might do with adjacent fields in a structure.
> > >
> > > The tty subsystem defines a large aggregate structure, struct tty_struct.
> > > Importantly, several different locks apply to different fields within that
> > > structure; ie., a specific spinlock will be claimed before updating or accessing
> > > certain fields while a different spinlock will be claimed before updating or
> > > accessing certain _adjacent_ fields.
> > >
> > > What is necessary and sufficient to prevent accidental false-sharing?
> > > The patch below was flagged as insufficient on ia64, and possibly ARM.
> > 
> > We expect native aligned scalar types to be accessed atomically (the
> > read/modify/write of a larger quantity that gcc does on some bitfield
> > cases has been flagged as a gcc bug, but shouldn't happen on normal
> > scalar types).
> 
> That isn't true on all architectures for items smaller than a machine word.
> At least one has to do rmw for byte accesses.

Yeah correct, alpha and bytes right? Is there any other? That's why I
suggested int.

> 	David
> 
> > I am not 100% certain of "bool" here, I assume it's treated as a normal
> > scalar and thus atomic but if unsure, you can always use int.
> > 
> > Another option is to use the atomic bitops and make these bits in a
> > bitmask but that is probably unnecessary if you have locks already.
> > 
> > Cheers,
> > Ben.



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-04  9:52         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 311+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-04  9:52 UTC (permalink / raw)
  To: David Laight
  Cc: Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Peter Hurley,
	Oleg Nesterov, linux-kernel, Paul Mackerras, Paul E. McKenney,
	linuxppc-dev, Miroslav Franc, Richard Henderson

On Thu, 2014-09-04 at 08:43 +0000, David Laight wrote:
> From:  Benjamin Herrenschmidt
> > On Wed, 2014-09-03 at 18:51 -0400, Peter Hurley wrote:
> > 
> > > Apologies for hijacking this thread but I need to extend this discussion
> > > somewhat regarding what a compiler might do with adjacent fields in a structure.
> > >
> > > The tty subsystem defines a large aggregate structure, struct tty_struct.
> > > Importantly, several different locks apply to different fields within that
> > > structure; ie., a specific spinlock will be claimed before updating or accessing
> > > certain fields while a different spinlock will be claimed before updating or
> > > accessing certain _adjacent_ fields.
> > >
> > > What is necessary and sufficient to prevent accidental false-sharing?
> > > The patch below was flagged as insufficient on ia64, and possibly ARM.
> > 
> > We expect native aligned scalar types to be accessed atomically (the
> > read/modify/write of a larger quantity that gcc does on some bitfield
> > cases has been flagged as a gcc bug, but shouldn't happen on normal
> > scalar types).
> 
> That isn't true on all architectures for items smaller than a machine word.
> At least one has to do rmw for byte accesses.

Yeah correct, alpha and bytes right ? Is there any other ? That's why I
suggested int.

> 	David
> 
> > I am not 100% certain of "bool" here, I assume it's treated as a normal
> > scalar and thus atomic but if unsure, you can always use int.
> > 
> > Another option is to use the atomic bitops and make these bits in a
> > bitmask but that is probably unnecessary if you have locks already.
> > 
> > Cheers,
> > Ben.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-04  9:09         ` Jakub Jelinek
  (?)
@ 2014-09-04 12:24           ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-04 12:24 UTC (permalink / raw)
  To: Jakub Jelinek, Mikael Pettersson
  Cc: Benjamin Herrenschmidt, Paul E. McKenney, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On 09/04/2014 05:09 AM, Jakub Jelinek wrote:
> On Thu, Sep 04, 2014 at 10:57:40AM +0200, Mikael Pettersson wrote:
>> Benjamin Herrenschmidt writes:
>>  > On Wed, 2014-09-03 at 18:51 -0400, Peter Hurley wrote:
>>  > 
>>  > > Apologies for hijacking this thread but I need to extend this discussion
>>  > > somewhat regarding what a compiler might do with adjacent fields in a structure.
>>  > > 
>>  > > The tty subsystem defines a large aggregate structure, struct tty_struct.
>>  > > Importantly, several different locks apply to different fields within that
>>  > > structure; ie., a specific spinlock will be claimed before updating or accessing
>>  > > certain fields while a different spinlock will be claimed before updating or
>>  > > accessing certain _adjacent_ fields.
>>  > > 
>>  > > What is necessary and sufficient to prevent accidental false-sharing?
>>  > > The patch below was flagged as insufficient on ia64, and possibly ARM.
>>  > 
>>  > We expect native aligned scalar types to be accessed atomically (the
>>  > read/modify/write of a larger quantity that gcc does on some bitfield
>>  > cases has been flagged as a gcc bug, but shouldn't happen on normal
>>  > scalar types).
>>  > 
>>  > I am not 100% certain of "bool" here, I assume it's treated as a normal
>>  > scalar and thus atomic but if unsure, you can always use int.
>>
>> Please use an aligned int or long.  Some machines cannot do atomic
>> accesses on sub-int/long quantities, so 'bool' may cause unexpected
>> rmw cycles on adjacent fields.
> 
> Yeah, at least pre-EV56 Alpha performs rmw cycles on char/short accesses
> and thus those are not atomic.

Ok, thanks.

And I just confirmed with the Alpha cross-compiler that the fields are
not 'padded out' if volatile either.

Do any arches consider this an 'optimization'?  I ask because this kind of
accidental adjacency sharing may be common. Even RCU has a char field,
rcu_read_unlock_special, in the middle of the task_struct; luckily the
adjacent field is a list_head.

Besides updating the documentation, it may make sense to do something
arch-specific. Just bumping out storage on arches that don't need it
seems wasteful, as does generating bus locks on arches that don't need it.
Unfortunately, the code churn looks unavoidable.

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 311+ messages in thread
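[Editorial sketch: the adjacent-field hazard Peter describes can be made concrete. The function below hand-writes the word-sized read-modify-write that a CPU without byte stores (e.g. pre-EV56 Alpha) must emit for a one-byte store; the interleaving is simulated inline in a single thread, so this is an illustration of the lost-update mechanism, not a real race reproducer.]

```c
#include <stdint.h>

/* Mimic the code a CPU without byte stores must emit for a one-byte
 * store: load the containing word, splice the byte in, store the whole
 * word back.  The comments mark where a second CPU's store to the
 * *adjacent* byte would interleave. */
static uint32_t interleaved_lost_update(void)
{
    uint32_t word = 0;
    uint8_t *bytes = (uint8_t *)&word;

    uint32_t w = word;   /* "CPU A": load word to update its own byte   */
    bytes[1] = 0x42;     /* "CPU B": meanwhile stores the adjacent byte */
    w |= 0x07;           /* "CPU A": splice its byte into the old word  */
    word = w;            /* "CPU A": word store overwrites B's update   */

    /* bytes[1] is now 0 again on any endianness: B's store was lost,
     * even though A and B touched different fields. */
    return word;
}
```

This is exactly the corruption that holding two different locks cannot prevent, because the locks serialize the fields, not the shared storage word underneath them.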

* Re: bit fields && data tearing
  2014-09-04 12:24           ` Peter Hurley
  (?)
@ 2014-09-04 12:29             ` Jakub Jelinek
  -1 siblings, 0 replies; 311+ messages in thread
From: Jakub Jelinek @ 2014-09-04 12:29 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Mikael Pettersson, Benjamin Herrenschmidt, Paul E. McKenney,
	Richard Henderson, Oleg Nesterov, Miroslav Franc, Paul Mackerras,
	linuxppc-dev, linux-kernel, linux-arch, Tony Luck, linux-ia64

On Thu, Sep 04, 2014 at 08:24:12AM -0400, Peter Hurley wrote:
> And I just confirmed with the Alpha cross-compiler that the fields are
> not 'padded out' if volatile either.

They can't be, struct layout is part of the ABI.
Guess you can introduce say atomic_bool and similar typedefs which would be
bool or char on most arches and on alpha int.

	Jakub

^ permalink raw reply	[flat|nested] 311+ messages in thread
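[Editorial sketch: Jakub's typedef idea might look like the fragment below. The type name and config symbols are hypothetical — no such typedef exists in the kernel — and the size check simply confirms the common-case layout.]

```c
/* Hypothetical realisation of the suggestion: a flag type that is one
 * byte wherever byte stores are atomic, widened to int on CPUs that
 * must read-modify-write the surrounding word for sub-word stores. */
#if defined(CONFIG_ALPHA) && !defined(CONFIG_ALPHA_EV56)
typedef int atomic_bool_t;      /* pre-EV56 Alpha: no byte stores */
#else
typedef char atomic_bool_t;     /* byte access is atomic elsewhere */
#endif

static unsigned atomic_bool_size(void)
{
    return (unsigned)sizeof(atomic_bool_t);
}
```

Because the width is chosen per-arch, only the architectures that need the padding pay for it, which answers Peter's objection that bumping storage everywhere is wasteful.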

* Re: bit fields && data tearing
  2014-09-04 12:24           ` Peter Hurley
  (?)
  (?)
@ 2014-09-04 16:50             ` One Thousand Gnomes
  -1 siblings, 0 replies; 311+ messages in thread
From: One Thousand Gnomes @ 2014-09-04 16:50 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Jakub Jelinek, Mikael Pettersson, Benjamin Herrenschmidt,
	Paul E. McKenney, Richard Henderson, Oleg Nesterov,
	Miroslav Franc, Paul Mackerras, linuxppc-dev, linux-kernel,
	linux-arch, Tony Luck, linux-ia64

> Besides updating the documentation, it may make sense to do something
> arch-specific. Just bumping out storage on arches that don't need it
> seems wasteful, as does generating bus locks on arches that don't need it.
> Unfortunately, the code churn looks unavoidable.

The arch specific is pretty much set_bit and friends. Bus locks on a
locally owned cache line should not be very expensive on anything vaguely
modern, while uniprocessor boxes usually only have to generate set_bit
as a single instruction so it is interrupt safe.

Alan

^ permalink raw reply	[flat|nested] 311+ messages in thread
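[Editorial sketch: a userspace approximation of the set_bit()/test_bit() family Alan mentions, written with the GCC/Clang `__atomic` builtins. This is not the kernel's implementation, which is per-architecture assembly.]

```c
#include <limits.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Atomic RMW on the containing long: adjacent bits in the same word
 * cannot be corrupted, which is why the bitops side-step the
 * adjacent-field problem discussed in this thread. */
static void set_bit_sketch(unsigned long nr, unsigned long *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_LONG_SKETCH);

    __atomic_fetch_or(addr + nr / BITS_PER_LONG_SKETCH, mask,
                      __ATOMIC_RELAXED);
}

static int test_bit_sketch(unsigned long nr, const unsigned long *addr)
{
    unsigned long w = __atomic_load_n(addr + nr / BITS_PER_LONG_SKETCH,
                                      __ATOMIC_RELAXED);

    return (int)((w >> (nr % BITS_PER_LONG_SKETCH)) & 1);
}
```

On a uniprocessor the atomic OR degrades to an ordinary (interrupt-safe) instruction, matching Alan's point that the bus lock is only paid where it is needed.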

* Re: bit fields && data tearing
  2014-09-04 16:50             ` One Thousand Gnomes
  (?)
@ 2014-09-04 19:42               ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-04 19:42 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Jakub Jelinek, Mikael Pettersson, Benjamin Herrenschmidt,
	Paul E. McKenney, Richard Henderson, Oleg Nesterov,
	Miroslav Franc, Paul Mackerras, linuxppc-dev, linux-kernel,
	linux-arch, Tony Luck, linux-ia64

On 09/04/2014 12:50 PM, One Thousand Gnomes wrote:
>> Besides updating the documentation, it may make sense to do something
>> arch-specific. Just bumping out storage on arches that don't need it
>> seems wasteful, as does generating bus locks on arches that don't need it.
>> Unfortunately, the code churn looks unavoidable.
> 
> The arch specific is pretty much set_bit and friends. Bus locks on a
> locally owned cache line should not be very expensive on anything vaguely
> modern, while uniprocessor boxes usually only have to generate set_bit
> as a single instruction so it is interrupt safe.

Or we could give up on the Alpha.

It's not just the non-atomic bytes; we could do away with the
read_barrier_depends() which hardly any code gets correctly anyway.

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-04  9:52         ` Benjamin Herrenschmidt
  (?)
@ 2014-09-04 22:14           ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-04 22:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Laight
  Cc: Peter Hurley, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, Paul E. McKenney,
	linuxppc-dev, Miroslav Franc, Richard Henderson

On 09/04/2014 02:52 AM, Benjamin Herrenschmidt wrote:
> 
> Yeah correct, alpha and bytes right ? Is there any other ? That's why I
> suggested int.
> 

Even for Alpha it is only the 21064 AFAIK.

	-hpa



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-04 19:42               ` Peter Hurley
  (?)
@ 2014-09-04 22:16                 ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-04 22:16 UTC (permalink / raw)
  To: Peter Hurley, One Thousand Gnomes
  Cc: Jakub Jelinek, Mikael Pettersson, Benjamin Herrenschmidt,
	Paul E. McKenney, Richard Henderson, Oleg Nesterov,
	Miroslav Franc, Paul Mackerras, linuxppc-dev, linux-kernel,
	linux-arch, Tony Luck, linux-ia64

On 09/04/2014 12:42 PM, Peter Hurley wrote:
> 
> Or we could give up on the Alpha.
> 

If Alpha is turning into Voyager (kept alive only as a museum piece, but
actively causing problems) then please let's kill it.

	-hpa



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-04 22:16                 ` H. Peter Anvin
  (?)
@ 2014-09-05  0:17                   ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05  0:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Thu, Sep 04, 2014 at 03:16:03PM -0700, H. Peter Anvin wrote:
> On 09/04/2014 12:42 PM, Peter Hurley wrote:
> > 
> > Or we could give up on the Alpha.
> > 
> 
> If Alpha is turning into Voyager (kept alive only as a museum piece, but
> actively causing problems) then please let's kill it.

Sorry for being slow to join this thread, but I propose the following
patch.  If we can remove support for all CPUs that do not support
direct access to bytes and shorts (which I would very much like to
see happen), I will remove the last non-guarantee.

							Thanx, Paul

------------------------------------------------------------------------

documentation: Record limitations of bitfields and small variables

This commit documents the fact that it is not safe to use bitfields
as shared variables in synchronization algorithms.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 87be0a8a78de..a28bfe4fd759 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -269,6 +269,26 @@ And there are a number of things that _must_ or _must_not_ be assumed:
 	STORE *(A + 4) = Y; STORE *A = X;
 	STORE {*A, *(A + 4) } = {X, Y};
 
+And there are anti-guarantees:
+
+ (*) These guarantees do not apply to bitfields, because compilers often
+     generate code to modify these using non-atomic read-modify-write
+     sequences.  Do not attempt to use bitfields to synchronize parallel
+     algorithms.
+
+ (*) Even in cases where bitfields are protected by locks, all fields
+     in a given bitfield must be protected by one lock.  If two fields
+     in a given bitfield are protected by different locks, the compiler's
+     non-atomic read-modify-write sequences can cause an update to one
+     field to corrupt the value of an adjacent field.
+
+ (*) These guarantees apply only to properly aligned and sized scalar
+     variables.  "Properly sized" currently means "int" and "long",
+     because some CPU families do not support loads and stores of
+     other sizes.  ("Some CPU families" is currently believed to
+     be only Alpha 21064.  If this is actually the case, a different
+     non-guarantee is likely to be formulated.)
+
 
 =========================
 WHAT ARE MEMORY BARRIERS?


^ permalink raw reply related	[flat|nested] 311+ messages in thread
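[Editorial sketch: the first anti-guarantee in Paul's patch follows directly from bitfield layout. In the struct below (the lock assignments are hypothetical), both flags share one `unsigned int` storage unit on common ABIs, so any store to one flag is a non-atomic read-modify-write of the other's bits.]

```c
/* Two one-bit flags nominally protected by different locks still live
 * in the same storage unit; writing flag_a loads, modifies, and stores
 * flag_b's bits as well. */
struct shared_flags {
    unsigned int flag_a : 1;   /* say, protected by lock A */
    unsigned int flag_b : 1;   /* say, protected by lock B */
};

static unsigned shared_flags_size(void)
{
    /* One storage unit for both flags, not one per flag. */
    return (unsigned)sizeof(struct shared_flags);
}
```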

* Re: bit fields && data tearing
@ 2014-09-05  0:17                   ` Paul E. McKenney
  0 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05  0:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jakub Jelinek, One Thousand Gnomes, Tony Luck, linux-ia64,
	Peter Hurley, Mikael Pettersson, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linux-arch, linuxppc-dev, Miroslav Franc,
	Richard Henderson

On Thu, Sep 04, 2014 at 03:16:03PM -0700, H. Peter Anvin wrote:
> On 09/04/2014 12:42 PM, Peter Hurley wrote:
> > 
> > Or we could give up on the Alpha.
> > 
> 
> If Alpha is turning into Voyager (kept alive only as a museum piece, but
> actively causing problems) then please let's kill it.

Sorry for being slow to join this thread, but I propose the following
patch.  If we can remove support for all CPUs that to not support
direct access to bytes and shorts (which I would very much like to
see happen), I will remove the last non-guarantee.

							Thanx, Paul

------------------------------------------------------------------------

documentation: Record limitations of bitfields and small variables

This commit documents the fact that it is not safe to use bitfields
as shared variables in synchronization algorithms.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 87be0a8a78de..a28bfe4fd759 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -269,6 +269,26 @@ And there are a number of things that _must_ or _must_not_ be assumed:
 	STORE *(A + 4) = Y; STORE *A = X;
 	STORE {*A, *(A + 4) } = {X, Y};
 
+And there are anti-guarantees:
+
+ (*) These guarantees do not apply to bitfields, because compilers often
+     generate code to modify these using non-atomic read-modify-write
+     sequences.  Do not attempt to use bitfields to synchronize parallel
+     algorithms.
+
+ (*) Even in cases where bitfields are protected by locks, all fields
+     in a given bitfield must be protected by one lock.  If two fields
+     in a given bitfield are protected by different locks, the compiler's
+     non-atomic read-modify-write sequences can cause an update to one
+     field to corrupt the value of an adjacent field.
+
+ (*) These guarantees apply only to properly aligned and sized scalar
+     variables.  "Properly sized" currently means "int" and "long",
+     because some CPU families do not support loads and stores of
+     other sizes.  ("Some CPU families" is currently believed to
+     be only Alpha 21064.  If this is actually the case, a different
+     non-guarantee is likely to be formulated.)
+
 
 =========================
 WHAT ARE MEMORY BARRIERS?

^ permalink raw reply related	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-05  0:17                   ` Paul E. McKenney
  0 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05  0:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Thu, Sep 04, 2014 at 03:16:03PM -0700, H. Peter Anvin wrote:
> On 09/04/2014 12:42 PM, Peter Hurley wrote:
> > 
> > Or we could give up on the Alpha.
> > 
> 
> If Alpha is turning into Voyager (kept alive only as a museum piece, but
> actively causing problems) then please let's kill it.

Sorry for being slow to join this thread, but I propose the following
patch.  If we can remove support for all CPUs that do not support
direct access to bytes and shorts (which I would very much like to
see happen), I will remove the last non-guarantee.

							Thanx, Paul

------------------------------------------------------------------------

documentation: Record limitations of bitfields and small variables

This commit documents the fact that it is not safe to use bitfields
as shared variables in synchronization algorithms.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 87be0a8a78de..a28bfe4fd759 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -269,6 +269,26 @@ And there are a number of things that _must_ or _must_not_ be assumed:
 	STORE *(A + 4) = Y; STORE *A = X;
 	STORE {*A, *(A + 4) } = {X, Y};
 
+And there are anti-guarantees:
+
+ (*) These guarantees do not apply to bitfields, because compilers often
+     generate code to modify these using non-atomic read-modify-write
+     sequences.  Do not attempt to use bitfields to synchronize parallel
+     algorithms.
+
+ (*) Even in cases where bitfields are protected by locks, all fields
+     in a given bitfield must be protected by one lock.  If two fields
+     in a given bitfield are protected by different locks, the compiler's
+     non-atomic read-modify-write sequences can cause an update to one
+     field to corrupt the value of an adjacent field.
+
+ (*) These guarantees apply only to properly aligned and sized scalar
+     variables.  "Properly sized" currently means "int" and "long",
+     because some CPU families do not support loads and stores of
+     other sizes.  ("Some CPU families" is currently believed to
+     be only Alpha 21064.  If this is actually the case, a different
+     non-guarantee is likely to be formulated.)
+
 
 
 =========================
 WHAT ARE MEMORY BARRIERS?


^ permalink raw reply related	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-04 22:14           ` H. Peter Anvin
  (?)
@ 2014-09-05  0:59             ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05  0:59 UTC (permalink / raw)
  To: H. Peter Anvin, Benjamin Herrenschmidt, David Laight
  Cc: Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Oleg Nesterov,
	linux-kernel, Paul Mackerras, Paul E. McKenney, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

[ +cc linux-alpha ]

On 09/04/2014 06:14 PM, H. Peter Anvin wrote:
> On 09/04/2014 02:52 AM, Benjamin Herrenschmidt wrote:
>>
>> Yeah correct, alpha and bytes right ? Is there any other ? That's why I
>> suggested int.
>>
> 
> Even for Alpha it is only the 21064 AFAIK.

For -mcpu=ev5 (21164) and the following test

struct x {
	long a;
	char b;
	char c;
	char d;
	char e;
};

void store_b(struct x *p) {
	p->b = 1;
}

gcc generates:

void store_b(struct x *p) {
   0:	08 00 30 a0 	ldl	t0,8(a0)
   4:	01 f1 3f 44 	andnot	t0,0xff,t0
   8:	01 34 20 44 	or	t0,0x1,t0
   c:	08 00 30 b0 	stl	t0,8(a0)
  10:	01 80 fa 6b 	ret

IOW, rmw on 3 adjacent bytes, which is bad :)
For -mcpu=ev56 (21164A), the generated code is:

void store_b(struct x *p) {
   0:	01 00 3f 20 	lda	t0,1
   4:	08 00 30 38 	stb	t0,8(a0)
   8:	01 80 fa 6b 	ret

which is ok.
I have no idea how prevalent the ev56 is compared to the ev5.
Still we're talking about a chip that came out in 1996.

I still hate split caches though.

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05  0:17                   ` Paul E. McKenney
  (?)
@ 2014-09-05  1:57                     ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05  1:57 UTC (permalink / raw)
  To: paulmck, H. Peter Anvin
  Cc: One Thousand Gnomes, Jakub Jelinek, Mikael Pettersson,
	Benjamin Herrenschmidt, Richard Henderson, Oleg Nesterov,
	Miroslav Franc, Paul Mackerras, linuxppc-dev, linux-kernel,
	linux-arch, Tony Luck, linux-ia64, linux-alpha

[ +cc linux-alpha ]

Hi Paul,

On 09/04/2014 08:17 PM, Paul E. McKenney wrote:
> On Thu, Sep 04, 2014 at 03:16:03PM -0700, H. Peter Anvin wrote:
>> On 09/04/2014 12:42 PM, Peter Hurley wrote:
>>>
>>> Or we could give up on the Alpha.
>>>
>>
>> If Alpha is turning into Voyager (kept alive only as a museum piece, but
>> actively causing problems) then please let's kill it.
> 
> Sorry for being slow to join this thread, but I propose the following
> patch.  If we can remove support for all CPUs that do not support
> direct access to bytes and shorts (which I would very much like to
> see happen), I will remove the last non-guarantee.
> 
> 							Thanx, Paul

Although I don't mind the patch below, I don't think the bitfield thing
happened because anyone was confused about what the compiler would do;
here, it's more a case of legacy code that came out from under the
Big Kernel Lock and the bitfield was an oversight.

However, my patch to fix it by splitting the bitfield into 4 bytes
was rejected as insufficient to prevent accidental sharing. This is
what spun off the Alpha discussion about non-atomic byte updates.

FWIW, there are a bunch of problems with both the documentation and
kernel code if adjacent bytes can be overwritten by a single byte write.

Documentation/atomic_ops.txt claims that properly aligned chars are
atomic in the same sense that ints are, which is not true on the Alpha
(hopefully not a possible optimization on other arches -- I tried
with the ia64 cross compiler but it stuck with byte-sized writes).

Pretty much any large aggregate kernel structure is bound to have some
byte-size fields that are either lockless or serialized by different
locks, which may be corrupted by concurrent updates to adjacent data.
IOW, ACCESS_ONCE(), spinlocks, whatever, doesn't prevent adjacent
byte-sized data from being overwritten. I haven't bothered to count
how many global bools/chars there are and whether they might be
overwritten by adjacent updates.

Documentation/circular-buffers.txt and any lockless implementation based
on or similar to it for bytes or shorts will be corrupted if the head
nears the tail.

I'm sure there's other interesting outcomes that haven't come to light.

I think that 'naturally aligned scalar writes are atomic' should be the
minimum arch guarantee.

Regards,
Peter Hurley


> ------------------------------------------------------------------------
> 
> documentation: Record limitations of bitfields and small variables
> 
> This commit documents the fact that it is not safe to use bitfields
> as shared variables in synchronization algorithms.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 87be0a8a78de..a28bfe4fd759 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -269,6 +269,26 @@ And there are a number of things that _must_ or _must_not_ be assumed:
>  	STORE *(A + 4) = Y; STORE *A = X;
>  	STORE {*A, *(A + 4) } = {X, Y};
>  
> +And there are anti-guarantees:
> +
> + (*) These guarantees do not apply to bitfields, because compilers often
> +     generate code to modify these using non-atomic read-modify-write
> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> +     algorithms.
> +
> + (*) Even in cases where bitfields are protected by locks, all fields
> +     in a given bitfield must be protected by one lock.  If two fields
> +     in a given bitfield are protected by different locks, the compiler's
> +     non-atomic read-modify-write sequences can cause an update to one
> +     field to corrupt the value of an adjacent field.
> +
> + (*) These guarantees apply only to properly aligned and sized scalar
> +     variables.  "Properly sized" currently means "int" and "long",
> +     because some CPU families do not support loads and stores of
> +     other sizes.  ("Some CPU families" is currently believed to
> +     be only Alpha 21064.  If this is actually the case, a different
> +     non-guarantee is likely to be formulated.)
> +
>  
>  =========================
>  WHAT ARE MEMORY BARRIERS?
 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05  0:59             ` Peter Hurley
  (?)
@ 2014-09-05  2:08               ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-05  2:08 UTC (permalink / raw)
  To: Peter Hurley, Benjamin Herrenschmidt, David Laight
  Cc: Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Oleg Nesterov,
	linux-kernel, Paul Mackerras, Paul E. McKenney, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On 09/04/2014 05:59 PM, Peter Hurley wrote:
> I have no idea how prevalent the ev56 is compared to the ev5.
> Still we're talking about a chip that came out in 1996.

Ah yes, I stand corrected.  According to Wikipedia, the affected CPUs
were all the 2106x CPUs (EV4, EV45, LCA4, LCA45) plus the 21164 with no
suffix (EV5).  However, we're still talking about museum pieces here.

I wonder what the one I have in my garage is... I'm sure I could emulate
it faster, though.

	-hpa



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05  0:17                   ` Paul E. McKenney
  (?)
@ 2014-09-05  2:11                     ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-05  2:11 UTC (permalink / raw)
  To: paulmck
  Cc: H. Peter Anvin, Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> +And there are anti-guarantees:
> +
> + (*) These guarantees do not apply to bitfields, because compilers often
> +     generate code to modify these using non-atomic read-modify-write
> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> +     algorithms.
> +
> + (*) Even in cases where bitfields are protected by locks, all fields
> +     in a given bitfield must be protected by one lock.  If two fields
> +     in a given bitfield are protected by different locks, the compiler's
> +     non-atomic read-modify-write sequences can cause an update to one
> +     field to corrupt the value of an adjacent field.
> +
> + (*) These guarantees apply only to properly aligned and sized scalar
> +     variables.  "Properly sized" currently means "int" and "long",
> +     because some CPU families do not support loads and stores of
> +     other sizes.  ("Some CPU families" is currently believed to
> +     be only Alpha 21064.  If this is actually the case, a different
> +     non-guarantee is likely to be formulated.)

This is a bit unclear.  Presumably you're talking about definiteness of
the outcome (as in what's seen after multiple stores to the same
variable).  The guarantees are only for natural width on Parisc as well,
so you would get a mess if you did byte stores to adjacent memory
locations.  But multiple 32-bit stores are guaranteed to see one of the
stored values as the final outcome.

James



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-05  2:11                     ` James Bottomley
  0 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-05  2:11 UTC (permalink / raw)
  To: paulmck
  Cc: Jakub Jelinek, One Thousand Gnomes, linux-arch, linux-ia64,
	Peter Hurley, Mikael Pettersson, Oleg Nesterov, linux-kernel,
	Tony Luck, Paul Mackerras, H. Peter Anvin, linuxppc-dev,
	Miroslav Franc, Richard Henderson

On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> +And there are anti-guarantees:
> +
> + (*) These guarantees do not apply to bitfields, because compilers often
> +     generate code to modify these using non-atomic read-modify-write
> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> +     algorithms.
> +
> + (*) Even in cases where bitfields are protected by locks, all fields
> +     in a given bitfield must be protected by one lock.  If two fields
> +     in a given bitfield are protected by different locks, the compiler's
> +     non-atomic read-modify-write sequences can cause an update to one
> +     field to corrupt the value of an adjacent field.
> +
> + (*) These guarantees apply only to properly aligned and sized scalar
> +     variables.  "Properly sized" currently means "int" and "long",
> +     because some CPU families do not support loads and stores of
> +     other sizes.  ("Some CPU families" is currently believed to
> +     be only Alpha 21064.  If this is actually the case, a different
> +     non-guarantee is likely to be formulated.)

This is a bit unclear.  Presumably you're talking about definiteness of
the outcome (as in what's seen after multiple stores to the same
variable).  The guarantees are only for natural width on Parisc as well,
so you would get a mess if you did byte stores to adjacent memory
locations.  But multiple 32-bit stores are guaranteed to leave one of
the stored values as the final outcome.

James

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05  2:11                     ` James Bottomley
  (?)
@ 2014-09-05  2:47                       ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05  2:47 UTC (permalink / raw)
  To: James Bottomley, paulmck
  Cc: H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

Hi James,

On 09/04/2014 10:11 PM, James Bottomley wrote:
> On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
>> +And there are anti-guarantees:
>> +
>> + (*) These guarantees do not apply to bitfields, because compilers often
>> +     generate code to modify these using non-atomic read-modify-write
>> +     sequences.  Do not attempt to use bitfields to synchronize parallel
>> +     algorithms.
>> +
>> + (*) Even in cases where bitfields are protected by locks, all fields
>> +     in a given bitfield must be protected by one lock.  If two fields
>> +     in a given bitfield are protected by different locks, the compiler's
>> +     non-atomic read-modify-write sequences can cause an update to one
>> +     field to corrupt the value of an adjacent field.
>> +
>> + (*) These guarantees apply only to properly aligned and sized scalar
>> +     variables.  "Properly sized" currently means "int" and "long",
>> +     because some CPU families do not support loads and stores of
>> +     other sizes.  ("Some CPU families" is currently believed to
>> +     be only Alpha 21064.  If this is actually the case, a different
>> +     non-guarantee is likely to be formulated.)
> 
> This is a bit unclear.  Presumably you're talking about definiteness of
> the outcome (as in what's seen after multiple stores to the same
> variable).

No, the last condition refers to adjacent byte stores from different
CPU contexts (either interrupt or SMP).

> The guarantees are only for natural width on Parisc as well,
> so you would get a mess if you did byte stores to adjacent memory
> locations.

For a simple test like:

struct x {
	long a;
	char b;
	char c;
	char d;
	char e;
};

void store_bc(struct x *p) {
	p->b = 1;
	p->c = 2;
}

on parisc, gcc generates separate byte stores

void store_bc(struct x *p) {
   0:	34 1c 00 02 	ldi 1,ret0
   4:	0f 5c 12 08 	stb ret0,4(r26)
   8:	34 1c 00 04 	ldi 2,ret0
   c:	e8 40 c0 00 	bv r0(rp)
  10:	0f 5c 12 0a 	stb ret0,5(r26)

which appears to confirm that on parisc adjacent byte data
is safe from corruption by concurrent cpu updates; that is,

CPU 0                | CPU 1
                     |
p->b = 1             | p->c = 2
                     |

will result in p->b == 1 && p->c == 2 (assume both values
were 0 before the call to store_bc()).

Regards,
Peter Hurley



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05  2:47                       ` Peter Hurley
  (?)
@ 2014-09-05  4:06                         ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05  4:06 UTC (permalink / raw)
  To: Peter Hurley
  Cc: James Bottomley, H. Peter Anvin, One Thousand Gnomes,
	Jakub Jelinek, Mikael Pettersson, Benjamin Herrenschmidt,
	Richard Henderson, Oleg Nesterov, Miroslav Franc, Paul Mackerras,
	linuxppc-dev, linux-kernel, linux-arch, Tony Luck, linux-ia64

On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
> Hi James,
> 
> On 09/04/2014 10:11 PM, James Bottomley wrote:
> > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> >> +And there are anti-guarantees:
> >> +
> >> + (*) These guarantees do not apply to bitfields, because compilers often
> >> +     generate code to modify these using non-atomic read-modify-write
> >> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> >> +     algorithms.
> >> +
> >> + (*) Even in cases where bitfields are protected by locks, all fields
> >> +     in a given bitfield must be protected by one lock.  If two fields
> >> +     in a given bitfield are protected by different locks, the compiler's
> >> +     non-atomic read-modify-write sequences can cause an update to one
> >> +     field to corrupt the value of an adjacent field.
> >> +
> >> + (*) These guarantees apply only to properly aligned and sized scalar
> >> +     variables.  "Properly sized" currently means "int" and "long",
> >> +     because some CPU families do not support loads and stores of
> >> +     other sizes.  ("Some CPU families" is currently believed to
> >> +     be only Alpha 21064.  If this is actually the case, a different
> >> +     non-guarantee is likely to be formulated.)
> > 
> > This is a bit unclear.  Presumably you're talking about definiteness of
> > the outcome (as in what's seen after multiple stores to the same
> > variable).
> 
> No, the last condition refers to adjacent byte stores from different
> CPU contexts (either interrupt or SMP).
> 
> > The guarantees are only for natural width on Parisc as well,
> > so you would get a mess if you did byte stores to adjacent memory
> > locations.
> 
> For a simple test like:
> 
> struct x {
> 	long a;
> 	char b;
> 	char c;
> 	char d;
> 	char e;
> };
> 
> void store_bc(struct x *p) {
> 	p->b = 1;
> 	p->c = 2;
> }
> 
> on parisc, gcc generates separate byte stores
> 
> void store_bc(struct x *p) {
>    0:	34 1c 00 02 	ldi 1,ret0
>    4:	0f 5c 12 08 	stb ret0,4(r26)
>    8:	34 1c 00 04 	ldi 2,ret0
>    c:	e8 40 c0 00 	bv r0(rp)
>   10:	0f 5c 12 0a 	stb ret0,5(r26)
> 
> which appears to confirm that on parisc adjacent byte data
> is safe from corruption by concurrent cpu updates; that is,
> 
> CPU 0                | CPU 1
>                      |
> p->b = 1             | p->c = 2
>                      |
> 
> will result in p->b == 1 && p->c == 2 (assume both values
> were 0 before the call to store_bc()).

What Peter said.  I would ask for suggestions for better wording, but
I would much rather be able to say that single-byte reads and writes
are atomic and that aligned-short reads and writes are also atomic.

Thus far, it looks like we lose only very old Alpha systems, so unless
I hear otherwise, I will update my patch to outlaw these systems.

						Thanx, Paul


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05  2:08               ` H. Peter Anvin
  (?)
  (?)
@ 2014-09-05  8:16                 ` Michael Cree
  -1 siblings, 0 replies; 311+ messages in thread
From: Michael Cree @ 2014-09-05  8:16 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Hurley, Benjamin Herrenschmidt, David Laight,
	Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Oleg Nesterov,
	linux-kernel, Paul Mackerras, Paul E. McKenney, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On Thu, Sep 04, 2014 at 07:08:48PM -0700, H. Peter Anvin wrote:
> On 09/04/2014 05:59 PM, Peter Hurley wrote:
> > I have no idea how prevalent the ev56 is compared to the ev5.
> > Still we're talking about a chip that came out in 1996.
> 
> Ah yes, I stand corrected.  According to Wikipedia, the affected CPUs
> were all the 2106x CPUs (EV4, EV45, LCA4, LCA45) plus the 21164 with no
> suffix (EV5).  However, we're still talking about museum pieces here.

Yes, that is correct, EV56 is the first Alpha CPU to have the byte-word
extension (BWX) CPU instructions.

It would not worry me if the kernel decided to assume atomic aligned
scalar accesses for all arches, thus terminating support for Alphas
without BWX.

The X server, ever since the libpciaccess change, does not work on
Alphas without BWX.

Debian Alpha (pretty much up to date at Debian-Ports) is still compiled
for all Alphas, i.e., without BWX.  The last attempt to start compiling
Debian Alpha with BWX, about three years ago when Alpha was kicked out
to Debian-Ports, drew a couple of complaints and went nowhere.
It's frustrating supporting the lowest common denominator, as many of
the bugs specific to Alpha can be resolved by recompiling with BWX.
The kernel no longer supporting Alphas without BWX might just be the
incentive we need to switch Debian Alpha to compiling with BWX.

Cheers
Michael.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* RE: bit fields && data tearing
  2014-09-05  4:06                         ` Paul E. McKenney
  (?)
  (?)
@ 2014-09-05  8:30                           ` David Laight
  -1 siblings, 0 replies; 311+ messages in thread
From: David Laight @ 2014-09-05  8:30 UTC (permalink / raw)
  To: 'paulmck@linux.vnet.ibm.com', Peter Hurley
  Cc: Jakub Jelinek, One Thousand Gnomes, linux-arch, linux-ia64,
	Mikael Pettersson, Oleg Nesterov, linux-kernel, James Bottomley,
	Tony Luck, Paul Mackerras, H. Peter Anvin, linuxppc-dev,
	Miroslav Franc, Richard Henderson

From: Paul E. McKenney
> On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
> > Hi James,
> >
> > On 09/04/2014 10:11 PM, James Bottomley wrote:
> > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> > >> +And there are anti-guarantees:
> > >> +
> > >> + (*) These guarantees do not apply to bitfields, because compilers often
> > >> +     generate code to modify these using non-atomic read-modify-write
> > >> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> > >> +     algorithms.
> > >> +
> > >> + (*) Even in cases where bitfields are protected by locks, all fields
> > >> +     in a given bitfield must be protected by one lock.  If two fields
> > >> +     in a given bitfield are protected by different locks, the compiler's
> > >> +     non-atomic read-modify-write sequences can cause an update to one
> > >> +     field to corrupt the value of an adjacent field.
> > >> +
> > >> + (*) These guarantees apply only to properly aligned and sized scalar
> > >> +     variables.  "Properly sized" currently means "int" and "long",
> > >> +     because some CPU families do not support loads and stores of
> > >> +     other sizes.  ("Some CPU families" is currently believed to
> > >> +     be only Alpha 21064.  If this is actually the case, a different
> > >> +     non-guarantee is likely to be formulated.)
> > >
> > > This is a bit unclear.  Presumably you're talking about definiteness of
> > > the outcome (as in what's seen after multiple stores to the same
> > > variable).
> >
> > No, the last conditions refers to adjacent byte stores from different
> > cpu contexts (either interrupt or SMP).
> >
> > > The guarantees are only for natural width on Parisc as well,
> > > so you would get a mess if you did byte stores to adjacent memory
> > > locations.
> >
> > For a simple test like:
> >
> > struct x {
> > 	long a;
> > 	char b;
> > 	char c;
> > 	char d;
> > 	char e;
> > };
> >
> > void store_bc(struct x *p) {
> > 	p->b = 1;
> > 	p->c = 2;
> > }
> >
> > on parisc, gcc generates separate byte stores
> >
> > void store_bc(struct x *p) {
> >    0:	34 1c 00 02 	ldi 1,ret0
> >    4:	0f 5c 12 08 	stb ret0,4(r26)
> >    8:	34 1c 00 04 	ldi 2,ret0
> >    c:	e8 40 c0 00 	bv r0(rp)
> >   10:	0f 5c 12 0a 	stb ret0,5(r26)
> >
> > which appears to confirm that on parisc adjacent byte data
> > is safe from corruption by concurrent cpu updates; that is,
> >
> > CPU 0                | CPU 1
> >                      |
> > p->b = 1             | p->c = 2
> >                      |
> >
> > will result in p->b == 1 && p->c == 2 (assume both values
> > were 0 before the call to store_bc()).
> 
> What Peter said.  I would ask for suggestions for better wording, but
> I would much rather be able to say that single-byte reads and writes
> are atomic and that aligned-short reads and writes are also atomic.
> 
> Thus far, it looks like we lose only very old Alpha systems, so unless
> I hear otherwise, I update my patch to outlaw these very old systems.

People with old Alphas can run NetBSD instead, along with those who have real VAXen :-)

I've seen gcc generate 32-bit accesses for 16-bit structure members on arm.
It does this because of the more limited offset range of the 16-bit access instructions.
OTOH I don't know if it ever did this for writes - so it may be moot.

	David


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05  8:30                           ` David Laight
  (?)
@ 2014-09-05 12:31                             ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05 12:31 UTC (permalink / raw)
  To: David Laight, 'paulmck@linux.vnet.ibm.com'
  Cc: Jakub Jelinek, One Thousand Gnomes, linux-arch, linux-ia64,
	Mikael Pettersson, Oleg Nesterov, linux-kernel, James Bottomley,
	Tony Luck, Paul Mackerras, H. Peter Anvin, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-arm

[ +cc linux-arm ]

Hi David,

On 09/05/2014 04:30 AM, David Laight wrote:
> I've seen gcc generate 32bit accesses for 16bit structure members on arm.
> It does this because of the more limited range of the offsets for the 16bit access.
> OTOH I don't know if it ever did this for writes - so it may be moot.

Can you recall the particulars, like what ARM config or what code?

I tried an overly-simple test to see if gcc would bump up to the word load for
the 12-bit offset mode, but it stuck with register offset rather than immediate
offset. [I used the compiler options for allmodconfig and a 4.8 cross-compiler.]

Maybe the test doesn't generate enough register pressure on the compiler?

Regards,
Peter Hurley

#define ARRAY_SIZE(x)  (sizeof(x)/sizeof((x)[0]))

struct x {
	long unused[64];
	short b[12];
	int unused2[10];
	short c;
};

void store_c(struct x *p, short a[]) {
	int i;

	for (i = 0; i < ARRAY_SIZE(p->b); i++)
		p->b[i] = a[i];
	p->c = 2;
}


void store_c(struct x *p, short a[]) {
   0:	e1a0c00d 	mov	ip, sp
   4:	e3a03000 	mov	r3, #0
   8:	e92dd800 	push	{fp, ip, lr, pc}
   c:	e24cb004 	sub	fp, ip, #4
	int i;

	for (i = 0; i < ARRAY_SIZE(p->b); i++)
		p->b[i] = a[i];
  10:	e191c0b3 	ldrh	ip, [r1, r3]
  14:	e0802003 	add	r2, r0, r3
  18:	e2822c01 	add	r2, r2, #256	; 0x100
  1c:	e2833002 	add	r3, r3, #2
  20:	e3530018 	cmp	r3, #24
  24:	e1c2c0b0 	strh	ip, [r2]
  28:	1afffff8 	bne	10 <store_c+0x10>
	p->c = 2;
  2c:	e3a03d05 	mov	r3, #320	; 0x140
  30:	e3a02002 	mov	r2, #2
  34:	e18020b3 	strh	r2, [r0, r3]
  38:	e89da800 	ldm	sp, {fp, sp, pc}


^ permalink raw reply	[flat|nested] 311+ messages in thread

* RE: bit fields && data tearing
  2014-09-05 12:31                             ` Peter Hurley
  (?)
  (?)
@ 2014-09-05 12:37                               ` David Laight
  -1 siblings, 0 replies; 311+ messages in thread
From: David Laight @ 2014-09-05 12:37 UTC (permalink / raw)
  To: 'Peter Hurley', 'paulmck@linux.vnet.ibm.com'
  Cc: Jakub Jelinek, One Thousand Gnomes, linux-arch, linux-ia64,
	Mikael Pettersson, Oleg Nesterov, linux-kernel, James Bottomley,
	Tony Luck, Paul Mackerras, H. Peter Anvin, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-arm


From: Peter Hurley
> [ +cc linux-arm ]
> 
> Hi David,
> 
> On 09/05/2014 04:30 AM, David Laight wrote:
> > I've seen gcc generate 32bit accesses for 16bit structure members on arm.
> > It does this because of the more limited range of the offsets for the 16bit access.
> > OTOH I don't know if it ever did this for writes - so it may be moot.
> 
> Can you recall the particulars, like what ARM config or what code?
> 
> I tried an overly-simple test to see if gcc would bump up to the word load for
> the 12-bit offset mode, but it stuck with register offset rather than immediate
> offset. [I used the compiler options for allmodconfig and a 4.8 cross-compiler.]
> 
> Maybe the test doesn't generate enough register pressure on the compiler?

Dunno, I would have been using a much older version of the compiler.
It is possible that it doesn't do it any more.
It might only have done it for loads.

The compiler used to use misaligned 32bit loads for structure
members on large 4n+2 byte boundaries as well.
I'm pretty sure it doesn't do that either.

There have been a lot of compiler versions since I was compiling
anything for arm.

	David

> Regards,
> Peter Hurley
> 
> #define ARRAY_SIZE(x)  (sizeof(x)/sizeof((x)[0]))
> 
> struct x {
> 	long unused[64];
> 	short b[12];
> 	int unused2[10];
> 	short c;
> };
> 
> void store_c(struct x *p, short a[]) {
> 	int i;
> 
> 	for (i = 0; i < ARRAY_SIZE(p->b); i++)
> 		p->b[i] = a[i];
> 	p->c = 2;
> }
> 
> 
> void store_c(struct x *p, short a[]) {
>    0:	e1a0c00d 	mov	ip, sp
>    4:	e3a03000 	mov	r3, #0
>    8:	e92dd800 	push	{fp, ip, lr, pc}
>    c:	e24cb004 	sub	fp, ip, #4
> 	int i;
> 
> 	for (i = 0; i < ARRAY_SIZE(p->b); i++)
> 		p->b[i] = a[i];
>   10:	e191c0b3 	ldrh	ip, [r1, r3]
>   14:	e0802003 	add	r2, r0, r3
>   18:	e2822c01 	add	r2, r2, #256	; 0x100
>   1c:	e2833002 	add	r3, r3, #2
>   20:	e3530018 	cmp	r3, #24
>   24:	e1c2c0b0 	strh	ip, [r2]
>   28:	1afffff8 	bne	10 <store_c+0x10>
> 	p->c = 2;
>   2c:	e3a03d05 	mov	r3, #320	; 0x140
>   30:	e3a02002 	mov	r2, #2
>   34:	e18020b3 	strh	r2, [r0, r3]
>   38:	e89da800 	ldm	sp, {fp, sp, pc}


^ permalink raw reply	[flat|nested] 311+ messages in thread

* RE: bit fields && data tearing
@ 2014-09-05 12:37                               ` David Laight
  0 siblings, 0 replies; 311+ messages in thread
From: David Laight @ 2014-09-05 12:37 UTC (permalink / raw)
  To: 'Peter Hurley', 'paulmck@linux.vnet.ibm.com'
  Cc: Jakub Jelinek, One Thousand Gnomes, linux-arch, linux-ia64,
	Mikael Pettersson, Oleg Nesterov, linux-kernel, James Bottomley,
	Tony Luck, Paul Mackerras, H. Peter Anvin, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-arm

From: Peter Hurley
> [ +cc linux-arm ]
> 
> Hi David,
> 
> On 09/05/2014 04:30 AM, David Laight wrote:
> > I've seen gcc generate 32bit accesses for 16bit structure members on arm.
> > It does this because of the more limited range of the offsets for the 16bit access.
> > OTOH I don't know if it ever did this for writes - so it may be moot.
> 
> Can you recall the particulars, like what ARM config or what code?
> 
> I tried an overly-simple test to see if gcc would bump up to the word load for
> the 12-bit offset mode, but it stuck with register offset rather than immediate
> offset. [I used the compiler options for allmodconfig and a 4.8 cross-compiler.]
> 
> Maybe the test doesn't generate enough register pressure on the compiler?

Dunno, I would have been using a much older version of the compiler.
It is possible that it doesn't do it any more.
It might only have done it for loads.

The compiler used to use misaligned 32bit loads for structure
members on large 4n+2 byte boundaries as well.
I'm pretty sure it doesn't do that either.

There have been a lot of compiler versions since I was compiling
anything for arm.

	David

> Regards,
> Peter Hurley
> 
> #define ARRAY_SIZE(x)  (sizeof(x)/sizeof((x)[0]))
> 
> struct x {
> 	long unused[64];
> 	short b[12];
> 	int unused2[10];
> 	short c;
> };
> 
> void store_c(struct x *p, short a[]) {
> 	int i;
> 
> 	for (i = 0; i < ARRAY_SIZE(p->b); i++)
> 		p->b[i] = a[i];
> 	p->c = 2;
> }
> 
> 
> void store_c(struct x *p, short a[]) {
>    0:	e1a0c00d 	mov	ip, sp
>    4:	e3a03000 	mov	r3, #0
>    8:	e92dd800 	push	{fp, ip, lr, pc}
>    c:	e24cb004 	sub	fp, ip, #4
> 	int i;
> 
> 	for (i = 0; i < ARRAY_SIZE(p->b); i++)
> 		p->b[i] = a[i];
>   10:	e191c0b3 	ldrh	ip, [r1, r3]
>   14:	e0802003 	add	r2, r0, r3
>   18:	e2822c01 	add	r2, r2, #256	; 0x100
>   1c:	e2833002 	add	r3, r3, #2
>   20:	e3530018 	cmp	r3, #24
>   24:	e1c2c0b0 	strh	ip, [r2]
>   28:	1afffff8 	bne	10 <store_c+0x10>
> 	p->c = 2;
>   2c:	e3a03d05 	mov	r3, #320	; 0x140
>   30:	e3a02002 	mov	r2, #2
>   34:	e18020b3 	strh	r2, [r0, r3]
>   38:	e89da800 	ldm	sp, {fp, sp, pc}

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05  2:08               ` H. Peter Anvin
  (?)
@ 2014-09-05 15:31                 ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05 15:31 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson, linux-alpha

On 09/04/2014 10:08 PM, H. Peter Anvin wrote:
> On 09/04/2014 05:59 PM, Peter Hurley wrote:
>> I have no idea how prevalent the ev56 is compared to the ev5.
>> Still we're talking about a chip that came out in 1996.
> 
> Ah yes, I stand corrected.  According to Wikipedia, the affected CPUs
> were all the 2106x CPUs (EV4, EV45, LCA4, LCA45) plus the 21164 with no
> suffix (EV5).  However, we're still talking about museum pieces here.
> 
> I wonder what the one I have in my garage is... I'm sure I could emulate
> it faster, though.

Which is a bit ironic because I remember when Digital had a team
working on emulating native x86 apps on Alpha/NT.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 15:31                 ` Peter Hurley
  (?)
@ 2014-09-05 15:41                   ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-05 15:41 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson, linux-alpha

On 09/05/2014 08:31 AM, Peter Hurley wrote:
> 
> Which is a bit ironic because I remember when Digital had a team
> working on emulating native x86 apps on Alpha/NT.
> 

Right, because the x86 architecture was obsolete and would never scale...

	-hpa


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 12:37                               ` David Laight
  (?)
@ 2014-09-05 16:17                                 ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05 16:17 UTC (permalink / raw)
  To: David Laight, 'paulmck@linux.vnet.ibm.com'
  Cc: Jakub Jelinek, One Thousand Gnomes, linux-arch, linux-ia64,
	Mikael Pettersson, Oleg Nesterov, linux-kernel, James Bottomley,
	Tony Luck, Paul Mackerras, H. Peter Anvin, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-arm

On 09/05/2014 08:37 AM, David Laight wrote:
> From: Peter Hurley
>> On 09/05/2014 04:30 AM, David Laight wrote:
>>> I've seen gcc generate 32bit accesses for 16bit structure members on arm.
>>> It does this because of the more limited range of the offsets for the 16bit access.
>>> OTOH I don't know if it ever did this for writes - so it may be moot.
>>
>> Can you recall the particulars, like what ARM config or what code?
>>
>> I tried an overly-simple test to see if gcc would bump up to the word load for
>> the 12-bit offset mode, but it stuck with register offset rather than immediate
>> offset. [I used the compiler options for allmodconfig and a 4.8 cross-compiler.]
>>
>> Maybe the test doesn't generate enough register pressure on the compiler?
> 
> Dunno, I would have been using a much older version of the compiler.
> It is possible that it doesn't do it any more.
> It might only have done it for loads.
> 
> The compiler used to use misaligned 32bit loads for structure
> members on large 4n+2 byte boundaries as well.
> I'm pretty sure it doesn't do that either.
> 
> There have been a lot of compiler versions since I was compiling
> anything for arm.

Yeah, it seems gcc for ARM no longer uses the larger operand size as a
substitute for 12-bit immediate offset addressing mode, even for reads.

While this test:

struct x {
	short b[12];
};

short load_b(struct x *p) {
	return p->b[8];
}

generates the 8-bit immediate offset form,

short load_b(struct x *p) {
   0:	e1d001f0 	ldrsh	r0, [r0, #16]
   4:	e12fff1e 	bx	lr


pushing the offset out past 256:

struct x {
	long unused[64];
	short b[12];
};

short load_b(struct x *p) {
	return p->b[8];
}

generates the register offset addressing mode instead of 12-bit immediate:

short load_b(struct x *p) {
   0:	e3a03e11 	mov	r3, #272	; 0x110
   4:	e19000f3 	ldrsh	r0, [r0, r3]
   8:	e12fff1e 	bx	lr

Regards,
Peter Hurley

[Note: I compiled without the frame pointer to simplify the code generation]

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05  8:16                 ` Michael Cree
@ 2014-09-05 18:09                   ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05 18:09 UTC (permalink / raw)
  To: Michael Cree, H. Peter Anvin, Peter Hurley,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On Fri, Sep 05, 2014 at 08:16:48PM +1200, Michael Cree wrote:
> On Thu, Sep 04, 2014 at 07:08:48PM -0700, H. Peter Anvin wrote:
> > On 09/04/2014 05:59 PM, Peter Hurley wrote:
> > > I have no idea how prevalent the ev56 is compared to the ev5.
> > > Still we're talking about a chip that came out in 1996.
> > 
> > Ah yes, I stand corrected.  According to Wikipedia, the affected CPUs
> > were all the 2106x CPUs (EV4, EV45, LCA4, LCA45) plus the 21164 with no
> > suffix (EV5).  However, we're still talking about museum pieces here.
> 
> Yes, that is correct, EV56 is the first Alpha CPU to have the byte-word
> extension (BWX) CPU instructions.
> 
> It would not worry me if the kernel decided to assume atomic aligned
> scalar accesses for all arches, thus terminating support for Alphas
> without BWX.
> 
> The X server, ever since the libpciaccess change, does not work on
> Alphas without BWX.
> 
> Debian Alpha (pretty much up to date at Debian-Ports) is still compiled
> for all Alphas, i.e., without BWX.  The last attempt to start compiling
> Debian Alpha with BWX, about three years ago when Alpha was kicked out
> to Debian-Ports resulted in a couple or so complaints so got nowhere.
> It's frustrating supporting the lowest common denominator as many of
> the bugs specific to Alpha can be resolved by recompiling with the BWX.
> The kernel no longer supporting Alphas without BWX might just be the
> incentive we need to switch Debian Alpha to compiling with BWX.

Very good, then I update my patch as follows.  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

documentation: Record limitations of bitfields and small variables

This commit documents the fact that it is not safe to use bitfields as
shared variables in synchronization algorithms.  It also documents that
CPUs must provide one-byte and two-byte load and store instructions
in order to be supported by the Linux kernel.  (Michael Cree
has agreed to the resulting non-support of pre-EV56 Alpha CPUs:
https://lkml.org/lkml/2014/9/5/143.)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 87be0a8a78de..455df6b298f7 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -269,6 +269,30 @@ And there are a number of things that _must_ or _must_not_ be assumed:
 	STORE *(A + 4) = Y; STORE *A = X;
 	STORE {*A, *(A + 4) } = {X, Y};
 
+And there are anti-guarantees:
+
+ (*) These guarantees do not apply to bitfields, because compilers often
+     generate code to modify these using non-atomic read-modify-write
+     sequences.  Do not attempt to use bitfields to synchronize parallel
+     algorithms.
+
+ (*) Even in cases where bitfields are protected by locks, all fields
+     in a given bitfield must be protected by one lock.  If two fields
+     in a given bitfield are protected by different locks, the compiler's
+     non-atomic read-modify-write sequences can cause an update to one
+     field to corrupt the value of an adjacent field.
+
+ (*) These guarantees apply only to properly aligned and sized scalar
+     variables.  "Properly sized" currently means variables that are the
+     same size as "char", "short", "int" and "long".  "Properly aligned"
+     means the natural alignment, thus no constraints for "char",
+     two-byte alignment for "short", four-byte alignment for "int",
+     and either four-byte or eight-byte alignment for "long", on 32-bit
+     and 64-bit systems, respectively.  Note that this means that the
+     Linux kernel does not support pre-EV56 Alpha CPUs, because these
+     older CPUs do not provide one-byte and two-byte loads and stores.
+     Alpha EV56 and later Alpha CPUs are still supported.
+
 
 =========================
 WHAT ARE MEMORY BARRIERS?


^ permalink raw reply related	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 18:09                   ` Paul E. McKenney
@ 2014-09-05 18:31                     ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05 18:31 UTC (permalink / raw)
  To: Michael Cree, H. Peter Anvin, Peter Hurley,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha
  Cc: peterz

On Fri, Sep 05, 2014 at 11:09:50AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 05, 2014 at 08:16:48PM +1200, Michael Cree wrote:
> > On Thu, Sep 04, 2014 at 07:08:48PM -0700, H. Peter Anvin wrote:
> > > On 09/04/2014 05:59 PM, Peter Hurley wrote:
> > > > I have no idea how prevalent the ev56 is compared to the ev5.
> > > > Still we're talking about a chip that came out in 1996.
> > > 
> > > Ah yes, I stand corrected.  According to Wikipedia, the affected CPUs
> > > were all the 2106x CPUs (EV4, EV45, LCA4, LCA45) plus the 21164 with no
> > > suffix (EV5).  However, we're still talking about museum pieces here.
> > 
> > Yes, that is correct, EV56 is the first Alpha CPU to have the byte-word
> > extension (BWX) CPU instructions.
> > 
> > It would not worry me if the kernel decided to assume atomic aligned
> > scalar accesses for all arches, thus terminating support for Alphas
> > without BWX.
> > 
> > The X server, ever since the libpciaccess change, does not work on
> > Alphas without BWX.
> > 
> > Debian Alpha (pretty much up to date at Debian-Ports) is still compiled
> > for all Alphas, i.e., without BWX.  The last attempt to start compiling
> > Debian Alpha with BWX, about three years ago when Alpha was kicked out
> > to Debian-Ports resulted in a couple or so complaints so got nowhere.
> > It's frustrating supporting the lowest common denominator as many of
> > the bugs specific to Alpha can be resolved by recompiling with the BWX.
> > The kernel no longer supporting Alphas without BWX might just be the
> > incentive we need to switch Debian Alpha to compiling with BWX.
> 
> Very good, then I update my patch as follows.  Thoughts?

And, while I am at it, fix smp_load_acquire() and smp_store_release()
to allow single-byte and double-byte accesses.  (Adding Peter Zijlstra
on CC.)

							Thanx, Paul

------------------------------------------------------------------------

compiler: Allow 1- and 2-byte smp_load_acquire() and smp_store_release()

CPUs without single-byte and double-byte loads and stores place some
"interesting" requirements on concurrent code.  For example (adapted
from Peter Hurley's test code), suppose we have the following structure:
    
    	struct foo {
    		spinlock_t lock1;
    		spinlock_t lock2;
    		char a; /* Protected by lock1. */
    		char b; /* Protected by lock2. */
    	};
    	struct foo *foop;
    
Of course, it is common (and good) practice to place data protected
by different locks in separate cache lines.  However, if the locks are
rarely acquired (for example, only in rare error cases), and there are
a great many instances of the data structure, then memory footprint can
trump false-sharing concerns, so that it can be better to place them in
the same cache line as above.

But if the CPU does not support single-byte loads and stores, a store
to foop->a will do a non-atomic read-modify-write operation on foop->b,
which will come as a nasty surprise to someone holding foop->lock2.  So we
now require CPUs to support single-byte and double-byte loads and stores.
Therefore, this commit adjusts the definition of __native_word() to allow
these sizes to be used by smp_load_acquire() and smp_store_release().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index d5ad7b1118fc..934a834ab9f9 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -311,7 +311,7 @@ void ftrace_likely_update(struct ftrace_branch_data *f, int val, int expect);
 
 /* Is this type a native word size -- useful for atomic operations */
 #ifndef __native_word
-# define __native_word(t) (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
+# define __native_word(t) (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
 #endif
 
 /* Compile time object size, -1 for unknown */


^ permalink raw reply related	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-05 18:31                     ` Paul E. McKenney
  0 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05 18:31 UTC (permalink / raw)
  To: Michael Cree, H. Peter Anvin, Peter Hurley,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha
  Cc: peterz

On Fri, Sep 05, 2014 at 11:09:50AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 05, 2014 at 08:16:48PM +1200, Michael Cree wrote:
> > On Thu, Sep 04, 2014 at 07:08:48PM -0700, H. Peter Anvin wrote:
> > > On 09/04/2014 05:59 PM, Peter Hurley wrote:
> > > > I have no idea how prevalent the ev56 is compared to the ev5.
> > > > Still we're talking about a chip that came out in 1996.
> > > 
> > > Ah yes, I stand corrected.  According to Wikipedia, the affected CPUs
> > > were all the 2106x CPUs (EV4, EV45, LCA4, LCA45) plus the 21164 with no
> > > suffix (EV5).  However, we're still talking about museum pieces here.
> > 
> > Yes, that is correct, EV56 is the first Alpha CPU to have the byte-word
> > extension (BWX) CPU instructions.
> > 
> > It would not worry me if the kernel decided to assume atomic aligned
> > scalar accesses for all arches, thus terminating support for Alphas
> > without BWX.
> > 
> > The X server, ever since the libpciaccess change, does not work on
> > Alphas without BWX.
> > 
> > Debian Alpha (pretty much up to date at Debian-Ports) is still compiled
> > for all Alphas, i.e., without BWX.  The last attempt to start compiling
> > Debian Alpha with BWX, about three years ago when Alpha was kicked out
> > to Debian-Ports resulted in a couple or so complaints so got nowhere.
> > It's frustrating supporting the lowest common demoninator as many of
> > the bugs specific to Alpha can be resolved by recompiling with the BWX.
> > The kernel no longer supporting Alphas without BWX might just be the
> > incentive we need to switch Debian Alpha to compiling with BWX.
> 
> Very good, then I update my patch as follows.  Thoughts?

And, while I am at it, fix smp_load_acquire() and smp_store_release()
to allow single-byte and double-byte accesses.  (Adding Peter Zijlstra
on CC.)

							Thanx, Paul

------------------------------------------------------------------------

compiler: Allow 1- and 2-byte smp_load_acquire() and smp_store_release()

CPUs without single-byte and double-byte loads and stores place some
"interesting" requirements on concurrent code.  For example (adapted
from Peter Hurley's test code), suppose we have the following structure:
    
    	struct foo {
    		spinlock_t lock1;
    		spinlock_t lock2;
    		char a; /* Protected by lock1. */
    		char b; /* Protected by lock2. */
    	};
    	struct foo *foop;
    
Of course, it is common (and good) practice to place data protected
by different locks in separate cache lines.  However, if the locks are
rarely acquired (for example, only in rare error cases), and there are
a great many instances of the data structure, then memory footprint can
trump false-sharing concerns, so that it can be better to place them in
the same cache line as above.

But if the CPU does not support single-byte loads and stores, a store
to foop->a will do a non-atomic read-modify-write operation on foop->b,
which will come as a nasty surprise to someone holding foop->lock2.  So we
now require CPUs to support single-byte and double-byte loads and stores.
Therefore, this commit adjusts the definition of __native_word() to allow
these sizes to be used by smp_load_acquire() and smp_store_release().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index d5ad7b1118fc..934a834ab9f9 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -311,7 +311,7 @@ void ftrace_likely_update(struct ftrace_branch_data *f, int val, int expect);
 
 /* Is this type a native word size -- useful for atomic operations */
 #ifndef __native_word
-# define __native_word(t) (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
+# define __native_word(t) (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
 #endif
 
 /* Compile time object size, -1 for unknown */


^ permalink raw reply related	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 18:09                   ` Paul E. McKenney
@ 2014-09-05 18:50                     ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05 18:50 UTC (permalink / raw)
  To: paulmck, Michael Cree, H. Peter Anvin, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On 09/05/2014 02:09 PM, Paul E. McKenney wrote:
> On Fri, Sep 05, 2014 at 08:16:48PM +1200, Michael Cree wrote:
>> On Thu, Sep 04, 2014 at 07:08:48PM -0700, H. Peter Anvin wrote:
>>> On 09/04/2014 05:59 PM, Peter Hurley wrote:
>>>> I have no idea how prevalent the ev56 is compared to the ev5.
>>>> Still we're talking about a chip that came out in 1996.
>>>
>>> Ah yes, I stand corrected.  According to Wikipedia, the affected CPUs
>>> were all the 2106x CPUs (EV4, EV45, LCA4, LCA45) plus the 21164 with no
>>> suffix (EV5).  However, we're still talking about museum pieces here.
>>
>> Yes, that is correct, EV56 is the first Alpha CPU to have the byte-word
>> extension (BWX) CPU instructions.
>>
>> It would not worry me if the kernel decided to assume atomic aligned
>> scalar accesses for all arches, thus terminating support for Alphas
>> without BWX.
>>
>> The X server, ever since the libpciaccess change, does not work on
>> Alphas without BWX.
>>
>> Debian Alpha (pretty much up to date at Debian-Ports) is still compiled
>> for all Alphas, i.e., without BWX.  The last attempt to start compiling
>> Debian Alpha with BWX, about three years ago when Alpha was kicked out
>> to Debian-Ports resulted in a couple or so complaints so got nowhere.
> >> It's frustrating supporting the lowest common denominator as many of
>> the bugs specific to Alpha can be resolved by recompiling with the BWX.
>> The kernel no longer supporting Alphas without BWX might just be the
>> incentive we need to switch Debian Alpha to compiling with BWX.
> 
> Very good, then I update my patch as follows.  Thoughts?
> 
> 							Thanx, Paul

Minor [optional] edits.

Thanks,
Peter Hurley

> ------------------------------------------------------------------------
> 
> documentation: Record limitations of bitfields and small variables
> 
> This commit documents the fact that it is not safe to use bitfields as
> shared variables in synchronization algorithms.  It also documents that
> CPUs must provide one-byte and two-byte load and store instructions
                   ^
                atomic
> in order to be supported by the Linux kernel.  (Michael Cree
> has agreed to the resulting non-support of pre-EV56 Alpha CPUs:
> https://lkml.org/lkml/2014/9/5/143.)
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 87be0a8a78de..455df6b298f7 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -269,6 +269,30 @@ And there are a number of things that _must_ or _must_not_ be assumed:
>  	STORE *(A + 4) = Y; STORE *A = X;
>  	STORE {*A, *(A + 4) } = {X, Y};
>  
> +And there are anti-guarantees:
> +
> + (*) These guarantees do not apply to bitfields, because compilers often
> +     generate code to modify these using non-atomic read-modify-write
> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> +     algorithms.
> +
> + (*) Even in cases where bitfields are protected by locks, all fields
> +     in a given bitfield must be protected by one lock.  If two fields
> +     in a given bitfield are protected by different locks, the compiler's
> +     non-atomic read-modify-write sequences can cause an update to one
> +     field to corrupt the value of an adjacent field.
> +
> + (*) These guarantees apply only to properly aligned and sized scalar
> +     variables.  "Properly sized" currently means variables that are the
> +     same size as "char", "short", "int" and "long".  "Properly aligned"
> +     means the natural alignment, thus no constraints for "char",
> +     two-byte alignment for "short", four-byte alignment for "int",
> +     and either four-byte or eight-byte alignment for "long", on 32-bit
> +     and 64-bit systems, respectively.  Note that this means that the
> +     Linux kernel does not support pre-EV56 Alpha CPUs, because these
> +     older CPUs do not provide one-byte and two-byte loads and stores.
                                 ^
                            non-atomic
> +     Alpha EV56 and later Alpha CPUs are still supported.
> +
>  
>  =========================
>  WHAT ARE MEMORY BARRIERS?
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 18:50                     ` Peter Hurley
  (?)
@ 2014-09-05 19:05                       ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05 19:05 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Michael Cree, H. Peter Anvin, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On Fri, Sep 05, 2014 at 02:50:31PM -0400, Peter Hurley wrote:
> On 09/05/2014 02:09 PM, Paul E. McKenney wrote:
> > On Fri, Sep 05, 2014 at 08:16:48PM +1200, Michael Cree wrote:
> >> On Thu, Sep 04, 2014 at 07:08:48PM -0700, H. Peter Anvin wrote:
> >>> On 09/04/2014 05:59 PM, Peter Hurley wrote:
> >>>> I have no idea how prevalent the ev56 is compared to the ev5.
> >>>> Still we're talking about a chip that came out in 1996.
> >>>
> >>> Ah yes, I stand corrected.  According to Wikipedia, the affected CPUs
> >>> were all the 2106x CPUs (EV4, EV45, LCA4, LCA45) plus the 21164 with no
> >>> suffix (EV5).  However, we're still talking about museum pieces here.
> >>
> >> Yes, that is correct, EV56 is the first Alpha CPU to have the byte-word
> >> extension (BWX) CPU instructions.
> >>
> >> It would not worry me if the kernel decided to assume atomic aligned
> >> scalar accesses for all arches, thus terminating support for Alphas
> >> without BWX.
> >>
> >> The X server, ever since the libpciaccess change, does not work on
> >> Alphas without BWX.
> >>
> >> Debian Alpha (pretty much up to date at Debian-Ports) is still compiled
> >> for all Alphas, i.e., without BWX.  The last attempt to start compiling
> >> Debian Alpha with BWX, about three years ago when Alpha was kicked out
> >> to Debian-Ports resulted in a couple or so complaints so got nowhere.
> >> It's frustrating supporting the lowest common denominator as many of
> >> the bugs specific to Alpha can be resolved by recompiling with the BWX.
> >> The kernel no longer supporting Alphas without BWX might just be the
> >> incentive we need to switch Debian Alpha to compiling with BWX.
> > 
> > Very good, then I update my patch as follows.  Thoughts?
> > 
> > 							Thanx, Paul
> 
> Minor [optional] edits.
> 
> Thanks,
> Peter Hurley
> 
> > ------------------------------------------------------------------------
> > 
> > documentation: Record limitations of bitfields and small variables
> > 
> > This commit documents the fact that it is not safe to use bitfields as
> > shared variables in synchronization algorithms.  It also documents that
> > CPUs must provide one-byte and two-byte load and store instructions
>                    ^
>                 atomic

Here you meant non-atomic?  My guess is that you are referring to the
fact that you could emulate a one-byte store on pre-EV56 Alpha CPUs
using the ll and sc atomic-read-modify-write instructions, correct?

> > in order to be supported by the Linux kernel.  (Michael Cree
> > has agreed to the resulting non-support of pre-EV56 Alpha CPUs:
> > https://lkml.org/lkml/2014/9/5/143.)
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 87be0a8a78de..455df6b298f7 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -269,6 +269,30 @@ And there are a number of things that _must_ or _must_not_ be assumed:
> >  	STORE *(A + 4) = Y; STORE *A = X;
> >  	STORE {*A, *(A + 4) } = {X, Y};
> >  
> > +And there are anti-guarantees:
> > +
> > + (*) These guarantees do not apply to bitfields, because compilers often
> > +     generate code to modify these using non-atomic read-modify-write
> > +     sequences.  Do not attempt to use bitfields to synchronize parallel
> > +     algorithms.
> > +
> > + (*) Even in cases where bitfields are protected by locks, all fields
> > +     in a given bitfield must be protected by one lock.  If two fields
> > +     in a given bitfield are protected by different locks, the compiler's
> > +     non-atomic read-modify-write sequences can cause an update to one
> > +     field to corrupt the value of an adjacent field.
> > +
> > + (*) These guarantees apply only to properly aligned and sized scalar
> > +     variables.  "Properly sized" currently means variables that are the
> > +     same size as "char", "short", "int" and "long".  "Properly aligned"
> > +     means the natural alignment, thus no constraints for "char",
> > +     two-byte alignment for "short", four-byte alignment for "int",
> > +     and either four-byte or eight-byte alignment for "long", on 32-bit
> > +     and 64-bit systems, respectively.  Note that this means that the
> > +     Linux kernel does not support pre-EV56 Alpha CPUs, because these
> > +     older CPUs do not provide one-byte and two-byte loads and stores.
>                                  ^
>                             non-atomic

I took this, thank you!

							Thanx, Paul

> > +     Alpha EV56 and later Alpha CPUs are still supported.
> > +
> >  
> >  =========================
> >  WHAT ARE MEMORY BARRIERS?
> > 
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 19:05                       ` Paul E. McKenney
  (?)
@ 2014-09-05 19:24                         ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05 19:24 UTC (permalink / raw)
  To: paulmck
  Cc: Michael Cree, H. Peter Anvin, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On 09/05/2014 03:05 PM, Paul E. McKenney wrote:
> On Fri, Sep 05, 2014 at 02:50:31PM -0400, Peter Hurley wrote:
>> On 09/05/2014 02:09 PM, Paul E. McKenney wrote:

[cut]

>>> ------------------------------------------------------------------------
>>>
>>> documentation: Record limitations of bitfields and small variables
>>>
>>> This commit documents the fact that it is not safe to use bitfields as
>>> shared variables in synchronization algorithms.  It also documents that
>>> CPUs must provide one-byte and two-byte load and store instructions
>>                    ^
>>                 atomic
> 
> Here you meant non-atomic?  My guess is that you are referring to the
> fact that you could emulate a one-byte store on pre-EV56 Alpha CPUs
> using the ll and sc atomic-read-modify-write instructions, correct?

Yes, that's what I meant. I must be tired and am misreading the commit
message, or misinterpreting its meaning.


^ permalink raw reply	[flat|nested] 311+ messages in thread

* RE: bit fields && data tearing
  2014-09-05 19:05                       ` Paul E. McKenney
  (?)
@ 2014-09-05 19:38                         ` Marc Gauthier
  -1 siblings, 0 replies; 311+ messages in thread
From: Marc Gauthier @ 2014-09-05 19:38 UTC (permalink / raw)
  To: paulmck, Peter Hurley
  Cc: Michael Cree, H. Peter Anvin, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

Paul E. McKenney wrote:
>On Fri, Sep 05, 2014 at 02:50:31PM -0400, Peter Hurley wrote:
>>On 09/05/2014 02:09 PM, Paul E. McKenney wrote:
>>> This commit documents the fact that it is not safe to use bitfields as
>>> shared variables in synchronization algorithms.  It also documents that
>>> CPUs must provide one-byte and two-byte load and store instructions
>>                    ^
>>                 atomic
> 
> Here you meant non-atomic?  My guess is that you are referring to the
> fact that you could emulate a one-byte store on pre-EV56 Alpha CPUs
> using the ll and sc atomic-read-modify-write instructions, correct?
> 
>>> in order to be supported by the Linux kernel.  (Michael Cree
>>> has agreed to the resulting non-support of pre-EV56 Alpha CPUs:
>>> https://lkml.org/lkml/2014/9/5/143.)
[...]

>>> +     and 64-bit systems, respectively.  Note that this means that the
>>> +     Linux kernel does not support pre-EV56 Alpha CPUs, because these
>>> +     older CPUs do not provide one-byte and two-byte loads and stores.
>>                                  ^
>>                             non-atomic
> 
> I took this, thank you!

Eum, am I totally lost, or aren't both of these supposed to say "atomic" ?

Can't imagine requiring a CPU to provide non-atomic loads and stores
(i.e. requiring old Alpha behavior?).

-Marc

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 18:31                     ` Paul E. McKenney
  (?)
@ 2014-09-05 19:52                       ` Peter Zijlstra
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2014-09-05 19:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michael Cree, H. Peter Anvin, Peter Hurley,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On Fri, Sep 05, 2014 at 11:31:09AM -0700, Paul E. McKenney wrote:
> compiler: Allow 1- and 2-byte smp_load_acquire() and smp_store_release()
> 
> CPUs without single-byte and double-byte loads and stores place some
> "interesting" requirements on concurrent code.  For example (adapted
> from Peter Hurley's test code), suppose we have the following structure:
>     
>     	struct foo {
>     		spinlock_t lock1;
>     		spinlock_t lock2;
>     		char a; /* Protected by lock1. */
>     		char b; /* Protected by lock2. */
>     	};
>     	struct foo *foop;
>     
> Of course, it is common (and good) practice to place data protected
> by different locks in separate cache lines.  However, if the locks are
> rarely acquired (for example, only in rare error cases), and there are
> a great many instances of the data structure, then memory footprint can
> trump false-sharing concerns, so that it can be better to place them in
> the same cache line as above.
> 
> But if the CPU does not support single-byte loads and stores, a store
> to foop->a will do a non-atomic read-modify-write operation on foop->b,
> which will come as a nasty surprise to someone holding foop->lock2.  So we
> now require CPUs to support single-byte and double-byte loads and stores.
> Therefore, this commit adjusts the definition of __native_word() to allow
> these sizes to be used by smp_load_acquire() and smp_store_release().

So does this patch depend on a patch that removes pre-EV56 Alpha
support? I'm all for removing that, but I need to see the patch merged
before we can do this.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 19:52                       ` Peter Zijlstra
  (?)
@ 2014-09-05 20:01                         ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05 20:01 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: Michael Cree, H. Peter Anvin, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On 09/05/2014 03:52 PM, Peter Zijlstra wrote:
> On Fri, Sep 05, 2014 at 11:31:09AM -0700, Paul E. McKenney wrote:
>> compiler: Allow 1- and 2-byte smp_load_acquire() and smp_store_release()
>>
>> CPUs without single-byte and double-byte loads and stores place some
>> "interesting" requirements on concurrent code.  For example (adapted
>> from Peter Hurley's test code), suppose we have the following structure:
>>     
>>     	struct foo {
>>     		spinlock_t lock1;
>>     		spinlock_t lock2;
>>     		char a; /* Protected by lock1. */
>>     		char b; /* Protected by lock2. */
>>     	};
>>     	struct foo *foop;
>>     
>> Of course, it is common (and good) practice to place data protected
>> by different locks in separate cache lines.  However, if the locks are
>> rarely acquired (for example, only in rare error cases), and there are
>> a great many instances of the data structure, then memory footprint can
>> trump false-sharing concerns, so that it can be better to place them in
>> the same cache line as above.
>>
>> But if the CPU does not support single-byte loads and stores, a store
>> to foop->a will do a non-atomic read-modify-write operation on foop->b,
>> which will come as a nasty surprise to someone holding foop->lock2.  So we
>> now require CPUs to support single-byte and double-byte loads and stores.
>> Therefore, this commit adjusts the definition of __native_word() to allow
>> these sizes to be used by smp_load_acquire() and smp_store_release().
> 
> So does this patch depend on a patch that removes pre-EV56 Alpha
> support? I'm all for removing that, but I need to see the patch merged
> before we can do this.

I'm working on that, but Alpha's Kconfig is not quite straightforward.


... and I'm wondering if I should _remove_ pre-EV56 configurations or
move the default choice and produce a warning about unsupported Alpha
CPUs instead?

Regards,
Peter Hurley

[ How does one do a red popup in kbuild?
  The 'comment' approach is too subtle.
]
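One sketch of the warn-instead-of-remove option (a hypothetical fragment only: the symbol name, prompt, and help text below are illustrative and not taken from the real arch/alpha/Kconfig):

```kconfig
# Hypothetical sketch: gate pre-EV56 system types behind BROKEN rather
# than silently building a kernel whose byte stores can tear.
config ALPHA_PRE_EV56
	bool "Pre-EV56 Alpha CPU (UNSUPPORTED)"
	depends on BROKEN
	help
	  CPUs without the BWX extension lack atomic one-byte and
	  two-byte loads and stores, which the Linux kernel now
	  requires.  Kernels built for these CPUs are unsupported.
```

With `depends on BROKEN`, the option disappears from oldconfig unless the user deliberately enables BROKEN, which is about as loud as Kconfig gets short of removing the choice outright.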




^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 19:24                         ` Peter Hurley
  (?)
  (?)
@ 2014-09-05 20:09                           ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05 20:09 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Michael Cree, H. Peter Anvin, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On Fri, Sep 05, 2014 at 03:24:35PM -0400, Peter Hurley wrote:
> On 09/05/2014 03:05 PM, Paul E. McKenney wrote:
> > On Fri, Sep 05, 2014 at 02:50:31PM -0400, Peter Hurley wrote:
> >> On 09/05/2014 02:09 PM, Paul E. McKenney wrote:
> 
> [cut]
> 
> >>> ------------------------------------------------------------------------
> >>>
> >>> documentation: Record limitations of bitfields and small variables
> >>>
> >>> This commit documents the fact that it is not safe to use bitfields as
> >>> shared variables in synchronization algorithms.  It also documents that
> >>> CPUs must provide one-byte and two-byte load and store instructions
> >>                    ^
> >>                 atomic
> > 
> > Here you meant non-atomic?  My guess is that you are referring to the
> > fact that you could emulate a one-byte store on pre-EV56 Alpha CPUs
> > using the ll and sc atomic-read-modify-write instructions, correct?
> 
> Yes, that's what I meant. I must be tired and am misreading the commit
> message, or misinterpreting it's meaning.

Very good, got it!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 20:01                         ` Peter Hurley
  (?)
@ 2014-09-05 20:12                           ` Peter Zijlstra
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2014-09-05 20:12 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Paul E. McKenney, Michael Cree, H. Peter Anvin,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On Fri, Sep 05, 2014 at 04:01:35PM -0400, Peter Hurley wrote:
> > So does this patch depend on a patch that removes pre-EV56 Alpha
> > support? I'm all for removing that, but I need to see the patch merged
> > before we can do this.
> 
> I'm working on that but Alpha's Kconfig is not quite straightforward.
> 
> 
> ... and I'm wondering if I should _remove_ pre-EV56 configurations or
> move the default choice and produce a warning about unsupported Alpha
> CPUs instead?
> 

 depends on BROKEN

or is that deprecated?

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 19:38                         ` Marc Gauthier
  (?)
@ 2014-09-05 20:14                           ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05 20:14 UTC (permalink / raw)
  To: Marc Gauthier, paulmck
  Cc: Michael Cree, H. Peter Anvin, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On 09/05/2014 03:38 PM, Marc Gauthier wrote:
> Paul E. McKenney wrote:
>> On Fri, Sep 05, 2014 at 02:50:31PM -0400, Peter Hurley wrote:
>>> On 09/05/2014 02:09 PM, Paul E. McKenney wrote:
>>>> This commit documents the fact that it is not safe to use bitfields as
>>>> shared variables in synchronization algorithms.  It also documents that
>>>> CPUs must provide one-byte and two-byte load and store instructions
>>>                    ^
>>>                 atomic
>>
>> Here you meant non-atomic?  My guess is that you are referring to the
>> fact that you could emulate a one-byte store on pre-EV56 Alpha CPUs
>> using the ll and sc atomic-read-modify-write instructions, correct?
>>
>>>> in order to be supported by the Linux kernel.  (Michael Cree
>>>> has agreed to the resulting non-support of pre-EV56 Alpha CPUs:
>>>> https://lkml.org/lkml/2014/9/5/143.)
> [...]
> 
>>>> +     and 64-bit systems, respectively.  Note that this means that the
>>>> +     Linux kernel does not support pre-EV56 Alpha CPUs, because these
>>>> +     older CPUs do not provide one-byte and two-byte loads and stores.
>>>                                  ^
>>>                             non-atomic
>>
>> I took this, thank you!
> 
> Eum, am I totally lost, or aren't both of these supposed to say "atomic" ?
> 
> Can't imagine requiring a CPU to provide non-atomic loads and stores
> (i.e. requiring old Alpha behavior?).

Here's how I read the two statements.

First, the commit message:

"It [this commit] documents that CPUs [supported by the Linux kernel]
_must provide_ atomic one-byte and two-byte naturally aligned loads and stores."

Second, in the body of the document:

"The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-05 20:14                           ` Peter Hurley
  0 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05 20:14 UTC (permalink / raw)
  To: Marc Gauthier, paulmck
  Cc: Michael Cree, H. Peter Anvin, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On 09/05/2014 03:38 PM, Marc Gauthier wrote:
> Paul E. McKenney wrote:
>> On Fri, Sep 05, 2014 at 02:50:31PM -0400, Peter Hurley wrote:
>>> On 09/05/2014 02:09 PM, Paul E. McKenney wrote:
>>>> This commit documents the fact that it is not safe to use bitfields as
>>>> shared variables in synchronization algorithms.  It also documents that
>>>> CPUs must provide one-byte and two-byte load and store instructions
>>>                    ^
>>>                 atomic
>>
>> Here you meant non-atomic?  My guess is that you are referring to the
>> fact that you could emulate a one-byte store on pre-EV56 Alpha CPUs
>> using the ll and sc atomic-read-modify-write instructions, correct?
>>
>>>> in order to be supported by the Linux kernel.  (Michael Cree
>>>> has agreed to the resulting non-support of pre-EV56 Alpha CPUs:
>>>> https://lkml.org/lkml/2014/9/5/143.
> [...]
> 
>>>> +     and 64-bit systems, respectively.  Note that this means that the
>>>> +     Linux kernel does not support pre-EV56 Alpha CPUs, because these
>>>> +     older CPUs do not provide one-byte and two-byte loads and stores.
>>>                                  ^
>>>                             non-atomic
>>
>> I took this, thank you!
> 
> Eum, am I totally lost, or aren't both of these supposed to say "atomic" ?
> 
> Can't imagine requiring a CPU to provide non-atomic loads and stores
> (i.e. requiring old Alpha behavior?).

Here's how I read the two statements.

First, the commit message:

"It [this commit] documents that CPUs [supported by the Linux kernel]
_must provide_ atomic one-byte and two-byte naturally aligned loads and stores."

Second, in the body of the document:

"The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 20:12                           ` Peter Zijlstra
  (?)
@ 2014-09-05 20:15                             ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-05 20:15 UTC (permalink / raw)
  To: Peter Zijlstra, Peter Hurley
  Cc: Paul E. McKenney, Michael Cree, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On 09/05/2014 01:12 PM, Peter Zijlstra wrote:
>>
>> ... and I'm wondering if I should _remove_ pre-EV56 configurations or
>> move the default choice and produce a warning about unsupported Alpha
>> CPUs instead?
>>
> 
>  depends BROKEN 
> 
> or is that deprecated?
> 

Just rip it out, like I did for the i386.

	-hpa


^ permalink raw reply	[flat|nested] 311+ messages in thread


* Re: bit fields && data tearing
  2014-09-05 20:01                         ` Peter Hurley
  (?)
@ 2014-09-05 20:19                           ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05 20:19 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Peter Zijlstra, Michael Cree, H. Peter Anvin,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On Fri, Sep 05, 2014 at 04:01:35PM -0400, Peter Hurley wrote:
> On 09/05/2014 03:52 PM, Peter Zijlstra wrote:
> > On Fri, Sep 05, 2014 at 11:31:09AM -0700, Paul E. McKenney wrote:
> >> compiler: Allow 1- and 2-byte smp_load_acquire() and smp_store_release()
> >>
> >> CPUs without single-byte and double-byte loads and stores place some
> >> "interesting" requirements on concurrent code.  For example (adapted
> >> from Peter Hurley's test code), suppose we have the following structure:
> >>     
> >>     	struct foo {
> >>     		spinlock_t lock1;
> >>     		spinlock_t lock2;
> >>     		char a; /* Protected by lock1. */
> >>     		char b; /* Protected by lock2. */
> >>     	};
> >>     	struct foo *foop;
> >>     
> >> Of course, it is common (and good) practice to place data protected
> >> by different locks in separate cache lines.  However, if the locks are
> >> rarely acquired (for example, only in rare error cases), and there are
> >> a great many instances of the data structure, then memory footprint can
> >> trump false-sharing concerns, so that it can be better to place them in
> >> the same cache line as above.
> >>
> >> But if the CPU does not support single-byte loads and stores, a store
> >> to foop->a will do a non-atomic read-modify-write operation on foop->b,
> >> which will come as a nasty surprise to someone holding foop->lock2.  So we
> >> now require CPUs to support single-byte and double-byte loads and stores.
> >> Therefore, this commit adjusts the definition of __native_word() to allow
> >> these sizes to be used by smp_load_acquire() and smp_store_release().
> > 
> > So does this patch depend on a patch that removes pre-EV56 Alpha
> > support? I'm all for removing that, but I need to see the patch merged
> > before we can do this.
> 
> I'm working on that but Alpha's Kconfig is not quite straightforward.
> 
> 
> ... and I'm wondering if I should _remove_ pre-EV56 configurations or
> move the default choice and produce a warning about unsupported Alpha
> CPUs instead?

I suspect that either would work, given that the Alpha community is
pretty close-knit.  Just setting the appropriate flag to make the
compiler generate one-byte and two-byte loads and stores would
probably suffice.  ;-)

							Thanx, Paul

> Regards,
> Peter Hurley
> 
> [ How does one do a red popup in kbuild?
>   The 'comment' approach is too subtle.
> ]
> 
> 
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread


* Re: bit fields && data tearing
  2014-09-05 20:14                           ` Peter Hurley
  (?)
@ 2014-09-05 20:34                             ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-05 20:34 UTC (permalink / raw)
  To: Peter Hurley, Marc Gauthier, paulmck
  Cc: Michael Cree, Benjamin Herrenschmidt, David Laight,
	Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Oleg Nesterov,
	linux-kernel, Paul Mackerras, linuxppc-dev, Miroslav Franc,
	Richard Henderson, linux-alpha

On 09/05/2014 01:14 PM, Peter Hurley wrote:
> 
> Here's how I read the two statements.
> 
> First, the commit message:
> 
> "It [this commit] documents that CPUs [supported by the Linux kernel]
> _must provide_ atomic one-byte and two-byte naturally aligned loads and stores."
> 
> Second, in the body of the document:
> 
> "The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
> older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."
> 

Does this apply in general or only to SMP configurations?  I guess
non-SMP configurations would still have problems if interrupted in the
wrong place...

	-hpa


^ permalink raw reply	[flat|nested] 311+ messages in thread


* Re: bit fields && data tearing
  2014-09-05 20:14                           ` Peter Hurley
  (?)
@ 2014-09-05 20:39                             ` Michael Cree
  -1 siblings, 0 replies; 311+ messages in thread
From: Michael Cree @ 2014-09-05 20:39 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Marc Gauthier, paulmck, H. Peter Anvin, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On Fri, Sep 05, 2014 at 04:14:48PM -0400, Peter Hurley wrote:
> Second, in the body of the document:
> 
> "The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
> older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."

Let's be clear here, the pre-EV56 Alpha CPUs do provide atomic
one-byte and two-byte loads and stores; it's just that one must use
locked load and store sequences to achieve atomicity.  The point,
I think, is that the pre-EV56 Alpha CPUs provide non-atomic one-byte
and two-byte loads and stores as the norm, and that is the problem.

Cheers
Michael.

^ permalink raw reply	[flat|nested] 311+ messages in thread


* Re: bit fields && data tearing
  2014-09-05 20:14                           ` Peter Hurley
  (?)
@ 2014-09-05 20:42                             ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05 20:42 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Marc Gauthier, Michael Cree, H. Peter Anvin,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On Fri, Sep 05, 2014 at 04:14:48PM -0400, Peter Hurley wrote:
> On 09/05/2014 03:38 PM, Marc Gauthier wrote:
> > Paul E. McKenney wrote:
> >> On Fri, Sep 05, 2014 at 02:50:31PM -0400, Peter Hurley wrote:
> >>> On 09/05/2014 02:09 PM, Paul E. McKenney wrote:
> >>>> This commit documents the fact that it is not safe to use bitfields as
> >>>> shared variables in synchronization algorithms.  It also documents that
> >>>> CPUs must provide one-byte and two-byte load and store instructions
> >>>                    ^
> >>>                 atomic
> >>
> >> Here you meant non-atomic?  My guess is that you are referring to the
> >> fact that you could emulate a one-byte store on pre-EV56 Alpha CPUs
> >> using the ll and sc atomic-read-modify-write instructions, correct?
> >>
> >>>> in order to be supported by the Linux kernel.  (Michael Cree
> >>>> has agreed to the resulting non-support of pre-EV56 Alpha CPUs:
> >>>> https://lkml.org/lkml/2014/9/5/143.
> > [...]
> > 
> >>>> +     and 64-bit systems, respectively.  Note that this means that the
> >>>> +     Linux kernel does not support pre-EV56 Alpha CPUs, because these
> >>>> +     older CPUs do not provide one-byte and two-byte loads and stores.
> >>>                                  ^
> >>>                             non-atomic
> >>
> >> I took this, thank you!
> > 
> > Eum, am I totally lost, or aren't both of these supposed to say "atomic" ?
> > 
> > Can't imagine requiring a CPU to provide non-atomic loads and stores
> > (i.e. requiring old Alpha behavior?).
> 
> Here's how I read the two statements.
> 
> First, the commit message:
> 
> "It [this commit] documents that CPUs [supported by the Linux kernel]
> _must provide_ atomic one-byte and two-byte naturally aligned loads and stores."
> 
> Second, in the body of the document:
> 
> "The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
> older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."

Hmmm...  It is a bit ambiguous.  How about the following?

							Thanx, Paul

------------------------------------------------------------------------

documentation: Record limitations of bitfields and small variables

This commit documents the fact that it is not safe to use bitfields
as shared variables in synchronization algorithms.  It also documents
that CPUs must provide one-byte and two-byte normal load and store
instructions in order to be supported by the Linux kernel.  (Michael
Cree has agreed to the resulting non-support of pre-EV56 Alpha CPUs:
https://lkml.org/lkml/2014/9/5/143.)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 87be0a8a78de..fe4d51b704c5 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -269,6 +269,37 @@ And there are a number of things that _must_ or _must_not_ be assumed:
 	STORE *(A + 4) = Y; STORE *A = X;
 	STORE {*A, *(A + 4) } = {X, Y};
 
+And there are anti-guarantees:
+
+ (*) These guarantees do not apply to bitfields, because compilers often
+     generate code to modify these using non-atomic read-modify-write
+     sequences.  Do not attempt to use bitfields to synchronize parallel
+     algorithms.
+
+ (*) Even in cases where bitfields are protected by locks, all fields
+     in a given bitfield must be protected by one lock.  If two fields
+     in a given bitfield are protected by different locks, the compiler's
+     non-atomic read-modify-write sequences can cause an update to one
+     field to corrupt the value of an adjacent field.
+
+ (*) These guarantees apply only to properly aligned and sized scalar
+     variables.  "Properly sized" currently means variables that are
+     the same size as "char", "short", "int" and "long".  "Properly
+     aligned" means the natural alignment, thus no constraints for
+     "char", two-byte alignment for "short", four-byte alignment for
+     "int", and either four-byte or eight-byte alignment for "long",
+     on 32-bit and 64-bit systems, respectively.  Note that this means
+     that the Linux kernel does not support pre-EV56 Alpha CPUs,
+     because these older CPUs do not provide one-byte and two-byte
+     load and store instructions.  (In theory, the pre-EV56 Alpha CPUs
+     can emulate these instructions using load-linked/store-conditional
+     instructions, but in practice this approach has excessive overhead.
+     Keep in mind that this emulation would be required on -all- single-
+     and double-byte loads and stores in order to handle adjacent bytes
+     protected by different locks.)
+
+     Alpha EV56 and later Alpha CPUs are still supported.
+
 
 =========================
 WHAT ARE MEMORY BARRIERS?


^ permalink raw reply related	[flat|nested] 311+ messages in thread



* Re: bit fields && data tearing
  2014-09-05 20:34                             ` H. Peter Anvin
  (?)
@ 2014-09-05 20:42                               ` Michael Cree
  -1 siblings, 0 replies; 311+ messages in thread
From: Michael Cree @ 2014-09-05 20:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Hurley, Marc Gauthier, paulmck, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On Fri, Sep 05, 2014 at 01:34:52PM -0700, H. Peter Anvin wrote:
> On 09/05/2014 01:14 PM, Peter Hurley wrote:
> > 
> > Here's how I read the two statements.
> > 
> > First, the commit message:
> > 
> > "It [this commit] documents that CPUs [supported by the Linux kernel]
> > _must provide_ atomic one-byte and two-byte naturally aligned loads and stores."
> > 
> > Second, in the body of the document:
> > 
> > "The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
> > older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."
> > 
> 
> Does this apply in general or only to SMP configurations?  I guess
> non-SMP configurations would still have problems if interrupted in the
> wrong place...

Yes.

Cheers
Michael.

^ permalink raw reply	[flat|nested] 311+ messages in thread


* Re: bit fields && data tearing
  2014-09-05 20:34                             ` H. Peter Anvin
  (?)
@ 2014-09-05 20:43                               ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05 20:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Hurley, Marc Gauthier, Michael Cree,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On Fri, Sep 05, 2014 at 01:34:52PM -0700, H. Peter Anvin wrote:
> On 09/05/2014 01:14 PM, Peter Hurley wrote:
> > 
> > Here's how I read the two statements.
> > 
> > First, the commit message:
> > 
> > "It [this commit] documents that CPUs [supported by the Linux kernel]
> > _must provide_ atomic one-byte and two-byte naturally aligned loads and stores."
> > 
> > Second, in the body of the document:
> > 
> > "The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
> > older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."
> > 
> 
> Does this apply in general or only to SMP configurations?  I guess
> non-SMP configurations would still have problems if interrupted in the
> wrong place...

And preemption could cause problems, too.  So I believe that it needs
to be universal.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 311+ messages in thread


* Re: bit fields && data tearing
  2014-09-05 20:43                               ` Paul E. McKenney
  (?)
@ 2014-09-05 20:48                                 ` Thomas Gleixner
  -1 siblings, 0 replies; 311+ messages in thread
From: Thomas Gleixner @ 2014-09-05 20:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: H. Peter Anvin, Peter Hurley, Marc Gauthier, Michael Cree,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On Fri, 5 Sep 2014, Paul E. McKenney wrote:
> On Fri, Sep 05, 2014 at 01:34:52PM -0700, H. Peter Anvin wrote:
> > On 09/05/2014 01:14 PM, Peter Hurley wrote:
> > > 
> > > Here's how I read the two statements.
> > > 
> > > First, the commit message:
> > > 
> > > "It [this commit] documents that CPUs [supported by the Linux kernel]
> > > _must provide_ atomic one-byte and two-byte naturally aligned loads and stores."
> > > 
> > > Second, in the body of the document:
> > > 
> > > "The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
> > > older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."
> > > 
> > 
> > Does this apply in general or only to SMP configurations?  I guess
> > non-SMP configurations would still have problems if interrupted in the
> > wrong place...
> 
> And preemption could cause problems, too.  So I believe that it needs
> to be universal.

Well preemption is usually caused by an interrupt, except you have a
combined load and preempt instruction :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 311+ messages in thread


* Re: bit fields && data tearing
  2014-09-05 20:48                                 ` Thomas Gleixner
  (?)
@ 2014-09-05 21:05                                   ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-05 21:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: H. Peter Anvin, Peter Hurley, Marc Gauthier, Michael Cree,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On Fri, Sep 05, 2014 at 10:48:34PM +0200, Thomas Gleixner wrote:
> On Fri, 5 Sep 2014, Paul E. McKenney wrote:
> > On Fri, Sep 05, 2014 at 01:34:52PM -0700, H. Peter Anvin wrote:
> > > On 09/05/2014 01:14 PM, Peter Hurley wrote:
> > > > 
> > > > Here's how I read the two statements.
> > > > 
> > > > First, the commit message:
> > > > 
> > > > "It [this commit] documents that CPUs [supported by the Linux kernel]
> > > > _must provide_ atomic one-byte and two-byte naturally aligned loads and stores."
> > > > 
> > > > Second, in the body of the document:
> > > > 
> > > > "The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
> > > > older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."
> > > > 
> > > 
> > > Does this apply in general or only to SMP configurations?  I guess
> > > non-SMP configurations would still have problems if interrupted in the
> > > wrong place...
> > 
> > And preemption could cause problems, too.  So I believe that it needs
> > to be universal.
> 
> Well preemption is usually caused by an interrupt, except you have a
> combined load and preempt instruction :)

Fair point!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 311+ messages in thread


* Re: bit fields && data tearing
  2014-09-05 20:39                             ` Michael Cree
@ 2014-09-05 21:12                               ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-05 21:12 UTC (permalink / raw)
  To: Michael Cree, Marc Gauthier, paulmck, H. Peter Anvin,
	Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On 09/05/2014 04:39 PM, Michael Cree wrote:
> On Fri, Sep 05, 2014 at 04:14:48PM -0400, Peter Hurley wrote:
>> Second, in the body of the document:
>>
>> "The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
>> older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."
> 
> Let's be clear here, the pre-EV56 Alpha CPUs do provide an atomic
> one-byte and two-byte load and store; it's just that one must use
> locked load and store sequences to achieve atomicity.  The point,
> I think, is that the pre-EV56 Alpha CPUs provide non-atomic one-byte
> and two-byte load and stores as the norm, and that is the problem.

I'm all for an Alpha expert to jump in here and meet the criteria;
which is that byte stores cannot corrupt adjacent storage (nor can
aligned short stores).

To my mind, a quick look at Documentation/circular-buffers.txt will
pretty much convince anyone that trying to differentiate by execution
context is undoable.

If someone wants to make Alphas do cmpxchg loops for every byte store,
then ok. Or any other solution that doesn't require subsystem code
changes.

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 311+ messages in thread


* Re: bit fields && data tearing
  2014-09-05 21:12                               ` Peter Hurley
  (?)
@ 2014-09-05 21:27                                 ` Michael Cree
  -1 siblings, 0 replies; 311+ messages in thread
From: Michael Cree @ 2014-09-05 21:27 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Marc Gauthier, paulmck, H. Peter Anvin, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On Fri, Sep 05, 2014 at 05:12:28PM -0400, Peter Hurley wrote:
> On 09/05/2014 04:39 PM, Michael Cree wrote:
> > On Fri, Sep 05, 2014 at 04:14:48PM -0400, Peter Hurley wrote:
> >> Second, in the body of the document:
> >>
> >> "The Linux kernel no longer supports pre-EV56 Alpha CPUs, because these
> >> older CPUs _do not provide_ atomic one-byte and two-byte loads and stores."
> > 
> > Let's be clear here, the pre-EV56 Alpha CPUs do provide an atomic
> > one-byte and two-byte load and store; it's just that one must use
> > locked load and store sequences to achieve atomicity.  The point,
> > I think, is that the pre-EV56 Alpha CPUs provide non-atomic one-byte
> > and two-byte load and stores as the norm, and that is the problem.
> 
> I'm all for an Alpha expert to jump in here and meet the criteria;
> which is that byte stores cannot corrupt adjacent storage (nor can
> aligned short stores).
> 
> To my mind, a quick look at Documentation/circular-buffers.txt will
> pretty much convince anyone that trying to differentiate by execution
> context is undoable.
> 
> If someone wants to make Alphas do cmpxchg loops for every byte store,
> then ok. Or any other solution that doesn't require subsystem code
> changes.

I am not suggesting that anyone do that work.  I'm certainly not going
to do it.

All I was pointing out is that the claim that "_do not provide_" made
above with emphasis is, strictly interpreted, not true, thus should not
be committed to the documentation without further clarification.

Cheers
Michael.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05  4:06                         ` Paul E. McKenney
  (?)
@ 2014-09-07  5:07                           ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-07  5:07 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Hurley, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
> On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
> > Hi James,
> > 
> > On 09/04/2014 10:11 PM, James Bottomley wrote:
> > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> > >> +And there are anti-guarantees:
> > >> +
> > >> + (*) These guarantees do not apply to bitfields, because compilers often
> > >> +     generate code to modify these using non-atomic read-modify-write
> > >> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> > >> +     algorithms.
> > >> +
> > >> + (*) Even in cases where bitfields are protected by locks, all fields
> > >> +     in a given bitfield must be protected by one lock.  If two fields
> > >> +     in a given bitfield are protected by different locks, the compiler's
> > >> +     non-atomic read-modify-write sequences can cause an update to one
> > >> +     field to corrupt the value of an adjacent field.
> > >> +
> > >> + (*) These guarantees apply only to properly aligned and sized scalar
> > >> +     variables.  "Properly sized" currently means "int" and "long",
> > >> +     because some CPU families do not support loads and stores of
> > >> +     other sizes.  ("Some CPU families" is currently believed to
> > >> +     be only Alpha 21064.  If this is actually the case, a different
> > >> +     non-guarantee is likely to be formulated.)
> > > 
> > > This is a bit unclear.  Presumably you're talking about definiteness of
> > > the outcome (as in what's seen after multiple stores to the same
> > > variable).
> > 
> > No, the last condition refers to adjacent byte stores from different
> > cpu contexts (either interrupt or SMP).
> > 
> > > The guarantees are only for natural width on Parisc as well,
> > > so you would get a mess if you did byte stores to adjacent memory
> > > locations.
> > 
> > For a simple test like:
> > 
> > struct x {
> > 	long a;
> > 	char b;
> > 	char c;
> > 	char d;
> > 	char e;
> > };
> > 
> > void store_bc(struct x *p) {
> > 	p->b = 1;
> > 	p->c = 2;
> > }
> > 
> > on parisc, gcc generates separate byte stores
> > 
> > void store_bc(struct x *p) {
> >    0:	34 1c 00 02 	ldi 1,ret0
> >    4:	0f 5c 12 08 	stb ret0,4(r26)
> >    8:	34 1c 00 04 	ldi 2,ret0
> >    c:	e8 40 c0 00 	bv r0(rp)
> >   10:	0f 5c 12 0a 	stb ret0,5(r26)
> > 
> > which appears to confirm that on parisc adjacent byte data
> > is safe from corruption by concurrent cpu updates; that is,
> > 
> > CPU 0                | CPU 1
> >                      |
> > p->b = 1             | p->c = 2
> >                      |
> > 
> > will result in p->b == 1 && p->c == 2 (assume both values
> > were 0 before the call to store_bc()).
> 
> What Peter said.  I would ask for suggestions for better wording, but
> I would much rather be able to say that single-byte reads and writes
> are atomic and that aligned-short reads and writes are also atomic.
> 
> Thus far, it looks like we lose only very old Alpha systems, so unless
> I hear otherwise, I update my patch to outlaw these very old systems.

This isn't universally true according to the architecture manual.  The
PARISC CPU can make byte to long word stores atomic against the memory
bus but not against the I/O bus for instance.  Atomicity is a property
of the underlying substrate, not of the CPU.  Implying that atomicity is
a CPU property is incorrect.

James
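Paul's anti-guarantees quoted above can be made concrete with a short sketch; the field name echoes Oleg's original reproducer, and the locks are only described in comments, not shown:

```c
/* Sketch of the bitfield anti-guarantee: both fields share one
 * storage unit, so the compiler updates each with a non-atomic
 * load/modify/store of the whole unit. */
struct flags {
	unsigned int freeze_stop : 1;	/* imagine: protected by lock A */
	unsigned int other_flag  : 1;	/* imagine: protected by lock B */
};

/* Each of these compiles to: load unit, modify bit, store unit.
 * If the two run concurrently, one full-unit store can overwrite
 * the other's update, even though the fields look like distinct
 * variables and are "protected" by distinct locks. */
void set_freeze_stop(struct flags *f)
{
	f->freeze_stop = 1;
}

void set_other_flag(struct flags *f)
{
	f->other_flag = 1;
}
```

Single-threaded, both updates of course survive; the hazard is only in the concurrent interleaving.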



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-07  5:07                           ` James Bottomley
  (?)
@ 2014-09-07 16:21                             ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-07 16:21 UTC (permalink / raw)
  To: James Bottomley
  Cc: Peter Hurley, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
> On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
> > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
> > > Hi James,
> > > 
> > > On 09/04/2014 10:11 PM, James Bottomley wrote:
> > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> > > >> +And there are anti-guarantees:
> > > >> +
> > > >> + (*) These guarantees do not apply to bitfields, because compilers often
> > > >> +     generate code to modify these using non-atomic read-modify-write
> > > >> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> > > >> +     algorithms.
> > > >> +
> > > >> + (*) Even in cases where bitfields are protected by locks, all fields
> > > >> +     in a given bitfield must be protected by one lock.  If two fields
> > > >> +     in a given bitfield are protected by different locks, the compiler's
> > > >> +     non-atomic read-modify-write sequences can cause an update to one
> > > >> +     field to corrupt the value of an adjacent field.
> > > >> +
> > > >> + (*) These guarantees apply only to properly aligned and sized scalar
> > > >> +     variables.  "Properly sized" currently means "int" and "long",
> > > >> +     because some CPU families do not support loads and stores of
> > > >> +     other sizes.  ("Some CPU families" is currently believed to
> > > >> +     be only Alpha 21064.  If this is actually the case, a different
> > > >> +     non-guarantee is likely to be formulated.)
> > > > 
> > > > This is a bit unclear.  Presumably you're talking about definiteness of
> > > > the outcome (as in what's seen after multiple stores to the same
> > > > variable).
> > > 
> > > No, the last condition refers to adjacent byte stores from different
> > > cpu contexts (either interrupt or SMP).
> > > 
> > > > The guarantees are only for natural width on Parisc as well,
> > > > so you would get a mess if you did byte stores to adjacent memory
> > > > locations.
> > > 
> > > For a simple test like:
> > > 
> > > struct x {
> > > 	long a;
> > > 	char b;
> > > 	char c;
> > > 	char d;
> > > 	char e;
> > > };
> > > 
> > > void store_bc(struct x *p) {
> > > 	p->b = 1;
> > > 	p->c = 2;
> > > }
> > > 
> > > on parisc, gcc generates separate byte stores
> > > 
> > > void store_bc(struct x *p) {
> > >    0:	34 1c 00 02 	ldi 1,ret0
> > >    4:	0f 5c 12 08 	stb ret0,4(r26)
> > >    8:	34 1c 00 04 	ldi 2,ret0
> > >    c:	e8 40 c0 00 	bv r0(rp)
> > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
> > > 
> > > which appears to confirm that on parisc adjacent byte data
> > > is safe from corruption by concurrent cpu updates; that is,
> > > 
> > > CPU 0                | CPU 1
> > >                      |
> > > p->b = 1             | p->c = 2
> > >                      |
> > > 
> > > will result in p->b == 1 && p->c == 2 (assume both values
> > > were 0 before the call to store_bc()).
> > 
> > What Peter said.  I would ask for suggestions for better wording, but
> > I would much rather be able to say that single-byte reads and writes
> > are atomic and that aligned-short reads and writes are also atomic.
> > 
> > Thus far, it looks like we lose only very old Alpha systems, so unless
> > I hear otherwise, I update my patch to outlaw these very old systems.
> 
> This isn't universally true according to the architecture manual.  The
> PARISC CPU can make byte to long word stores atomic against the memory
> bus but not against the I/O bus for instance.  Atomicity is a property
> of the underlying substrate, not of the CPU.  Implying that atomicity is
> a CPU property is incorrect.

OK, fair point.

But are there in-use-for-Linux PARISC memory fabrics (for normal memory,
not I/O) that do not support single-byte and double-byte stores?

							Thanx, Paul
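The CPU 0 / CPU 1 table quoted above can be turned into a runnable two-thread check. This is a sketch using POSIX threads; a single run proves nothing about atomicity, but on hardware without atomic byte stores a loop around it would eventually show a clobbered neighbour.

```c
#include <pthread.h>
#include <stddef.h>

/* Two threads store to adjacent bytes, mirroring the CPU 0 / CPU 1
 * table in the thread.  On hardware with atomic byte stores neither
 * update may clobber its neighbour. */
static struct {
	char b;
	char c;
} shared;

static void *set_b(void *arg)
{
	shared.b = 1;
	return NULL;
}

static void *set_c(void *arg)
{
	shared.c = 2;
	return NULL;
}

/* Returns 1 when neither byte store corrupted the adjacent byte. */
int adjacent_store_test(void)
{
	pthread_t t1, t2;

	shared.b = 0;
	shared.c = 0;
	pthread_create(&t1, NULL, set_b, NULL);
	pthread_create(&t2, NULL, set_c, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return shared.b == 1 && shared.c == 2;
}
```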


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-07 16:21                             ` Paul E. McKenney
  (?)
@ 2014-09-07 19:04                               ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-07 19:04 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Hurley, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
> On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
> > On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
> > > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
> > > > Hi James,
> > > > 
> > > > On 09/04/2014 10:11 PM, James Bottomley wrote:
> > > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> > > > >> +And there are anti-guarantees:
> > > > >> +
> > > > >> + (*) These guarantees do not apply to bitfields, because compilers often
> > > > >> +     generate code to modify these using non-atomic read-modify-write
> > > > >> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> > > > >> +     algorithms.
> > > > >> +
> > > > >> + (*) Even in cases where bitfields are protected by locks, all fields
> > > > >> +     in a given bitfield must be protected by one lock.  If two fields
> > > > >> +     in a given bitfield are protected by different locks, the compiler's
> > > > >> +     non-atomic read-modify-write sequences can cause an update to one
> > > > >> +     field to corrupt the value of an adjacent field.
> > > > >> +
> > > > >> + (*) These guarantees apply only to properly aligned and sized scalar
> > > > >> +     variables.  "Properly sized" currently means "int" and "long",
> > > > >> +     because some CPU families do not support loads and stores of
> > > > >> +     other sizes.  ("Some CPU families" is currently believed to
> > > > >> +     be only Alpha 21064.  If this is actually the case, a different
> > > > >> +     non-guarantee is likely to be formulated.)
> > > > > 
> > > > > This is a bit unclear.  Presumably you're talking about definiteness of
> > > > > the outcome (as in what's seen after multiple stores to the same
> > > > > variable).
> > > > 
> > > > No, the last condition refers to adjacent byte stores from different
> > > > cpu contexts (either interrupt or SMP).
> > > > 
> > > > > The guarantees are only for natural width on Parisc as well,
> > > > > so you would get a mess if you did byte stores to adjacent memory
> > > > > locations.
> > > > 
> > > > For a simple test like:
> > > > 
> > > > struct x {
> > > > 	long a;
> > > > 	char b;
> > > > 	char c;
> > > > 	char d;
> > > > 	char e;
> > > > };
> > > > 
> > > > void store_bc(struct x *p) {
> > > > 	p->b = 1;
> > > > 	p->c = 2;
> > > > }
> > > > 
> > > > on parisc, gcc generates separate byte stores
> > > > 
> > > > void store_bc(struct x *p) {
> > > >    0:	34 1c 00 02 	ldi 1,ret0
> > > >    4:	0f 5c 12 08 	stb ret0,4(r26)
> > > >    8:	34 1c 00 04 	ldi 2,ret0
> > > >    c:	e8 40 c0 00 	bv r0(rp)
> > > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
> > > > 
> > > > which appears to confirm that on parisc adjacent byte data
> > > > is safe from corruption by concurrent cpu updates; that is,
> > > > 
> > > > CPU 0                | CPU 1
> > > >                      |
> > > > p->b = 1             | p->c = 2
> > > >                      |
> > > > 
> > > > will result in p->b == 1 && p->c == 2 (assume both values
> > > > were 0 before the call to store_bc()).
> > > 
> > > What Peter said.  I would ask for suggestions for better wording, but
> > > I would much rather be able to say that single-byte reads and writes
> > > are atomic and that aligned-short reads and writes are also atomic.
> > > 
> > > Thus far, it looks like we lose only very old Alpha systems, so unless
> > > I hear otherwise, I update my patch to outlaw these very old systems.
> > 
> > This isn't universally true according to the architecture manual.  The
> > PARISC CPU can make byte to long word stores atomic against the memory
> > bus but not against the I/O bus for instance.  Atomicity is a property
> > of the underlying substrate, not of the CPU.  Implying that atomicity is
> > a CPU property is incorrect.
> 
> OK, fair point.
> 
> But are there in-use-for-Linux PARISC memory fabrics (for normal memory,
> not I/O) that do not support single-byte and double-byte stores?

For aligned access, I believe that's always the case for the memory bus
(on both 32 and 64 bit systems).  However, it only applies to machine
instruction loads and stores of the same width.  If you mix the widths
on the loads and stores, all bets are off.  That means you have to
beware of the gcc penchant for coalescing loads and stores: if it sees
two adjacent byte stores it can coalesce them into a short store
instead ... that screws up the atomicity guarantees.

James



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-07 19:04                               ` James Bottomley
  0 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-07 19:04 UTC (permalink / raw)
  To: paulmck
  Cc: Jakub Jelinek, One Thousand Gnomes, linux-arch, linux-ia64,
	Peter Hurley, Mikael Pettersson, Oleg Nesterov, linux-kernel,
	Tony Luck, Paul Mackerras, H. Peter Anvin, linuxppc-dev,
	Miroslav Franc, Richard Henderson

On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
> On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
> > On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
> > > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
> > > > Hi James,
> > > > 
> > > > On 09/04/2014 10:11 PM, James Bottomley wrote:
> > > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> > > > >> +And there are anti-guarantees:
> > > > >> +
> > > > >> + (*) These guarantees do not apply to bitfields, because compilers often
> > > > >> +     generate code to modify these using non-atomic read-modify-write
> > > > >> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> > > > >> +     algorithms.
> > > > >> +
> > > > >> + (*) Even in cases where bitfields are protected by locks, all fields
> > > > >> +     in a given bitfield must be protected by one lock.  If two fields
> > > > >> +     in a given bitfield are protected by different locks, the compiler's
> > > > >> +     non-atomic read-modify-write sequences can cause an update to one
> > > > >> +     field to corrupt the value of an adjacent field.
> > > > >> +
> > > > >> + (*) These guarantees apply only to properly aligned and sized scalar
> > > > >> +     variables.  "Properly sized" currently means "int" and "long",
> > > > >> +     because some CPU families do not support loads and stores of
> > > > >> +     other sizes.  ("Some CPU families" is currently believed to
> > > > >> +     be only Alpha 21064.  If this is actually the case, a different
> > > > >> +     non-guarantee is likely to be formulated.)
> > > > > 
> > > > > This is a bit unclear.  Presumably you're talking about definiteness of
> > > > > the outcome (as in what's seen after multiple stores to the same
> > > > > variable).
> > > > 
> > > > No, the last condition refers to adjacent byte stores from different
> > > > cpu contexts (either interrupt or SMP).
> > > > 
> > > > > The guarantees are only for natural width on Parisc as well,
> > > > > so you would get a mess if you did byte stores to adjacent memory
> > > > > locations.
> > > > 
> > > > For a simple test like:
> > > > 
> > > > struct x {
> > > > 	long a;
> > > > 	char b;
> > > > 	char c;
> > > > 	char d;
> > > > 	char e;
> > > > };
> > > > 
> > > > void store_bc(struct x *p) {
> > > > 	p->b = 1;
> > > > 	p->c = 2;
> > > > }
> > > > 
> > > > on parisc, gcc generates separate byte stores
> > > > 
> > > > void store_bc(struct x *p) {
> > > >    0:	34 1c 00 02 	ldi 1,ret0
> > > >    4:	0f 5c 12 08 	stb ret0,4(r26)
> > > >    8:	34 1c 00 04 	ldi 2,ret0
> > > >    c:	e8 40 c0 00 	bv r0(rp)
> > > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
> > > > 
> > > > which appears to confirm that on parisc adjacent byte data
> > > > is safe from corruption by concurrent cpu updates; that is,
> > > > 
> > > > CPU 0                | CPU 1
> > > >                      |
> > > > p->b = 1             | p->c = 2
> > > >                      |
> > > > 
> > > > will result in p->b == 1 && p->c == 2 (assume both values
> > > > were 0 before the call to store_bc()).
> > > 
> > > What Peter said.  I would ask for suggestions for better wording, but
> > > I would much rather be able to say that single-byte reads and writes
> > > are atomic and that aligned-short reads and writes are also atomic.
> > > 
> > > Thus far, it looks like we lose only very old Alpha systems, so unless
> > > I hear otherwise, I update my patch to outlaw these very old systems.
> > 
> > This isn't universally true according to the architecture manual.  The
> > PARISC CPU can make byte to long word stores atomic against the memory
> > bus but not against the I/O bus for instance.  Atomicity is a property
> > of the underlying substrate, not of the CPU.  Implying that atomicity is
> > a CPU property is incorrect.
> 
> OK, fair point.
> 
> But are there in-use-for-Linux PARISC memory fabrics (for normal memory,
> not I/O) that do not support single-byte and double-byte stores?

For aligned access, I believe that's always the case for the memory bus
(on both 32 and 64 bit systems).  However, it only applies to machine
instruction loads and stores of the same width.  If you mix the widths
on the loads and stores, all bets are off.  That means you have to
beware of the gcc penchant for coalescing loads and stores: if it sees
two adjacent byte stores it can coalesce them into a short store
instead ... that screws up the atomicity guarantees.
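[Editorial note: the coalescing hazard above can be sketched in plain C. The kernel's defense is to force accesses of exactly the intended width through a volatile cast, which is essentially what ACCESS_ONCE()/WRITE_ONCE() expand to. This is an illustrative sketch, not kernel code; the struct layout and the helper name store_bc_once() are invented for the example.]

```c
#include <assert.h>
#include <string.h>

struct x {
	long a;
	char b;
	char c;
};

/*
 * Sketch: a volatile-qualified access obliges the compiler to emit a
 * store of exactly this width, so the two adjacent byte stores below
 * may not legally be coalesced into a single halfword store.  This is
 * the core trick behind the kernel's WRITE_ONCE(); the function name
 * here is made up for illustration.
 */
static void store_bc_once(struct x *p)
{
	*(volatile char *)&p->b = 1;
	*(volatile char *)&p->c = 2;
}
```

Single-threaded the result is indistinguishable from plain stores; the difference is only in which machine instructions the compiler is permitted to emit, which is exactly what matters for the concurrent-update scenario above.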

James

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-07 19:04                               ` James Bottomley
  (?)
@ 2014-09-07 20:41                                 ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-07 20:41 UTC (permalink / raw)
  To: James Bottomley, paulmck
  Cc: H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On 09/07/2014 03:04 PM, James Bottomley wrote:
> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
>> On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
>>> On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
>>>> On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
>>>>> Hi James,
>>>>>
>>>>> On 09/04/2014 10:11 PM, James Bottomley wrote:
>>>>>> On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
>>>>>>> +And there are anti-guarantees:
>>>>>>> +
>>>>>>> + (*) These guarantees do not apply to bitfields, because compilers often
>>>>>>> +     generate code to modify these using non-atomic read-modify-write
>>>>>>> +     sequences.  Do not attempt to use bitfields to synchronize parallel
>>>>>>> +     algorithms.
>>>>>>> +
>>>>>>> + (*) Even in cases where bitfields are protected by locks, all fields
>>>>>>> +     in a given bitfield must be protected by one lock.  If two fields
>>>>>>> +     in a given bitfield are protected by different locks, the compiler's
>>>>>>> +     non-atomic read-modify-write sequences can cause an update to one
>>>>>>> +     field to corrupt the value of an adjacent field.
>>>>>>> +
>>>>>>> + (*) These guarantees apply only to properly aligned and sized scalar
>>>>>>> +     variables.  "Properly sized" currently means "int" and "long",
>>>>>>> +     because some CPU families do not support loads and stores of
>>>>>>> +     other sizes.  ("Some CPU families" is currently believed to
>>>>>>> +     be only Alpha 21064.  If this is actually the case, a different
>>>>>>> +     non-guarantee is likely to be formulated.)
>>>>>>
>>>>>> This is a bit unclear.  Presumably you're talking about definiteness of
>>>>>> the outcome (as in what's seen after multiple stores to the same
>>>>>> variable).
>>>>>
>>>>> No, the last condition refers to adjacent byte stores from different
>>>>> cpu contexts (either interrupt or SMP).
>>>>>
>>>>>> The guarantees are only for natural width on Parisc as well,
>>>>>> so you would get a mess if you did byte stores to adjacent memory
>>>>>> locations.
>>>>>
>>>>> For a simple test like:
>>>>>
>>>>> struct x {
>>>>> 	long a;
>>>>> 	char b;
>>>>> 	char c;
>>>>> 	char d;
>>>>> 	char e;
>>>>> };
>>>>>
>>>>> void store_bc(struct x *p) {
>>>>> 	p->b = 1;
>>>>> 	p->c = 2;
>>>>> }
>>>>>
>>>>> on parisc, gcc generates separate byte stores
>>>>>
>>>>> void store_bc(struct x *p) {
>>>>>    0:	34 1c 00 02 	ldi 1,ret0
>>>>>    4:	0f 5c 12 08 	stb ret0,4(r26)
>>>>>    8:	34 1c 00 04 	ldi 2,ret0
>>>>>    c:	e8 40 c0 00 	bv r0(rp)
>>>>>   10:	0f 5c 12 0a 	stb ret0,5(r26)
>>>>>
>>>>> which appears to confirm that on parisc adjacent byte data
>>>>> is safe from corruption by concurrent cpu updates; that is,
>>>>>
>>>>> CPU 0                | CPU 1
>>>>>                      |
>>>>> p->b = 1             | p->c = 2
>>>>>                      |
>>>>>
>>>>> will result in p->b == 1 && p->c == 2 (assume both values
>>>>> were 0 before the call to store_bc()).
>>>>
>>>> What Peter said.  I would ask for suggestions for better wording, but
>>>> I would much rather be able to say that single-byte reads and writes
>>>> are atomic and that aligned-short reads and writes are also atomic.
>>>>
>>>> Thus far, it looks like we lose only very old Alpha systems, so unless
>>>> I hear otherwise, I update my patch to outlaw these very old systems.
>>>
>>> This isn't universally true according to the architecture manual.  The
>>> PARISC CPU can make byte to long word stores atomic against the memory
>>> bus but not against the I/O bus for instance.  Atomicity is a property
>>> of the underlying substrate, not of the CPU.  Implying that atomicity is
>>> a CPU property is incorrect.

To go back to this briefly, while it's true that atomicity is not
a CPU property, atomicity is not possible if the CPU is not cooperating.

>> OK, fair point.
>>
>> But are there in-use-for-Linux PARISC memory fabrics (for normal memory,
>> not I/O) that do not support single-byte and double-byte stores?
> 
> For aligned access, I believe that's always the case for the memory bus
> (on both 32 and 64 bit systems).  However, it only applies to machine
> instruction loads and stores of the same width.  If you mix the widths
> on the loads and stores, all bets are off.  That means you have to
> beware of the gcc penchant for coalescing loads and stores: if it sees
> two adjacent byte stores it can coalesce them into a short store
> instead ... that screws up the atomicity guarantees.

Two things: I think that gcc has given up on combining adjacent writes,
perhaps because unaligned writes on some arches are prohibitive, so
whatever minor optimization was believed to be gained was quickly lost,
multi-fold. (Although this might not be the reason since one would
expect that gcc would know the alignment of the promoted store).

But additionally, even if gcc combines adjacent writes _that are part
of the program flow_ then I believe the situation is no worse than
would otherwise exist.

For instance, given the following:

struct x {
	spinlock_t lock;
	long a;
	char b;
	char c;
};

void locked_store_b(struct x *p)
{
	spin_lock(&p->lock);
	p->b = 1;
	spin_unlock(&p->lock);
	p->c = 2;
}

Granted, the author probably expects ordered writes of
	STORE B
	STORE C
but that's not guaranteed because there is no memory barrier
ordering B before C.

I see no problem with gcc write-combining in the absence of
memory barriers (as long as alignment is still respected,
ie., the size-promoted write is still naturally aligned). The
combined write is still atomic.
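[Editorial note: for readers without the kernel headers at hand, the missing ordering can be sketched with C11 atomics. A release fence between the two stores (the portable analogue of the kernel's smp_wmb()) is what would order STORE B before STORE C for a concurrent acquire-side reader. This is a hedged sketch under that analogy, not the kernel API; the variable and function names are invented.]

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic char b, c;

/*
 * Sketch with C11 atomics (analogous to, but not, the kernel API):
 * the release fence orders the store to b before the store to c, as
 * observed by a reader that loads c and then issues an acquire fence
 * before loading b.  Without the fence, relaxed stores carry no
 * ordering guarantee, matching the situation described above.
 */
void ordered_store_bc(void)
{
	atomic_store_explicit(&b, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* like smp_wmb() */
	atomic_store_explicit(&c, 2, memory_order_relaxed);
}
```

A second thread would pair this with atomic_thread_fence(memory_order_acquire) between its loads of c and b to be guaranteed that seeing c == 2 implies b == 1.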

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-07 19:04                               ` James Bottomley
  (?)
@ 2014-09-07 23:00                                 ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-07 23:00 UTC (permalink / raw)
  To: James Bottomley
  Cc: Peter Hurley, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Sun, Sep 07, 2014 at 12:04:47PM -0700, James Bottomley wrote:
> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
> > On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
> > > On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
> > > > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
> > > > > Hi James,
> > > > > 
> > > > > On 09/04/2014 10:11 PM, James Bottomley wrote:
> > > > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> > > > > >> +And there are anti-guarantees:
> > > > > >> +
> > > > > >> + (*) These guarantees do not apply to bitfields, because compilers often
> > > > > >> +     generate code to modify these using non-atomic read-modify-write
> > > > > >> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> > > > > >> +     algorithms.
> > > > > >> +
> > > > > >> + (*) Even in cases where bitfields are protected by locks, all fields
> > > > > >> +     in a given bitfield must be protected by one lock.  If two fields
> > > > > >> +     in a given bitfield are protected by different locks, the compiler's
> > > > > >> +     non-atomic read-modify-write sequences can cause an update to one
> > > > > >> +     field to corrupt the value of an adjacent field.
> > > > > >> +
> > > > > >> + (*) These guarantees apply only to properly aligned and sized scalar
> > > > > >> +     variables.  "Properly sized" currently means "int" and "long",
> > > > > >> +     because some CPU families do not support loads and stores of
> > > > > >> +     other sizes.  ("Some CPU families" is currently believed to
> > > > > >> +     be only Alpha 21064.  If this is actually the case, a different
> > > > > >> +     non-guarantee is likely to be formulated.)
> > > > > > 
> > > > > > This is a bit unclear.  Presumably you're talking about definiteness of
> > > > > > the outcome (as in what's seen after multiple stores to the same
> > > > > > variable).
> > > > > 
> > > > > No, the last condition refers to adjacent byte stores from different
> > > > > cpu contexts (either interrupt or SMP).
> > > > > 
> > > > > > The guarantees are only for natural width on Parisc as well,
> > > > > > so you would get a mess if you did byte stores to adjacent memory
> > > > > > locations.
> > > > > 
> > > > > For a simple test like:
> > > > > 
> > > > > struct x {
> > > > > 	long a;
> > > > > 	char b;
> > > > > 	char c;
> > > > > 	char d;
> > > > > 	char e;
> > > > > };
> > > > > 
> > > > > void store_bc(struct x *p) {
> > > > > 	p->b = 1;
> > > > > 	p->c = 2;
> > > > > }
> > > > > 
> > > > > on parisc, gcc generates separate byte stores
> > > > > 
> > > > > void store_bc(struct x *p) {
> > > > >    0:	34 1c 00 02 	ldi 1,ret0
> > > > >    4:	0f 5c 12 08 	stb ret0,4(r26)
> > > > >    8:	34 1c 00 04 	ldi 2,ret0
> > > > >    c:	e8 40 c0 00 	bv r0(rp)
> > > > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
> > > > > 
> > > > > which appears to confirm that on parisc adjacent byte data
> > > > > is safe from corruption by concurrent cpu updates; that is,
> > > > > 
> > > > > CPU 0                | CPU 1
> > > > >                      |
> > > > > p->b = 1             | p->c = 2
> > > > >                      |
> > > > > 
> > > > > will result in p->b == 1 && p->c == 2 (assume both values
> > > > > were 0 before the call to store_bc()).
> > > > 
> > > > What Peter said.  I would ask for suggestions for better wording, but
> > > > I would much rather be able to say that single-byte reads and writes
> > > > are atomic and that aligned-short reads and writes are also atomic.
> > > > 
> > > > Thus far, it looks like we lose only very old Alpha systems, so unless
> > > > I hear otherwise, I update my patch to outlaw these very old systems.
> > > 
> > > This isn't universally true according to the architecture manual.  The
> > > PARISC CPU can make byte to long word stores atomic against the memory
> > > bus but not against the I/O bus for instance.  Atomicity is a property
> > > of the underlying substrate, not of the CPU.  Implying that atomicity is
> > > a CPU property is incorrect.
> > 
> > OK, fair point.
> > 
> > But are there in-use-for-Linux PARISC memory fabrics (for normal memory,
> > not I/O) that do not support single-byte and double-byte stores?
> 
> For aligned access, I believe that's always the case for the memory bus
> (on both 32 and 64 bit systems).  However, it only applies to machine
> instruction loads and stores of the same width.  If you mix the widths
> on the loads and stores, all bets are off.  That means you have to
> beware of the gcc penchant for coalescing loads and stores: if it sees
> two adjacent byte stores it can coalesce them into a short store
> instead ... that screws up the atomicity guarantees.

OK, that means that to make PARISC work reliably, we need to use
ACCESS_ONCE() for loads and stores that could have racing accesses.
If I understand correctly, this will -not- be needed for code guarded
by locks, even with Peter's examples.

So if we have something like this:

	struct foo {
		char a;
		char b;
	};
	struct foo *fp;

then this code would be bad:

	fp->a = 1;
	fp->b = 2;

The reason is (as you say) that GCC would be happy to store 0x0102
(or vice versa, depending on endianness) to the pair.  We instead
need:

	ACCESS_ONCE(fp->a) = 1;
	ACCESS_ONCE(fp->b) = 2;

However, if the code is protected by locks, no problem:

	struct foo {
		spinlock_t lock_a;
		spinlock_t lock_b;
		char a;
		char b;
	};

Then it is OK to do the following:

	spin_lock(fp->lock_a);
	fp->a = 1;
	spin_unlock(fp->lock_a);
	spin_lock(fp->lock_b);
	fp->b = 1;
	spin_unlock(fp->lock_b);

Or even this, assuming ->lock_a precedes ->lock_b in the locking hierarchy:

	spin_lock(fp->lock_a);
	spin_lock(fp->lock_b);
	fp->a = 1;
	fp->b = 1;
	spin_unlock(fp->lock_a);
	spin_unlock(fp->lock_b);

Here gcc might merge the assignments to fp->a and fp->b, but that is OK
because both locks are held, presumably preventing other assignments or
references to fp->a and fp->b.

On the other hand, if either fp->a or fp->b is referenced outside of its
respective lock, even once, then this last code fragment would still need
ACCESS_ONCE() as follows:

	spin_lock(fp->lock_a);
	spin_lock(fp->lock_b);
	ACCESS_ONCE(fp->a) = 1;
	ACCESS_ONCE(fp->b) = 1;
	spin_unlock(fp->lock_a);
	spin_unlock(fp->lock_b);

Does that cover it?  If so, I will update memory-barriers.txt accordingly.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-07 23:00                                 ` Paul E. McKenney
  (?)
@ 2014-09-07 23:17                                   ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-07 23:17 UTC (permalink / raw)
  To: paulmck, James Bottomley
  Cc: Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

I'm confused why storing 0x0102 would be a problem.  I think gcc does that even on other cpus.

More atomicity can't hurt, can it?

On September 7, 2014 4:00:19 PM PDT, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>On Sun, Sep 07, 2014 at 12:04:47PM -0700, James Bottomley wrote:
>> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
>> > On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
>> > > On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
>> > > > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
>> > > > > Hi James,
>> > > > > 
>> > > > > On 09/04/2014 10:11 PM, James Bottomley wrote:
>> > > > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
>> > > > > >> +And there are anti-guarantees:
>> > > > > >> +
>> > > > > >> + (*) These guarantees do not apply to bitfields, because
>compilers often
>> > > > > >> +     generate code to modify these using non-atomic
>read-modify-write
>> > > > > >> +     sequences.  Do not attempt to use bitfields to
>synchronize parallel
>> > > > > >> +     algorithms.
>> > > > > >> +
>> > > > > >> + (*) Even in cases where bitfields are protected by
>locks, all fields
>> > > > > >> +     in a given bitfield must be protected by one lock. 
>If two fields
>> > > > > >> +     in a given bitfield are protected by different
>locks, the compiler's
>> > > > > >> +     non-atomic read-modify-write sequences can cause an
>update to one
>> > > > > >> +     field to corrupt the value of an adjacent field.
>> > > > > >> +
>> > > > > >> + (*) These guarantees apply only to properly aligned and
>sized scalar
>> > > > > >> +     variables.  "Properly sized" currently means "int"
>and "long",
>> > > > > >> +     because some CPU families do not support loads and
>stores of
>> > > > > >> +     other sizes.  ("Some CPU families" is currently
>believed to
>> > > > > >> +     be only Alpha 21064.  If this is actually the case,
>a different
>> > > > > >> +     non-guarantee is likely to be formulated.)
>> > > > > > 
>> > > > > > This is a bit unclear.  Presumably you're talking about
>definiteness of
>> > > > > > the outcome (as in what's seen after multiple stores to the
>same
>> > > > > > variable).
>> > > > > 
>> > > > > No, the last conditions refers to adjacent byte stores from
>different
>> > > > > cpu contexts (either interrupt or SMP).
>> > > > > 
>> > > > > > The guarantees are only for natural width on Parisc as
>well,
>> > > > > > so you would get a mess if you did byte stores to adjacent
>memory
>> > > > > > locations.
>> > > > > 
>> > > > > For a simple test like:
>> > > > > 
>> > > > > struct x {
>> > > > > 	long a;
>> > > > > 	char b;
>> > > > > 	char c;
>> > > > > 	char d;
>> > > > > 	char e;
>> > > > > };
>> > > > > 
>> > > > > void store_bc(struct x *p) {
>> > > > > 	p->b = 1;
>> > > > > 	p->c = 2;
>> > > > > }
>> > > > > 
>> > > > > on parisc, gcc generates separate byte stores
>> > > > > 
>> > > > > void store_bc(struct x *p) {
>> > > > >    0:	34 1c 00 02 	ldi 1,ret0
>> > > > >    4:	0f 5c 12 08 	stb ret0,4(r26)
>> > > > >    8:	34 1c 00 04 	ldi 2,ret0
>> > > > >    c:	e8 40 c0 00 	bv r0(rp)
>> > > > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
>> > > > > 
>> > > > > which appears to confirm that on parisc adjacent byte data
>> > > > > is safe from corruption by concurrent cpu updates; that is,
>> > > > > 
>> > > > > CPU 0                | CPU 1
>> > > > >                      |
>> > > > > p->b = 1             | p->c = 2
>> > > > >                      |
>> > > > > 
>> > > > > will result in p->b == 1 && p->c == 2 (assume both values
>> > > > > were 0 before the call to store_bc()).
>> > > > 
>> > > > What Peter said.  I would ask for suggestions for better
>wording, but
>> > > > I would much rather be able to say that single-byte reads and
>writes
>> > > > are atomic and that aligned-short reads and writes are also
>atomic.
>> > > > 
>> > > > Thus far, it looks like we lose only very old Alpha systems, so
>unless
>> > > > I hear otherwise, I update my patch to outlaw these very old
>systems.
>> > > 
>> > > This isn't universally true according to the architecture manual.
> The
>> > > PARISC CPU can make byte to long word stores atomic against the
>memory
>> > > bus but not against the I/O bus for instance.  Atomicity is a
>property
>> > > of the underlying substrate, not of the CPU.  Implying that
>atomicity is
>> > > a CPU property is incorrect.
>> > 
>> > OK, fair point.
>> > 
>> > But are there in-use-for-Linux PARISC memory fabrics (for normal
>memory,
>> > not I/O) that do not support single-byte and double-byte stores?
>> 
>> For aligned access, I believe that's always the case for the memory
>bus
>> (on both 32 and 64 bit systems).  However, it only applies to machine
>> instruction loads and stores of the same width..  If you mix the
>widths
>> on the loads and stores, all bets are off.  That means you have to
>> beware of the gcc penchant for coalescing loads and stores: if it
>sees
>> two adjacent byte stores it can coalesce them into a short store
>> instead ... that screws up the atomicity guarantees.
>
>OK, that means that to make PARISC work reliably, we need to use
>ACCESS_ONCE() for loads and stores that could have racing accesses.
>If I understand correctly, this will -not- be needed for code guarded
>by locks, even with Peter's examples.
>
>So if we have something like this:
>
>	struct foo {
>		char a;
>		char b;
>	};
>	struct foo *fp;
>
>then this code would be bad:
>
>	fp->a = 1;
>	fp->b = 2;
>
>The reason is (as you say) that GCC would be happy to store 0x0102
>(or vice versa, depending on endianness) to the pair.  We instead
>need:
>
>	ACCESS_ONCE(fp->a) = 1;
>	ACCESS_ONCE(fp->b) = 2;
>
>However, if the code is protected by locks, no problem:
>
>	struct foo {
>		spinlock_t lock_a;
>		spinlock_t lock_b;
>		char a;
>		char b;
>	};
>
>Then it is OK to do the following:
>
>	spin_lock(fp->lock_a);
>	fp->a = 1;
>	spin_unlock(fp->lock_a);
>	spin_lock(fp->lock_b);
>	fp->b = 1;
>	spin_unlock(fp->lock_b);
>
>Or even this, assuming ->lock_a precedes ->lock_b in the locking
>hierarchy:
>
>	spin_lock(fp->lock_a);
>	spin_lock(fp->lock_b);
>	fp->a = 1;
>	fp->b = 1;
>	spin_unlock(fp->lock_a);
>	spin_unlock(fp->lock_b);
>
>Here gcc might merge the assignments to fp->a and fp->b, but that is OK
>because both locks are held, presumably preventing other assignments or
>references to fp->a and fp->b.
>
>On the other hand, if either fp->a or fp->b are referenced outside of
>their
>respective locks, even once, then this last code fragment would still
>need
>ACCESS_ONCE() as follows:
>
>	spin_lock(fp->lock_a);
>	spin_lock(fp->lock_b);
>	ACCESS_ONCE(fp->a) = 1;
>	ACCESS_ONCE(fp->b) = 1;
>	spin_unlock(fp->lock_a);
>	spin_unlock(fp->lock_b);
>
>Does that cover it?  If so, I will update memory-barriers.txt
>accordingly.
>
>							Thanx, Paul

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-07 23:17                                   ` H. Peter Anvin
  0 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-07 23:17 UTC (permalink / raw)
  To: paulmck, James Bottomley
  Cc: Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

I'm confused why storing 0x0102 would be a problem.  I think gcc does that even on other cpus.

More atomicity can't hurt, can it?

On September 7, 2014 4:00:19 PM PDT, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>On Sun, Sep 07, 2014 at 12:04:47PM -0700, James Bottomley wrote:
>> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
>> > On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
>> > > On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
>> > > > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
>> > > > > Hi James,
>> > > > > 
>> > > > > On 09/04/2014 10:11 PM, James Bottomley wrote:
>> > > > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
>> > > > > >> +And there are anti-guarantees:
>> > > > > >> +
>> > > > > >> + (*) These guarantees do not apply to bitfields, because
>compilers often
>> > > > > >> +     generate code to modify these using non-atomic
>read-modify-write
>> > > > > >> +     sequences.  Do not attempt to use bitfields to
>synchronize parallel
>> > > > > >> +     algorithms.
>> > > > > >> +
>> > > > > >> + (*) Even in cases where bitfields are protected by
>locks, all fields
>> > > > > >> +     in a given bitfield must be protected by one lock. 
>If two fields
>> > > > > >> +     in a given bitfield are protected by different
>locks, the compiler's
>> > > > > >> +     non-atomic read-modify-write sequences can cause an
>update to one
>> > > > > >> +     field to corrupt the value of an adjacent field.
>> > > > > >> +
>> > > > > >> + (*) These guarantees apply only to properly aligned and
>sized scalar
>> > > > > >> +     variables.  "Properly sized" currently means "int"
>and "long",
>> > > > > >> +     because some CPU families do not support loads and
>stores of
>> > > > > >> +     other sizes.  ("Some CPU families" is currently
>believed to
>> > > > > >> +     be only Alpha 21064.  If this is actually the case,
>a different
>> > > > > >> +     non-guarantee is likely to be formulated.)
>> > > > > > 
>> > > > > > This is a bit unclear.  Presumably you're talking about
>definiteness of
>> > > > > > the outcome (as in what's seen after multiple stores to the
>same
>> > > > > > variable).
>> > > > > 
>> > > > > No, the last condition refers to adjacent byte stores from
>different
>> > > > > cpu contexts (either interrupt or SMP).
>> > > > > 
>> > > > > > The guarantees are only for natural width on Parisc as
>well,
>> > > > > > so you would get a mess if you did byte stores to adjacent
>memory
>> > > > > > locations.
>> > > > > 
>> > > > > For a simple test like:
>> > > > > 
>> > > > > struct x {
>> > > > > 	long a;
>> > > > > 	char b;
>> > > > > 	char c;
>> > > > > 	char d;
>> > > > > 	char e;
>> > > > > };
>> > > > > 
>> > > > > void store_bc(struct x *p) {
>> > > > > 	p->b = 1;
>> > > > > 	p->c = 2;
>> > > > > }
>> > > > > 
>> > > > > on parisc, gcc generates separate byte stores
>> > > > > 
>> > > > > void store_bc(struct x *p) {
>> > > > >    0:	34 1c 00 02 	ldi 1,ret0
>> > > > >    4:	0f 5c 12 08 	stb ret0,4(r26)
>> > > > >    8:	34 1c 00 04 	ldi 2,ret0
>> > > > >    c:	e8 40 c0 00 	bv r0(rp)
>> > > > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
>> > > > > 
>> > > > > which appears to confirm that on parisc adjacent byte data
>> > > > > is safe from corruption by concurrent cpu updates; that is,
>> > > > > 
>> > > > > CPU 0                | CPU 1
>> > > > >                      |
>> > > > > p->b = 1             | p->c = 2
>> > > > >                      |
>> > > > > 
>> > > > > will result in p->b == 1 && p->c == 2 (assume both values
>> > > > > were 0 before the call to store_bc()).
>> > > > 
>> > > > What Peter said.  I would ask for suggestions for better
>wording, but
>> > > > I would much rather be able to say that single-byte reads and
>writes
>> > > > are atomic and that aligned-short reads and writes are also
>atomic.
>> > > > 
>> > > > Thus far, it looks like we lose only very old Alpha systems, so
>unless
>> > > > I hear otherwise, I update my patch to outlaw these very old
>systems.
>> > > 
>> > > This isn't universally true according to the architecture manual.
> The
>> > > PARISC CPU can make byte to long word stores atomic against the
>memory
>> > > bus but not against the I/O bus for instance.  Atomicity is a
>property
>> > > of the underlying substrate, not of the CPU.  Implying that
>atomicity is
>> > > a CPU property is incorrect.
>> > 
>> > OK, fair point.
>> > 
>> > But are there in-use-for-Linux PARISC memory fabrics (for normal
>memory,
>> > not I/O) that do not support single-byte and double-byte stores?
>> 
>> For aligned access, I believe that's always the case for the memory
>bus
>> (on both 32 and 64 bit systems).  However, it only applies to machine
> >> instruction loads and stores of the same width.  If you mix the
>widths
>> on the loads and stores, all bets are off.  That means you have to
>> beware of the gcc penchant for coalescing loads and stores: if it
>sees
>> two adjacent byte stores it can coalesce them into a short store
>> instead ... that screws up the atomicity guarantees.
>
>OK, that means that to make PARISC work reliably, we need to use
>ACCESS_ONCE() for loads and stores that could have racing accesses.
>If I understand correctly, this will -not- be needed for code guarded
>by locks, even with Peter's examples.
>
>So if we have something like this:
>
>	struct foo {
>		char a;
>		char b;
>	};
>	struct foo *fp;
>
>then this code would be bad:
>
>	fp->a = 1;
>	fp->b = 2;
>
>The reason is (as you say) that GCC would be happy to store 0x0102
>(or vice versa, depending on endianness) to the pair.  We instead
>need:
>
>	ACCESS_ONCE(fp->a) = 1;
>	ACCESS_ONCE(fp->b) = 2;
>
>However, if the code is protected by locks, no problem:
>
>	struct foo {
>		spinlock_t lock_a;
>		spinlock_t lock_b;
>		char a;
>		char b;
>	};
>
>Then it is OK to do the following:
>
>	spin_lock(fp->lock_a);
>	fp->a = 1;
>	spin_unlock(fp->lock_a);
>	spin_lock(fp->lock_b);
>	fp->b = 1;
>	spin_unlock(fp->lock_b);
>
>Or even this, assuming ->lock_a precedes ->lock_b in the locking
>hierarchy:
>
>	spin_lock(fp->lock_a);
>	spin_lock(fp->lock_b);
>	fp->a = 1;
>	fp->b = 1;
>	spin_unlock(fp->lock_a);
>	spin_unlock(fp->lock_b);
>
>Here gcc might merge the assignments to fp->a and fp->b, but that is OK
>because both locks are held, presumably preventing other assignments or
>references to fp->a and fp->b.
>
>On the other hand, if either fp->a or fp->b are referenced outside of
>their
>respective locks, even once, then this last code fragment would still
>need
>ACCESS_ONCE() as follows:
>
>	spin_lock(fp->lock_a);
>	spin_lock(fp->lock_b);
>	ACCESS_ONCE(fp->a) = 1;
>	ACCESS_ONCE(fp->b) = 1;
>	spin_unlock(fp->lock_a);
>	spin_unlock(fp->lock_b);
>
>Does that cover it?  If so, I will update memory-barriers.txt
>accordingly.
>
>							Thanx, Paul

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-07 23:17                                   ` H. Peter Anvin
  (?)
@ 2014-09-07 23:36                                     ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-07 23:36 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: James Bottomley, Peter Hurley, One Thousand Gnomes,
	Jakub Jelinek, Mikael Pettersson, Benjamin Herrenschmidt,
	Richard Henderson, Oleg Nesterov, Miroslav Franc, Paul Mackerras,
	linuxppc-dev, linux-kernel, linux-arch, Tony Luck, linux-ia64

On Sun, Sep 07, 2014 at 04:17:30PM -0700, H. Peter Anvin wrote:
> I'm confused why storing 0x0102 would be a problem.  I think gcc does that even on other cpus.
> 
> More atomicity can't hurt, can it?

I must defer to James for any additional details on why PARISC systems
don't provide atomicity for partially overlapping stores.  ;-)

							Thanx, Paul

> On September 7, 2014 4:00:19 PM PDT, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >On Sun, Sep 07, 2014 at 12:04:47PM -0700, James Bottomley wrote:
> >> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
> >> > On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
> >> > > On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
> >> > > > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
> >> > > > > Hi James,
> >> > > > > 
> >> > > > > On 09/04/2014 10:11 PM, James Bottomley wrote:
> >> > > > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> >> > > > > >> +And there are anti-guarantees:
> >> > > > > >> +
> >> > > > > >> + (*) These guarantees do not apply to bitfields, because
> >compilers often
> >> > > > > >> +     generate code to modify these using non-atomic
> >read-modify-write
> >> > > > > >> +     sequences.  Do not attempt to use bitfields to
> >synchronize parallel
> >> > > > > >> +     algorithms.
> >> > > > > >> +
> >> > > > > >> + (*) Even in cases where bitfields are protected by
> >locks, all fields
> >> > > > > >> +     in a given bitfield must be protected by one lock. 
> >If two fields
> >> > > > > >> +     in a given bitfield are protected by different
> >locks, the compiler's
> >> > > > > >> +     non-atomic read-modify-write sequences can cause an
> >update to one
> >> > > > > >> +     field to corrupt the value of an adjacent field.
> >> > > > > >> +
> >> > > > > >> + (*) These guarantees apply only to properly aligned and
> >sized scalar
> >> > > > > >> +     variables.  "Properly sized" currently means "int"
> >and "long",
> >> > > > > >> +     because some CPU families do not support loads and
> >stores of
> >> > > > > >> +     other sizes.  ("Some CPU families" is currently
> >believed to
> >> > > > > >> +     be only Alpha 21064.  If this is actually the case,
> >a different
> >> > > > > >> +     non-guarantee is likely to be formulated.)
> >> > > > > > 
> >> > > > > > This is a bit unclear.  Presumably you're talking about
> >definiteness of
> >> > > > > > the outcome (as in what's seen after multiple stores to the
> >same
> >> > > > > > variable).
> >> > > > > 
> >> > > > > No, the last condition refers to adjacent byte stores from
> >different
> >> > > > > cpu contexts (either interrupt or SMP).
> >> > > > > 
> >> > > > > > The guarantees are only for natural width on Parisc as
> >well,
> >> > > > > > so you would get a mess if you did byte stores to adjacent
> >memory
> >> > > > > > locations.
> >> > > > > 
> >> > > > > For a simple test like:
> >> > > > > 
> >> > > > > struct x {
> >> > > > > 	long a;
> >> > > > > 	char b;
> >> > > > > 	char c;
> >> > > > > 	char d;
> >> > > > > 	char e;
> >> > > > > };
> >> > > > > 
> >> > > > > void store_bc(struct x *p) {
> >> > > > > 	p->b = 1;
> >> > > > > 	p->c = 2;
> >> > > > > }
> >> > > > > 
> >> > > > > on parisc, gcc generates separate byte stores
> >> > > > > 
> >> > > > > void store_bc(struct x *p) {
> >> > > > >    0:	34 1c 00 02 	ldi 1,ret0
> >> > > > >    4:	0f 5c 12 08 	stb ret0,4(r26)
> >> > > > >    8:	34 1c 00 04 	ldi 2,ret0
> >> > > > >    c:	e8 40 c0 00 	bv r0(rp)
> >> > > > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
> >> > > > > 
> >> > > > > which appears to confirm that on parisc adjacent byte data
> >> > > > > is safe from corruption by concurrent cpu updates; that is,
> >> > > > > 
> >> > > > > CPU 0                | CPU 1
> >> > > > >                      |
> >> > > > > p->b = 1             | p->c = 2
> >> > > > >                      |
> >> > > > > 
> >> > > > > will result in p->b == 1 && p->c == 2 (assume both values
> >> > > > > were 0 before the call to store_bc()).
> >> > > > 
> >> > > > What Peter said.  I would ask for suggestions for better
> >wording, but
> >> > > > I would much rather be able to say that single-byte reads and
> >writes
> >> > > > are atomic and that aligned-short reads and writes are also
> >atomic.
> >> > > > 
> >> > > > Thus far, it looks like we lose only very old Alpha systems, so
> >unless
> >> > > > I hear otherwise, I update my patch to outlaw these very old
> >systems.
> >> > > 
> >> > > This isn't universally true according to the architecture manual.
> > The
> >> > > PARISC CPU can make byte to long word stores atomic against the
> >memory
> >> > > bus but not against the I/O bus for instance.  Atomicity is a
> >property
> >> > > of the underlying substrate, not of the CPU.  Implying that
> >atomicity is
> >> > > a CPU property is incorrect.
> >> > 
> >> > OK, fair point.
> >> > 
> >> > But are there in-use-for-Linux PARISC memory fabrics (for normal
> >memory,
> >> > not I/O) that do not support single-byte and double-byte stores?
> >> 
> >> For aligned access, I believe that's always the case for the memory
> >bus
> >> (on both 32 and 64 bit systems).  However, it only applies to machine
> >> instruction loads and stores of the same width.  If you mix the
> >widths
> >> on the loads and stores, all bets are off.  That means you have to
> >> beware of the gcc penchant for coalescing loads and stores: if it
> >sees
> >> two adjacent byte stores it can coalesce them into a short store
> >> instead ... that screws up the atomicity guarantees.
> >
> >OK, that means that to make PARISC work reliably, we need to use
> >ACCESS_ONCE() for loads and stores that could have racing accesses.
> >If I understand correctly, this will -not- be needed for code guarded
> >by locks, even with Peter's examples.
> >
> >So if we have something like this:
> >
> >	struct foo {
> >		char a;
> >		char b;
> >	};
> >	struct foo *fp;
> >
> >then this code would be bad:
> >
> >	fp->a = 1;
> >	fp->b = 2;
> >
> >The reason is (as you say) that GCC would be happy to store 0x0102
> >(or vice versa, depending on endianness) to the pair.  We instead
> >need:
> >
> >	ACCESS_ONCE(fp->a) = 1;
> >	ACCESS_ONCE(fp->b) = 2;
> >
> >However, if the code is protected by locks, no problem:
> >
> >	struct foo {
> >		spinlock_t lock_a;
> >		spinlock_t lock_b;
> >		char a;
> >		char b;
> >	};
> >
> >Then it is OK to do the following:
> >
> >	spin_lock(fp->lock_a);
> >	fp->a = 1;
> >	spin_unlock(fp->lock_a);
> >	spin_lock(fp->lock_b);
> >	fp->b = 1;
> >	spin_unlock(fp->lock_b);
> >
> >Or even this, assuming ->lock_a precedes ->lock_b in the locking
> >hierarchy:
> >
> >	spin_lock(fp->lock_a);
> >	spin_lock(fp->lock_b);
> >	fp->a = 1;
> >	fp->b = 1;
> >	spin_unlock(fp->lock_a);
> >	spin_unlock(fp->lock_b);
> >
> >Here gcc might merge the assignments to fp->a and fp->b, but that is OK
> >because both locks are held, presumably preventing other assignments or
> >references to fp->a and fp->b.
> >
> >On the other hand, if either fp->a or fp->b are referenced outside of
> >their
> >respective locks, even once, then this last code fragment would still
> >need
> >ACCESS_ONCE() as follows:
> >
> >	spin_lock(fp->lock_a);
> >	spin_lock(fp->lock_b);
> >	ACCESS_ONCE(fp->a) = 1;
> >	ACCESS_ONCE(fp->b) = 1;
> >	spin_unlock(fp->lock_a);
> >	spin_unlock(fp->lock_b);
> >
> >Does that cover it?  If so, I will update memory-barriers.txt
> >accordingly.
> >
> >							Thanx, Paul
> 
> -- 
> Sent from my mobile phone.  Please pardon brevity and lack of formatting.
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-07 23:36                                     ` Paul E. McKenney
  (?)
@ 2014-09-07 23:39                                       ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-07 23:39 UTC (permalink / raw)
  To: paulmck
  Cc: James Bottomley, Peter Hurley, One Thousand Gnomes,
	Jakub Jelinek, Mikael Pettersson, Benjamin Herrenschmidt,
	Richard Henderson, Oleg Nesterov, Miroslav Franc, Paul Mackerras,
	linuxppc-dev, linux-kernel, linux-arch, Tony Luck, linux-ia64

How many PARISC systems do we have that actually do real work on Linux?

On September 7, 2014 4:36:55 PM PDT, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>On Sun, Sep 07, 2014 at 04:17:30PM -0700, H. Peter Anvin wrote:
>> I'm confused why storing 0x0102 would be a problem.  I think gcc does
>that even on other cpus.
>> 
>> More atomicity can't hurt, can it?
>
>I must defer to James for any additional details on why PARISC systems
>don't provide atomicity for partially overlapping stores.  ;-)
>
>							Thanx, Paul
>
>> On September 7, 2014 4:00:19 PM PDT, "Paul E. McKenney"
><paulmck@linux.vnet.ibm.com> wrote:
>> >On Sun, Sep 07, 2014 at 12:04:47PM -0700, James Bottomley wrote:
>> >> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
>> >> > On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
>> >> > > On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
>> >> > > > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley
>wrote:
>> >> > > > > Hi James,
>> >> > > > > 
>> >> > > > > On 09/04/2014 10:11 PM, James Bottomley wrote:
>> >> > > > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney
>wrote:
>> >> > > > > >> +And there are anti-guarantees:
>> >> > > > > >> +
>> >> > > > > >> + (*) These guarantees do not apply to bitfields,
>because
>> >compilers often
>> >> > > > > >> +     generate code to modify these using non-atomic
>> >read-modify-write
>> >> > > > > >> +     sequences.  Do not attempt to use bitfields to
>> >synchronize parallel
>> >> > > > > >> +     algorithms.
>> >> > > > > >> +
>> >> > > > > >> + (*) Even in cases where bitfields are protected by
>> >locks, all fields
>> >> > > > > >> +     in a given bitfield must be protected by one
>lock. 
>> >If two fields
>> >> > > > > >> +     in a given bitfield are protected by different
>> >locks, the compiler's
>> >> > > > > >> +     non-atomic read-modify-write sequences can cause
>an
>> >update to one
>> >> > > > > >> +     field to corrupt the value of an adjacent field.
>> >> > > > > >> +
>> >> > > > > >> + (*) These guarantees apply only to properly aligned
>and
>> >sized scalar
>> >> > > > > >> +     variables.  "Properly sized" currently means
>"int"
>> >and "long",
>> >> > > > > >> +     because some CPU families do not support loads
>and
>> >stores of
>> >> > > > > >> +     other sizes.  ("Some CPU families" is currently
>> >believed to
>> >> > > > > >> +     be only Alpha 21064.  If this is actually the
>case,
>> >a different
>> >> > > > > >> +     non-guarantee is likely to be formulated.)
>> >> > > > > > 
>> >> > > > > > This is a bit unclear.  Presumably you're talking about
>> >definiteness of
>> >> > > > > > the outcome (as in what's seen after multiple stores to
>the
>> >same
>> >> > > > > > variable).
>> >> > > > > 
>> >> > > > > No, the last condition refers to adjacent byte stores
>from
>> >different
>> >> > > > > cpu contexts (either interrupt or SMP).
>> >> > > > > 
>> >> > > > > > The guarantees are only for natural width on Parisc as
>> >well,
>> >> > > > > > so you would get a mess if you did byte stores to
>adjacent
>> >memory
>> >> > > > > > locations.
>> >> > > > > 
>> >> > > > > For a simple test like:
>> >> > > > > 
>> >> > > > > struct x {
>> >> > > > > 	long a;
>> >> > > > > 	char b;
>> >> > > > > 	char c;
>> >> > > > > 	char d;
>> >> > > > > 	char e;
>> >> > > > > };
>> >> > > > > 
>> >> > > > > void store_bc(struct x *p) {
>> >> > > > > 	p->b = 1;
>> >> > > > > 	p->c = 2;
>> >> > > > > }
>> >> > > > > 
>> >> > > > > on parisc, gcc generates separate byte stores
>> >> > > > > 
>> >> > > > > void store_bc(struct x *p) {
>> >> > > > >    0:	34 1c 00 02 	ldi 1,ret0
>> >> > > > >    4:	0f 5c 12 08 	stb ret0,4(r26)
>> >> > > > >    8:	34 1c 00 04 	ldi 2,ret0
>> >> > > > >    c:	e8 40 c0 00 	bv r0(rp)
>> >> > > > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
>> >> > > > > 
>> >> > > > > which appears to confirm that on parisc adjacent byte data
>> >> > > > > is safe from corruption by concurrent cpu updates; that
>is,
>> >> > > > > 
>> >> > > > > CPU 0                | CPU 1
>> >> > > > >                      |
>> >> > > > > p->b = 1             | p->c = 2
>> >> > > > >                      |
>> >> > > > > 
>> >> > > > > will result in p->b == 1 && p->c == 2 (assume both values
>> >> > > > > were 0 before the call to store_bc()).
>> >> > > > 
>> >> > > > What Peter said.  I would ask for suggestions for better
>> >wording, but
>> >> > > > I would much rather be able to say that single-byte reads
>and
>> >writes
>> >> > > > are atomic and that aligned-short reads and writes are also
>> >atomic.
>> >> > > > 
>> >> > > > Thus far, it looks like we lose only very old Alpha systems,
>so
>> >unless
>> >> > > > I hear otherwise, I update my patch to outlaw these very old
>> >systems.
>> >> > > 
>> >> > > This isn't universally true according to the architecture
>manual.
>> > The
>> >> > > PARISC CPU can make byte to long word stores atomic against
>the
>> >memory
>> >> > > bus but not against the I/O bus for instance.  Atomicity is a
>> >property
>> >> > > of the underlying substrate, not of the CPU.  Implying that
>> >atomicity is
>> >> > > a CPU property is incorrect.
>> >> > 
>> >> > OK, fair point.
>> >> > 
>> >> > But are there in-use-for-Linux PARISC memory fabrics (for normal
>> >memory,
>> >> > not I/O) that do not support single-byte and double-byte stores?
>> >> 
>> >> For aligned access, I believe that's always the case for the
>memory
>> >bus
>> >> (on both 32 and 64 bit systems).  However, it only applies to
>machine
>> >> instruction loads and stores of the same width.  If you mix the
>> >widths
>> >> on the loads and stores, all bets are off.  That means you have to
>> >> beware of the gcc penchant for coalescing loads and stores: if it
>> >sees
>> >> two adjacent byte stores it can coalesce them into a short store
>> >> instead ... that screws up the atomicity guarantees.
>> >
>> >OK, that means that to make PARISC work reliably, we need to use
>> >ACCESS_ONCE() for loads and stores that could have racing accesses.
>> >If I understand correctly, this will -not- be needed for code
>guarded
>> >by locks, even with Peter's examples.
>> >
>> >So if we have something like this:
>> >
>> >	struct foo {
>> >		char a;
>> >		char b;
>> >	};
>> >	struct foo *fp;
>> >
>> >then this code would be bad:
>> >
>> >	fp->a = 1;
>> >	fp->b = 2;
>> >
>> >The reason is (as you say) that GCC would be happy to store 0x0102
>> >(or vice versa, depending on endianness) to the pair.  We instead
>> >need:
>> >
>> >	ACCESS_ONCE(fp->a) = 1;
>> >	ACCESS_ONCE(fp->b) = 2;
>> >
>> >However, if the code is protected by locks, no problem:
>> >
>> >	struct foo {
>> >		spinlock_t lock_a;
>> >		spinlock_t lock_b;
>> >		char a;
>> >		char b;
>> >	};
>> >
>> >Then it is OK to do the following:
>> >
>> >	spin_lock(fp->lock_a);
>> >	fp->a = 1;
>> >	spin_unlock(fp->lock_a);
>> >	spin_lock(fp->lock_b);
>> >	fp->b = 1;
>> >	spin_unlock(fp->lock_b);
>> >
>> >Or even this, assuming ->lock_a precedes ->lock_b in the locking
>> >hierarchy:
>> >
>> >	spin_lock(fp->lock_a);
>> >	spin_lock(fp->lock_b);
>> >	fp->a = 1;
>> >	fp->b = 1;
>> >	spin_unlock(fp->lock_a);
>> >	spin_unlock(fp->lock_b);
>> >
>> >Here gcc might merge the assignments to fp->a and fp->b, but that is
>OK
>> >because both locks are held, presumably preventing other assignments
>or
>> >references to fp->a and fp->b.
>> >
>> >On the other hand, if either fp->a or fp->b are referenced outside
>of
>> >their
>> >respective locks, even once, then this last code fragment would
>still
>> >need
>> >ACCESS_ONCE() as follows:
>> >
>> >	spin_lock(fp->lock_a);
>> >	spin_lock(fp->lock_b);
>> >	ACCESS_ONCE(fp->a) = 1;
>> >	ACCESS_ONCE(fp->b) = 1;
>> >	spin_unlock(fp->lock_a);
>> >	spin_unlock(fp->lock_b);
>> >
>> >Does that cover it?  If so, I will update memory-barriers.txt
>> >accordingly.
>> >
>> >							Thanx, Paul
>> 
>> -- 
>> Sent from my mobile phone.  Please pardon brevity and lack of
>formatting.
>> 

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-07 20:41                                 ` Peter Hurley
  (?)
@ 2014-09-08  5:50                                   ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-08  5:50 UTC (permalink / raw)
  To: Peter Hurley
  Cc: paulmck, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Sun, 2014-09-07 at 16:41 -0400, Peter Hurley wrote:
> On 09/07/2014 03:04 PM, James Bottomley wrote:
> > On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
> >> On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
> >>> On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
> >>>> On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
> >>>>> Hi James,
> >>>>>
> >>>>> On 09/04/2014 10:11 PM, James Bottomley wrote:
> >>>>>> On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
> >>>>>>> +And there are anti-guarantees:
> >>>>>>> +
> >>>>>>> + (*) These guarantees do not apply to bitfields, because compilers often
> >>>>>>> +     generate code to modify these using non-atomic read-modify-write
> >>>>>>> +     sequences.  Do not attempt to use bitfields to synchronize parallel
> >>>>>>> +     algorithms.
> >>>>>>> +
> >>>>>>> + (*) Even in cases where bitfields are protected by locks, all fields
> >>>>>>> +     in a given bitfield must be protected by one lock.  If two fields
> >>>>>>> +     in a given bitfield are protected by different locks, the compiler's
> >>>>>>> +     non-atomic read-modify-write sequences can cause an update to one
> >>>>>>> +     field to corrupt the value of an adjacent field.
> >>>>>>> +
> >>>>>>> + (*) These guarantees apply only to properly aligned and sized scalar
> >>>>>>> +     variables.  "Properly sized" currently means "int" and "long",
> >>>>>>> +     because some CPU families do not support loads and stores of
> >>>>>>> +     other sizes.  ("Some CPU families" is currently believed to
> >>>>>>> +     be only Alpha 21064.  If this is actually the case, a different
> >>>>>>> +     non-guarantee is likely to be formulated.)
> >>>>>>
> >>>>>> This is a bit unclear.  Presumably you're talking about definiteness of
> >>>>>> the outcome (as in what's seen after multiple stores to the same
> >>>>>> variable).
> >>>>>
> >>>>> No, the last condition refers to adjacent byte stores from different
> >>>>> cpu contexts (either interrupt or SMP).
> >>>>>
> >>>>>> The guarantees are only for natural width on Parisc as well,
> >>>>>> so you would get a mess if you did byte stores to adjacent memory
> >>>>>> locations.
> >>>>>
> >>>>> For a simple test like:
> >>>>>
> >>>>> struct x {
> >>>>> 	long a;
> >>>>> 	char b;
> >>>>> 	char c;
> >>>>> 	char d;
> >>>>> 	char e;
> >>>>> };
> >>>>>
> >>>>> void store_bc(struct x *p) {
> >>>>> 	p->b = 1;
> >>>>> 	p->c = 2;
> >>>>> }
> >>>>>
> >>>>> on parisc, gcc generates separate byte stores
> >>>>>
> >>>>> void store_bc(struct x *p) {
> >>>>>    0:	34 1c 00 02 	ldi 1,ret0
> >>>>>    4:	0f 5c 12 08 	stb ret0,4(r26)
> >>>>>    8:	34 1c 00 04 	ldi 2,ret0
> >>>>>    c:	e8 40 c0 00 	bv r0(rp)
> >>>>>   10:	0f 5c 12 0a 	stb ret0,5(r26)
> >>>>>
> >>>>> which appears to confirm that on parisc adjacent byte data
> >>>>> is safe from corruption by concurrent cpu updates; that is,
> >>>>>
> >>>>> CPU 0                | CPU 1
> >>>>>                      |
> >>>>> p->b = 1             | p->c = 2
> >>>>>                      |
> >>>>>
> >>>>> will result in p->b == 1 && p->c == 2 (assume both values
> >>>>> were 0 before the call to store_bc()).
> >>>>
> >>>> What Peter said.  I would ask for suggestions for better wording, but
> >>>> I would much rather be able to say that single-byte reads and writes
> >>>> are atomic and that aligned-short reads and writes are also atomic.
> >>>>
> >>>> Thus far, it looks like we lose only very old Alpha systems, so unless
> >>>> I hear otherwise, I update my patch to outlaw these very old systems.
> >>>
> >>> This isn't universally true according to the architecture manual.  The
> >>> PARISC CPU can make byte to long word stores atomic against the memory
> >>> bus but not against the I/O bus for instance.  Atomicity is a property
> >>> of the underlying substrate, not of the CPU.  Implying that atomicity is
> >>> a CPU property is incorrect.
> 
> To go back to this briefly, while it's true that atomicity is not
> a CPU property, atomicity is not possible if the CPU is not cooperating.

You mean if it doesn't have the bus logic to emit?  Like ia32 mostly
can't do 64 bit writes ... sure.

> >> OK, fair point.
> >>
> >> But are there in-use-for-Linux PARISC memory fabrics (for normal memory,
> >> not I/O) that do not support single-byte and double-byte stores?
> > 
> > For aligned access, I believe that's always the case for the memory bus
> > (on both 32 and 64 bit systems).  However, it only applies to machine
> > instruction loads and stores of the same width.  If you mix the widths
> > on the loads and stores, all bets are off.  That means you have to
> > beware of the gcc penchant for coalescing loads and stores: if it sees
> > two adjacent byte stores it can coalesce them into a short store
> > instead ... that screws up the atomicity guarantees.
> 
> Two things: I think that gcc has given up on combining adjacent writes,
> perhaps because unaligned writes on some arches are prohibitive, so
> whatever minor optimization was believed to be gained was quickly lost,
> multi-fold. (Although this might not be the reason since one would
> expect that gcc would know the alignment of the promoted store).

Um, gcc assumes architecturally correct alignment; that's why it pads
structures.  Only when accessing packed structures will it use the
lowest unit load/store.

If you have a struct { char a; char b; } and store first a then b with
constants, gcc will obligingly optimise to a short store.

> But additionally, even if gcc combines adjacent writes _that are part
> of the program flow_ then I believe the situation is no worse than
> would otherwise exist.
> 
> For instance, given the following:
> 
> struct x {
> 	spinlock_t lock;
> 	long a;
> 	char b;
> 	char c;
> };
> 
> void locked_store_b(struct x *p)
> {
> 	spin_lock(&p->lock);
> 	p->b = 1;
> 	spin_unlock(&p->lock);
> 	p->c = 2;
> }
> 
> Granted, the author probably expects ordered writes of
> 	STORE B
> 	STORE C
> but that's not guaranteed because there is no memory barrier
> ordering B before C.

Yes, there is: loads and stores may not migrate into or out of critical
sections.

> I see no problem with gcc write-combining in the absence of
> memory barriers (as long as alignment is still respected,
> ie., the size-promoted write is still naturally aligned). The
> combined write is still atomic.

Actual alignment is pretty irrelevant.  That's why all architectures
which require alignment also have to implement misaligned traps ... this
is a fundamental requirement of the networking code, for instance.

The main problem is gcc thinking there's a misalignment (or a packed
structure).  That causes splitting of loads and stores and that destroys
atomicity.

James

^ permalink raw reply	[flat|nested] 311+ messages in thread


* Re: bit fields && data tearing
  2014-09-07 23:39                                       ` H. Peter Anvin
  (?)
@ 2014-09-08  5:56                                         ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-08  5:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: paulmck, Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Sun, 2014-09-07 at 16:39 -0700, H. Peter Anvin wrote:
> How many PARISC systems do we have that actually do real work on Linux?

I'd be very surprised if this problem didn't exist on all alignment
requiring architectures, like PPC and Sparc as well.  I know it would be
very convenient if all the world were an x86 ... but it would also be
very boring.

The rules for coping with it are well known and a relaxation of what we
currently do in the kernel, so I don't see what the actual problem is.

In the context of this thread, PA can't do atomic bit sets (no atomic
RMW except the ldcw operation), but it can do atomic writes to fundamental
sizes (byte, short, int, long) provided gcc emits the correct primitive
(there are lots of gotchas in this, but that's not an architectural
problem).  These atomicity guarantees depend on the underlying storage
and are respected for main memory but not for any other type of bus.

James


> On September 7, 2014 4:36:55 PM PDT, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >On Sun, Sep 07, 2014 at 04:17:30PM -0700, H. Peter Anvin wrote:
> >> I'm confused why storing 0x0102 would be a problem.  I think gcc does
> >that even on other cpus.
> >> 
> >> More atomicity can't hurt, can it?
> >
> >I must defer to James for any additional details on why PARISC systems
> >don't provide atomicity for partially overlapping stores.  ;-)
> >
> >							Thanx, Paul
> >
> >> On September 7, 2014 4:00:19 PM PDT, "Paul E. McKenney"
> ><paulmck@linux.vnet.ibm.com> wrote:
> >> >On Sun, Sep 07, 2014 at 12:04:47PM -0700, James Bottomley wrote:
> >> >> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
> >> >> > On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
> >> >> > > On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
> >> >> > > > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley
> >wrote:
> >> >> > > > > Hi James,
> >> >> > > > > 
> >> >> > > > > On 09/04/2014 10:11 PM, James Bottomley wrote:
> >> >> > > > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney
> >wrote:
> >> >> > > > > >> +And there are anti-guarantees:
> >> >> > > > > >> +
> >> >> > > > > >> + (*) These guarantees do not apply to bitfields,
> >because
> >> >compilers often
> >> >> > > > > >> +     generate code to modify these using non-atomic
> >> >read-modify-write
> >> >> > > > > >> +     sequences.  Do not attempt to use bitfields to
> >> >synchronize parallel
> >> >> > > > > >> +     algorithms.
> >> >> > > > > >> +
> >> >> > > > > >> + (*) Even in cases where bitfields are protected by
> >> >locks, all fields
> >> >> > > > > >> +     in a given bitfield must be protected by one
> >lock. 
> >> >If two fields
> >> >> > > > > >> +     in a given bitfield are protected by different
> >> >locks, the compiler's
> >> >> > > > > >> +     non-atomic read-modify-write sequences can cause
> >an
> >> >update to one
> >> >> > > > > >> +     field to corrupt the value of an adjacent field.
> >> >> > > > > >> +
> >> >> > > > > >> + (*) These guarantees apply only to properly aligned
> >and
> >> >sized scalar
> >> >> > > > > >> +     variables.  "Properly sized" currently means
> >"int"
> >> >and "long",
> >> >> > > > > >> +     because some CPU families do not support loads
> >and
> >> >stores of
> >> >> > > > > >> +     other sizes.  ("Some CPU families" is currently
> >> >believed to
> >> >> > > > > >> +     be only Alpha 21064.  If this is actually the
> >case,
> >> >a different
> >> >> > > > > >> +     non-guarantee is likely to be formulated.)
> >> >> > > > > > 
> >> >> > > > > > This is a bit unclear.  Presumably you're talking about
> >> >definiteness of
> >> >> > > > > > the outcome (as in what's seen after multiple stores to
> >the
> >> >same
> >> >> > > > > > variable).
> >> >> > > > > 
> >> >> > > > > No, the last condition refers
> >from
> >> >different
> >> >> > > > > cpu contexts (either interrupt or SMP).
> >> >> > > > > 
> >> >> > > > > > The guarantees are only for natural width on Parisc as
> >> >well,
> >> >> > > > > > so you would get a mess if you did byte stores to
> >adjacent
> >> >memory
> >> >> > > > > > locations.
> >> >> > > > > 
> >> >> > > > > For a simple test like:
> >> >> > > > > 
> >> >> > > > > struct x {
> >> >> > > > > 	long a;
> >> >> > > > > 	char b;
> >> >> > > > > 	char c;
> >> >> > > > > 	char d;
> >> >> > > > > 	char e;
> >> >> > > > > };
> >> >> > > > > 
> >> >> > > > > void store_bc(struct x *p) {
> >> >> > > > > 	p->b = 1;
> >> >> > > > > 	p->c = 2;
> >> >> > > > > }
> >> >> > > > > 
> >> >> > > > > on parisc, gcc generates separate byte stores
> >> >> > > > > 
> >> >> > > > > void store_bc(struct x *p) {
> >> >> > > > >    0:	34 1c 00 02 	ldi 1,ret0
> >> >> > > > >    4:	0f 5c 12 08 	stb ret0,4(r26)
> >> >> > > > >    8:	34 1c 00 04 	ldi 2,ret0
> >> >> > > > >    c:	e8 40 c0 00 	bv r0(rp)
> >> >> > > > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
> >> >> > > > > 
> >> >> > > > > which appears to confirm that on parisc adjacent byte data
> >> >> > > > > is safe from corruption by concurrent cpu updates; that
> >is,
> >> >> > > > > 
> >> >> > > > > CPU 0                | CPU 1
> >> >> > > > >                      |
> >> >> > > > > p->b = 1             | p->c = 2
> >> >> > > > >                      |
> >> >> > > > > 
> >> >> > > > > will result in p->b == 1 && p->c == 2 (assume both values
> >> >> > > > > were 0 before the call to store_bc()).
> >> >> > > > 
> >> >> > > > What Peter said.  I would ask for suggestions for better
> >> >wording, but
> >> >> > > > I would much rather be able to say that single-byte reads
> >and
> >> >writes
> >> >> > > > are atomic and that aligned-short reads and writes are also
> >> >atomic.
> >> >> > > > 
> >> >> > > > Thus far, it looks like we lose only very old Alpha systems,
> >so
> >> >unless
> >> >> > > > I hear otherwise, I update my patch to outlaw these very old
> >> >systems.
> >> >> > > 
> >> >> > > This isn't universally true according to the architecture
> >manual.
> >> > The
> >> >> > > PARISC CPU can make byte to long word stores atomic against
> >the
> >> >memory
> >> >> > > bus but not against the I/O bus for instance.  Atomicity is a
> >> >property
> >> >> > > of the underlying substrate, not of the CPU.  Implying that
> >> >atomicity is
> >> >> > > a CPU property is incorrect.
> >> >> > 
> >> >> > OK, fair point.
> >> >> > 
> >> >> > But are there in-use-for-Linux PARISC memory fabrics (for normal
> >> >memory,
> >> >> > not I/O) that do not support single-byte and double-byte stores?
> >> >> 
> >> >> For aligned access, I believe that's always the case for the
> >memory
> >> >bus
> >> >> (on both 32 and 64 bit systems).  However, it only applies to
> >machine
> >> >> instruction loads and stores of the same width..  If you mix the
> >> >widths
> >> >> on the loads and stores, all bets are off.  That means you have to
> >> >> beware of the gcc penchant for coalescing loads and stores: if it
> >> >sees
> >> >> two adjacent byte stores it can coalesce them into a short store
> >> >> instead ... that screws up the atomicity guarantees.
> >> >
> >> >OK, that means that to make PARISC work reliably, we need to use
> >> >ACCESS_ONCE() for loads and stores that could have racing accesses.
> >> >If I understand correctly, this will -not- be needed for code
> >guarded
> >> >by locks, even with Peter's examples.
> >> >
> >> >So if we have something like this:
> >> >
> >> >	struct foo {
> >> >		char a;
> >> >		char b;
> >> >	};
> >> >	struct foo *fp;
> >> >
> >> >then this code would be bad:
> >> >
> >> >	fp->a = 1;
> >> >	fp->b = 2;
> >> >
> >> >The reason is (as you say) that GCC would be happy to store 0x0102
> >> >(or vice versa, depending on endianness) to the pair.  We instead
> >> >need:
> >> >
> >> >	ACCESS_ONCE(fp->a) = 1;
> >> >	ACCESS_ONCE(fp->b) = 2;
> >> >
> >> >However, if the code is protected by locks, no problem:
> >> >
> >> >	struct foo {
> >> >		spinlock_t lock_a;
> >> >		spinlock_t lock_b;
> >> >		char a;
> >> >		char b;
> >> >	};
> >> >
> >> >Then it is OK to do the following:
> >> >
> >> >	spin_lock(fp->lock_a);
> >> >	fp->a = 1;
> >> >	spin_unlock(fp->lock_a);
> >> >	spin_lock(fp->lock_b);
> >> >	fp->b = 1;
> >> >	spin_unlock(fp->lock_b);
> >> >
> >> >Or even this, assuming ->lock_a precedes ->lock_b in the locking
> >> >hierarchy:
> >> >
> >> >	spin_lock(fp->lock_a);
> >> >	spin_lock(fp->lock_b);
> >> >	fp->a = 1;
> >> >	fp->b = 1;
> >> >	spin_unlock(fp->lock_a);
> >> >	spin_unlock(fp->lock_b);
> >> >
> >> >Here gcc might merge the assignments to fp->a and fp->b, but that is
> >OK
> >> >because both locks are held, presumably preventing other assignments
> >or
> >> >references to fp->a and fp->b.
> >> >
> >> >On the other hand, if either fp->a or fp->b are referenced outside
> >of
> >> >their
> >> >respective locks, even once, then this last code fragment would
> >still
> >> >need
> >> >ACCESS_ONCE() as follows:
> >> >
> >> >	spin_lock(fp->lock_a);
> >> >	spin_lock(fp->lock_b);
> >> >	ACCESS_ONCE(fp->a) = 1;
> >> >	ACCESS_ONCE(fp->b) = 1;
> >> >	spin_unlock(fp->lock_a);
> >> >	spin_unlock(fp->lock_b);
> >> >
> >> >Does that cover it?  If so, I will update memory-barriers.txt
> >> >accordingly.
> >> >
> >> >							Thanx, Paul
> >> 
> >> -- 
> >> Sent from my mobile phone.  Please pardon brevity and lack of
> >formatting.
> >> 
> 



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-08  5:56                                         ` James Bottomley
  0 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-08  5:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jakub Jelinek, One Thousand Gnomes, Tony Luck, linux-ia64,
	Peter Hurley, Mikael Pettersson, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linux-arch, paulmck, linuxppc-dev,
	Miroslav Franc, Richard Henderson

On Sun, 2014-09-07 at 16:39 -0700, H. Peter Anvin wrote:
> How many PARISC systems do we have that actually do real work on Linux?

I'd be very surprised if this problem didn't exist on all alignment
requiring architectures, like PPC and Sparc as well.  I know it would be
very convenient if all the world were an x86 ... but it would also be
very boring as well.

The rules for coping with it are well known and a relaxation of what we
currently do in the kernel, so I don't see what the actual problem is.

In the context of this thread, PA can't do atomic bit sets (no atomic
RMW except the ldcw operation) it can do atomic writes to fundamental
sizes (byte, short, int, long) provided gcc emits the correct primitive
(there are lots of gotchas in this, but that's not an architectural
problem).  These atomicity guarantees depend on the underlying storage
and are respected for main memory but not for any other type of bus.

James


> On September 7, 2014 4:36:55 PM PDT, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >On Sun, Sep 07, 2014 at 04:17:30PM -0700, H. Peter Anvin wrote:
> >> I'm confused why storing 0x0102 would be a problem.  I think gcc does
> >that even on other cpus.
> >> 
> >> More atomicity can't hurt, can it?
> >
> >I must defer to James for any additional details on why PARISC systems
> >don't provide atomicity for partially overlapping stores.  ;-)
> >
> >							Thanx, Paul
> >
> >> On September 7, 2014 4:00:19 PM PDT, "Paul E. McKenney"
> ><paulmck@linux.vnet.ibm.com> wrote:
> >> >On Sun, Sep 07, 2014 at 12:04:47PM -0700, James Bottomley wrote:
> >> >> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
> >> >> > On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
> >> >> > > On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
> >> >> > > > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley
> >wrote:
> >> >> > > > > Hi James,
> >> >> > > > > 
> >> >> > > > > On 09/04/2014 10:11 PM, James Bottomley wrote:
> >> >> > > > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney
> >wrote:
> >> >> > > > > >> +And there are anti-guarantees:
> >> >> > > > > >> +
> >> >> > > > > >> + (*) These guarantees do not apply to bitfields,
> >because
> >> >compilers often
> >> >> > > > > >> +     generate code to modify these using non-atomic
> >> >read-modify-write
> >> >> > > > > >> +     sequences.  Do not attempt to use bitfields to
> >> >synchronize parallel
> >> >> > > > > >> +     algorithms.
> >> >> > > > > >> +
> >> >> > > > > >> + (*) Even in cases where bitfields are protected by
> >> >locks, all fields
> >> >> > > > > >> +     in a given bitfield must be protected by one
> >lock. 
> >> >If two fields
> >> >> > > > > >> +     in a given bitfield are protected by different
> >> >locks, the compiler's
> >> >> > > > > >> +     non-atomic read-modify-write sequences can cause
> >an
> >> >update to one
> >> >> > > > > >> +     field to corrupt the value of an adjacent field.
> >> >> > > > > >> +
> >> >> > > > > >> + (*) These guarantees apply only to properly aligned
> >and
> >> >sized scalar
> >> >> > > > > >> +     variables.  "Properly sized" currently means
> >"int"
> >> >and "long",
> >> >> > > > > >> +     because some CPU families do not support loads
> >and
> >> >stores of
> >> >> > > > > >> +     other sizes.  ("Some CPU families" is currently
> >> >believed to
> >> >> > > > > >> +     be only Alpha 21064.  If this is actually the
> >case,
> >> >a different
> >> >> > > > > >> +     non-guarantee is likely to be formulated.)
> >> >> > > > > > 
> >> >> > > > > > This is a bit unclear.  Presumably you're talking about
> >> >definiteness of
> >> >> > > > > > the outcome (as in what's seen after multiple stores to
> >the
> >> >same
> >> >> > > > > > variable).
> >> >> > > > > 
> >> >> > > > > No, the last conditions refers to adjacent byte stores
> >from
> >> >different
> >> >> > > > > cpu contexts (either interrupt or SMP).
> >> >> > > > > 
> >> >> > > > > > The guarantees are only for natural width on Parisc as
> >> >well,
> >> >> > > > > > so you would get a mess if you did byte stores to
> >adjacent
> >> >memory
> >> >> > > > > > locations.
> >> >> > > > > 
> >> >> > > > > For a simple test like:
> >> >> > > > > 
> >> >> > > > > struct x {
> >> >> > > > > 	long a;
> >> >> > > > > 	char b;
> >> >> > > > > 	char c;
> >> >> > > > > 	char d;
> >> >> > > > > 	char e;
> >> >> > > > > };
> >> >> > > > > 
> >> >> > > > > void store_bc(struct x *p) {
> >> >> > > > > 	p->b = 1;
> >> >> > > > > 	p->c = 2;
> >> >> > > > > }
> >> >> > > > > 
> >> >> > > > > on parisc, gcc generates separate byte stores
> >> >> > > > > 
> >> >> > > > > void store_bc(struct x *p) {
> >> >> > > > >    0:	34 1c 00 02 	ldi 1,ret0
> >> >> > > > >    4:	0f 5c 12 08 	stb ret0,4(r26)
> >> >> > > > >    8:	34 1c 00 04 	ldi 2,ret0
> >> >> > > > >    c:	e8 40 c0 00 	bv r0(rp)
> >> >> > > > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
> >> >> > > > > 
> >> >> > > > > which appears to confirm that on parisc adjacent byte data
> >> >> > > > > is safe from corruption by concurrent cpu updates; that
> >is,
> >> >> > > > > 
> >> >> > > > > CPU 0                | CPU 1
> >> >> > > > >                      |
> >> >> > > > > p->b = 1             | p->c = 2
> >> >> > > > >                      |
> >> >> > > > > 
> >> >> > > > > will result in p->b == 1 && p->c == 2 (assume both values
> >> >> > > > > were 0 before the call to store_bc()).
> >> >> > > > 
> >> >> > > > What Peter said.  I would ask for suggestions for better
> >> >wording, but
> >> >> > > > I would much rather be able to say that single-byte reads
> >and
> >> >writes
> >> >> > > > are atomic and that aligned-short reads and writes are also
> >> >atomic.
> >> >> > > > 
> >> >> > > > Thus far, it looks like we lose only very old Alpha systems,
> >so
> >> >unless
> >> >> > > > I hear otherwise, I update my patch to outlaw these very old
> >> >systems.
> >> >> > > 
> >> >> > > This isn't universally true according to the architecture
> >manual.
> >> > The
> >> >> > > PARISC CPU can make byte to long word stores atomic against
> >the
> >> >memory
> >> >> > > bus but not against the I/O bus for instance.  Atomicity is a
> >> >property
> >> >> > > of the underlying substrate, not of the CPU.  Implying that
> >> >atomicity is
> >> >> > > a CPU property is incorrect.
> >> >> > 
> >> >> > OK, fair point.
> >> >> > 
> >> >> > But are there in-use-for-Linux PARISC memory fabrics (for normal
> >> >memory,
> >> >> > not I/O) that do not support single-byte and double-byte stores?
> >> >> 
> >> >> For aligned access, I believe that's always the case for the
> >memory
> >> >bus
> >> >> (on both 32 and 64 bit systems).  However, it only applies to
> >machine
> >> >> instruction loads and stores of the same width.  If you mix the
> >> >widths
> >> >> on the loads and stores, all bets are off.  That means you have to
> >> >> beware of the gcc penchant for coalescing loads and stores: if it
> >> >sees
> >> >> two adjacent byte stores it can coalesce them into a short store
> >> >> instead ... that screws up the atomicity guarantees.
> >> >
> >> >OK, that means that to make PARISC work reliably, we need to use
> >> >ACCESS_ONCE() for loads and stores that could have racing accesses.
> >> >If I understand correctly, this will -not- be needed for code
> >guarded
> >> >by locks, even with Peter's examples.
> >> >
> >> >So if we have something like this:
> >> >
> >> >	struct foo {
> >> >		char a;
> >> >		char b;
> >> >	};
> >> >	struct foo *fp;
> >> >
> >> >then this code would be bad:
> >> >
> >> >	fp->a = 1;
> >> >	fp->b = 2;
> >> >
> >> >The reason is (as you say) that GCC would be happy to store 0x0102
> >> >(or vice versa, depending on endianness) to the pair.  We instead
> >> >need:
> >> >
> >> >	ACCESS_ONCE(fp->a) = 1;
> >> >	ACCESS_ONCE(fp->b) = 2;
> >> >
> >> >However, if the code is protected by locks, no problem:
> >> >
> >> >	struct foo {
> >> >		spinlock_t lock_a;
> >> >		spinlock_t lock_b;
> >> >		char a;
> >> >		char b;
> >> >	};
> >> >
> >> >Then it is OK to do the following:
> >> >
> >> >	spin_lock(fp->lock_a);
> >> >	fp->a = 1;
> >> >	spin_unlock(fp->lock_a);
> >> >	spin_lock(fp->lock_b);
> >> >	fp->b = 1;
> >> >	spin_unlock(fp->lock_b);
> >> >
> >> >Or even this, assuming ->lock_a precedes ->lock_b in the locking
> >> >hierarchy:
> >> >
> >> >	spin_lock(fp->lock_a);
> >> >	spin_lock(fp->lock_b);
> >> >	fp->a = 1;
> >> >	fp->b = 1;
> >> >	spin_unlock(fp->lock_a);
> >> >	spin_unlock(fp->lock_b);
> >> >
> >> >Here gcc might merge the assignments to fp->a and fp->b, but that is
> >OK
> >> >because both locks are held, presumably preventing other assignments
> >or
> >> >references to fp->a and fp->b.
> >> >
> >> >On the other hand, if either fp->a or fp->b are referenced outside
> >of
> >> >their
> >> >respective locks, even once, then this last code fragment would
> >still
> >> >need
> >> >ACCESS_ONCE() as follows:
> >> >
> >> >	spin_lock(fp->lock_a);
> >> >	spin_lock(fp->lock_b);
> >> >	ACCESS_ONCE(fp->a) = 1;
> >> >	ACCESS_ONCE(fp->b) = 1;
> >> >	spin_unlock(fp->lock_a);
> >> >	spin_unlock(fp->lock_b);
> >> >
> >> >Does that cover it?  If so, I will update memory-barriers.txt
> >> >accordingly.
> >> >
> >> >							Thanx, Paul
> >> 
> >> -- 
> >> Sent from my mobile phone.  Please pardon brevity and lack of
> >formatting.
> >> 
> 

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-08  5:56                                         ` James Bottomley
  0 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-08  5:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: paulmck, Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Sun, 2014-09-07 at 16:39 -0700, H. Peter Anvin wrote:
> How many PARISC systems do we have that actually do real work on Linux?

I'd be very surprised if this problem didn't exist on all alignment
requiring architectures, like PPC and Sparc as well.  I know it would be
very convenient if all the world were an x86 ... but it would also be
very boring as well.

The rules for coping with it are well known and a relaxation of what we
currently do in the kernel, so I don't see what the actual problem is.

In the context of this thread, PA can't do atomic bit sets (no atomic
RMW except the ldcw operation), but it can do atomic writes to fundamental
sizes (byte, short, int, long) provided gcc emits the correct primitive
(there are lots of gotchas in this, but that's not an architectural
problem).  These atomicity guarantees depend on the underlying storage
and are respected for main memory but not for any other type of bus.

James


> On September 7, 2014 4:36:55 PM PDT, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >On Sun, Sep 07, 2014 at 04:17:30PM -0700, H. Peter Anvin wrote:
> >> I'm confused why storing 0x0102 would be a problem.  I think gcc does
> >that even on other cpus.
> >> 
> >> More atomicity can't hurt, can it?
> >
> >I must defer to James for any additional details on why PARISC systems
> >don't provide atomicity for partially overlapping stores.  ;-)
> >
> >							Thanx, Paul
> >
> >> On September 7, 2014 4:00:19 PM PDT, "Paul E. McKenney"
> ><paulmck@linux.vnet.ibm.com> wrote:
> >> >On Sun, Sep 07, 2014 at 12:04:47PM -0700, James Bottomley wrote:
> >> >> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
> >> >> > On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
> >> >> > > On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
> >> >> > > > On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley
> >wrote:
> >> >> > > > > Hi James,
> >> >> > > > > 
> >> >> > > > > On 09/04/2014 10:11 PM, James Bottomley wrote:
> >> >> > > > > > On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney
> >wrote:
> >> >> > > > > >> +And there are anti-guarantees:
> >> >> > > > > >> +
> >> >> > > > > >> + (*) These guarantees do not apply to bitfields,
> >because
> >> >compilers often
> >> >> > > > > >> +     generate code to modify these using non-atomic
> >> >read-modify-write
> >> >> > > > > >> +     sequences.  Do not attempt to use bitfields to
> >> >synchronize parallel
> >> >> > > > > >> +     algorithms.
> >> >> > > > > >> +
> >> >> > > > > >> + (*) Even in cases where bitfields are protected by
> >> >locks, all fields
> >> >> > > > > >> +     in a given bitfield must be protected by one
> >lock. 
> >> >If two fields
> >> >> > > > > >> +     in a given bitfield are protected by different
> >> >locks, the compiler's
> >> >> > > > > >> +     non-atomic read-modify-write sequences can cause
> >an
> >> >update to one
> >> >> > > > > >> +     field to corrupt the value of an adjacent field.
> >> >> > > > > >> +
> >> >> > > > > >> + (*) These guarantees apply only to properly aligned
> >and
> >> >sized scalar
> >> >> > > > > >> +     variables.  "Properly sized" currently means
> >"int"
> >> >and "long",
> >> >> > > > > >> +     because some CPU families do not support loads
> >and
> >> >stores of
> >> >> > > > > >> +     other sizes.  ("Some CPU families" is currently
> >> >believed to
> >> >> > > > > >> +     be only Alpha 21064.  If this is actually the
> >case,
> >> >a different
> >> >> > > > > >> +     non-guarantee is likely to be formulated.)
> >> >> > > > > > 
> >> >> > > > > > This is a bit unclear.  Presumably you're talking about
> >> >definiteness of
> >> >> > > > > > the outcome (as in what's seen after multiple stores to
> >the
> >> >same
> >> >> > > > > > variable).
> >> >> > > > > 
> >> >> > > > > No, the last conditions refers to adjacent byte stores
> >from
> >> >different
> >> >> > > > > cpu contexts (either interrupt or SMP).
> >> >> > > > > 
> >> >> > > > > > The guarantees are only for natural width on Parisc as
> >> >well,
> >> >> > > > > > so you would get a mess if you did byte stores to
> >adjacent
> >> >memory
> >> >> > > > > > locations.
> >> >> > > > > 
> >> >> > > > > For a simple test like:
> >> >> > > > > 
> >> >> > > > > struct x {
> >> >> > > > > 	long a;
> >> >> > > > > 	char b;
> >> >> > > > > 	char c;
> >> >> > > > > 	char d;
> >> >> > > > > 	char e;
> >> >> > > > > };
> >> >> > > > > 
> >> >> > > > > void store_bc(struct x *p) {
> >> >> > > > > 	p->b = 1;
> >> >> > > > > 	p->c = 2;
> >> >> > > > > }
> >> >> > > > > 
> >> >> > > > > on parisc, gcc generates separate byte stores
> >> >> > > > > 
> >> >> > > > > void store_bc(struct x *p) {
> >> >> > > > >    0:	34 1c 00 02 	ldi 1,ret0
> >> >> > > > >    4:	0f 5c 12 08 	stb ret0,4(r26)
> >> >> > > > >    8:	34 1c 00 04 	ldi 2,ret0
> >> >> > > > >    c:	e8 40 c0 00 	bv r0(rp)
> >> >> > > > >   10:	0f 5c 12 0a 	stb ret0,5(r26)
> >> >> > > > > 
> >> >> > > > > which appears to confirm that on parisc adjacent byte data
> >> >> > > > > is safe from corruption by concurrent cpu updates; that
> >is,
> >> >> > > > > 
> >> >> > > > > CPU 0                | CPU 1
> >> >> > > > >                      |
> >> >> > > > > p->b = 1             | p->c = 2
> >> >> > > > >                      |
> >> >> > > > > 
> >> >> > > > > will result in p->b == 1 && p->c == 2 (assume both values
> >> >> > > > > were 0 before the call to store_bc()).
> >> >> > > > 
> >> >> > > > What Peter said.  I would ask for suggestions for better
> >> >wording, but
> >> >> > > > I would much rather be able to say that single-byte reads
> >and
> >> >writes
> >> >> > > > are atomic and that aligned-short reads and writes are also
> >> >atomic.
> >> >> > > > 
> >> >> > > > Thus far, it looks like we lose only very old Alpha systems,
> >so
> >> >unless
> >> >> > > > I hear otherwise, I update my patch to outlaw these very old
> >> >systems.
> >> >> > > 
> >> >> > > This isn't universally true according to the architecture
> >manual.
> >> > The
> >> >> > > PARISC CPU can make byte to long word stores atomic against
> >the
> >> >memory
> >> >> > > bus but not against the I/O bus for instance.  Atomicity is a
> >> >property
> >> >> > > of the underlying substrate, not of the CPU.  Implying that
> >> >atomicity is
> >> >> > > a CPU property is incorrect.
> >> >> > 
> >> >> > OK, fair point.
> >> >> > 
> >> >> > But are there in-use-for-Linux PARISC memory fabrics (for normal
> >> >memory,
> >> >> > not I/O) that do not support single-byte and double-byte stores?
> >> >> 
> >> >> For aligned access, I believe that's always the case for the
> >memory
> >> >bus
> >> >> (on both 32 and 64 bit systems).  However, it only applies to
> >machine
> >> >> instruction loads and stores of the same width.  If you mix the
> >> >widths
> >> >> on the loads and stores, all bets are off.  That means you have to
> >> >> beware of the gcc penchant for coalescing loads and stores: if it
> >> >sees
> >> >> two adjacent byte stores it can coalesce them into a short store
> >> >> instead ... that screws up the atomicity guarantees.
> >> >
> >> >OK, that means that to make PARISC work reliably, we need to use
> >> >ACCESS_ONCE() for loads and stores that could have racing accesses.
> >> >If I understand correctly, this will -not- be needed for code
> >guarded
> >> >by locks, even with Peter's examples.
> >> >
> >> >So if we have something like this:
> >> >
> >> >	struct foo {
> >> >		char a;
> >> >		char b;
> >> >	};
> >> >	struct foo *fp;
> >> >
> >> >then this code would be bad:
> >> >
> >> >	fp->a = 1;
> >> >	fp->b = 2;
> >> >
> >> >The reason is (as you say) that GCC would be happy to store 0x0102
> >> >(or vice versa, depending on endianness) to the pair.  We instead
> >> >need:
> >> >
> >> >	ACCESS_ONCE(fp->a) = 1;
> >> >	ACCESS_ONCE(fp->b) = 2;
> >> >
> >> >However, if the code is protected by locks, no problem:
> >> >
> >> >	struct foo {
> >> >		spinlock_t lock_a;
> >> >		spinlock_t lock_b;
> >> >		char a;
> >> >		char b;
> >> >	};
> >> >
> >> >Then it is OK to do the following:
> >> >
> >> >	spin_lock(fp->lock_a);
> >> >	fp->a = 1;
> >> >	spin_unlock(fp->lock_a);
> >> >	spin_lock(fp->lock_b);
> >> >	fp->b = 1;
> >> >	spin_unlock(fp->lock_b);
> >> >
> >> >Or even this, assuming ->lock_a precedes ->lock_b in the locking
> >> >hierarchy:
> >> >
> >> >	spin_lock(fp->lock_a);
> >> >	spin_lock(fp->lock_b);
> >> >	fp->a = 1;
> >> >	fp->b = 1;
> >> >	spin_unlock(fp->lock_a);
> >> >	spin_unlock(fp->lock_b);
> >> >
> >> >Here gcc might merge the assignments to fp->a and fp->b, but that is
> >OK
> >> >because both locks are held, presumably preventing other assignments
> >or
> >> >references to fp->a and fp->b.
> >> >
> >> >On the other hand, if either fp->a or fp->b are referenced outside
> >of
> >> >their
> >> >respective locks, even once, then this last code fragment would
> >still
> >> >need
> >> >ACCESS_ONCE() as follows:
> >> >
> >> >	spin_lock(fp->lock_a);
> >> >	spin_lock(fp->lock_b);
> >> >	ACCESS_ONCE(fp->a) = 1;
> >> >	ACCESS_ONCE(fp->b) = 1;
> >> >	spin_unlock(fp->lock_a);
> >> >	spin_unlock(fp->lock_b);
> >> >
> >> >Does that cover it?  If so, I will update memory-barriers.txt
> >> >accordingly.
> >> >
> >> >							Thanx, Paul
> >> 
> >> -- 
> >> Sent from my mobile phone.  Please pardon brevity and lack of
> >formatting.
> >> 
> 



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 15:41                   ` H. Peter Anvin
  (?)
@ 2014-09-08 17:52                     ` One Thousand Gnomes
  -1 siblings, 0 replies; 311+ messages in thread
From: One Thousand Gnomes @ 2014-09-08 17:52 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Hurley, Benjamin Herrenschmidt, David Laight,
	Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Oleg Nesterov,
	linux-kernel, Paul Mackerras, Paul E. McKenney, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On Fri, 05 Sep 2014 08:41:52 -0700
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On 09/05/2014 08:31 AM, Peter Hurley wrote:
> > 
> > Which is a bit ironic because I remember when Digital had a team
> > working on emulating native x86 apps on Alpha/NT.
> > 
> 
> Right, because the x86 architecture was obsolete and would never scale...

Talking about "not scaling", can anyone explain how a "you need to use
set_bit() and friends" bug report scaled into a hundred-message-plus
discussion about ambiguous properties of processors (and nobody has
audited all the embedded platforms we support yet, or the weirder ARMs)
and a proposal to remove Alpha support?

Wouldn't it be *much* simpler to do what I suggested in the first place
and use the existing functions for atomic bitops, put there deliberately
for exactly this purpose, because they are fast on sane processors and
they work on everything else?

I think the whole "removing Alpha EV5" support is basically bonkers. Just
use set_bit in the tty layer. Alpha will continue to work as well as it
always has done and you won't design out support for any future processor
that turns out not to do byte aligned stores.

Alan

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 17:52                     ` One Thousand Gnomes
  (?)
@ 2014-09-08 17:59                       ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-08 17:59 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Peter Hurley, Benjamin Herrenschmidt, David Laight,
	Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Oleg Nesterov,
	linux-kernel, Paul Mackerras, Paul E. McKenney, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On 09/08/2014 10:52 AM, One Thousand Gnomes wrote:
> On Fri, 05 Sep 2014 08:41:52 -0700
> "H. Peter Anvin" <hpa@zytor.com> wrote:
> 
>> On 09/05/2014 08:31 AM, Peter Hurley wrote:
>>>
>>> Which is a bit ironic because I remember when Digital had a team
>>> working on emulating native x86 apps on Alpha/NT.
>>>
>>
>> Right, because the x86 architecture was obsolete and would never scale...
> 
> Talking about "not scaling" can anyone explain how a "you need to use
> set_bit() and friends" bug report scaled into a hundred message plus
> discussion about ambiguous properties of processors (and nobody has
> audited all the embedded platforms we support yet, or the weirder ARMs)
> and a proposal to remove Alpha support.
> 
> Wouldn't it be *much* simpler to do what I suggested in the first place
> and use the existing intended for purpose, deliberately put there,
> functions for atomic bitops, because they are fast on sane processors and
> they work on everything else.
> 
> I think the whole "removing Alpha EV5" support is basically bonkers. Just
> use set_bit in the tty layer. Alpha will continue to work as well as it
> always has done and you won't design out support for any future processor
> that turns out not to do byte aligned stores.
> 
> Alan
> 

Is *that* what we are talking about?  I was added to this conversation
in the middle where it had already generalized, so I had no idea.

	-hpa


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08  5:56                                         ` James Bottomley
  (?)
@ 2014-09-08 18:12                                           ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-08 18:12 UTC (permalink / raw)
  To: James Bottomley
  Cc: paulmck, Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On 09/07/2014 10:56 PM, James Bottomley wrote:
> On Sun, 2014-09-07 at 16:39 -0700, H. Peter Anvin wrote:
>> How many PARISC systems do we have that actually do real work on Linux?
> 
> I'd be very surprised if this problem didn't exist on all alignment
> requiring architectures, like PPC and Sparc as well.  I know it would be
> very convenient if all the world were an x86 ... but it would also be
> very boring.

I wouldn't be so sure about that.  That is a pretty aggressive
relaxation of ordering that PARISC has enacted here, kind of like the
Alpha "we don't need no stinking byte accesses".

> The rules for coping with it are well known and a relaxation of what we
> currently do in the kernel, so I don't see what the actual problem is.
> 
> In the context of this thread, PA can't do atomic bit sets (no atomic
> RMW except the ldcw operation), but it can do atomic writes to fundamental
> sizes (byte, short, int, long) provided gcc emits the correct primitive
> (there are lots of gotchas in this, but that's not an architectural
> problem).  These atomicity guarantees depend on the underlying storage
> and are respected for main memory but not for any other type of bus.

So I'm not trying to push the "all the world is an x86"... certainly not
given that x86 has abnormally strict ordering rules and so is itself an
outlier.  What I *don't* really want to have to deal with is going
through more than casual effort to accommodate outliers which no longer
have any real value -- we have too much work to do.

	-hpa



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 17:52                     ` One Thousand Gnomes
  (?)
@ 2014-09-08 18:13                       ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-08 18:13 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: H. Peter Anvin, Peter Hurley, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, Paul E. McKenney,
	linuxppc-dev, Miroslav Franc, Richard Henderson, linux-alpha

On Mon, 2014-09-08 at 18:52 +0100, One Thousand Gnomes wrote:
> On Fri, 05 Sep 2014 08:41:52 -0700
> "H. Peter Anvin" <hpa@zytor.com> wrote:
> 
> > On 09/05/2014 08:31 AM, Peter Hurley wrote:
> > > 
> > > Which is a bit ironic because I remember when Digital had a team
> > > working on emulating native x86 apps on Alpha/NT.
> > > 
> > 
> > Right, because the x86 architecture was obsolete and would never scale...
> 
> Talking about "not scaling" can anyone explain how a "you need to use
> set_bit() and friends" bug report scaled into a hundred message plus
> discussion about ambiguous properties of processors (and nobody has
> audited all the embedded platforms we support yet, or the weirder ARMs)
> and a proposal to remove Alpha support.
> 
> Wouldn't it be *much* simpler to do what I suggested in the first place
> and use the existing intended for purpose, deliberately put there,
> functions for atomic bitops, because they are fast on sane processors and
> they work on everything else.
> 
> I think the whole "removing Alpha EV5" support is basically bonkers. Just
> use set_bit in the tty layer. Alpha will continue to work as well as it
> always has done and you won't design out support for any future processor
> that turns out not to do byte aligned stores.

Seconded.  We implement via hashed spinlocks on PA ... but hey, we're
not the fastest architecture anyway and semantically it just works.

James



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 18:12                                           ` H. Peter Anvin
  (?)
@ 2014-09-08 19:09                                             ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-08 19:09 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: paulmck, Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Mon, 2014-09-08 at 11:12 -0700, H. Peter Anvin wrote:
> On 09/07/2014 10:56 PM, James Bottomley wrote:
> > On Sun, 2014-09-07 at 16:39 -0700, H. Peter Anvin wrote:
> >> How many PARISC systems do we have that actually do real work on Linux?
> > 
> > I'd be very surprised if this problem didn't exist on all alignment
> > requiring architectures, like PPC and Sparc as well.  I know it would be
> > very convenient if all the world were an x86 ... but it would also be
> > very boring as well.
> 
> I wouldn't be so sure about that.  That is a pretty aggressive
> relaxation of ordering that PARISC has enacted here, kind of like the
> Alpha "we don't need no stinking byte accesses".

Um, I think you need to re-read the thread; that's not what I said at
all. It's even written lower down: "PA can't do atomic bit sets (no
atomic RMW except the ldcw operation) it can do atomic writes to
fundamental sizes (byte, short, int, long) provided gcc emits the
correct primitive".  The original question was whether atomicity
required native bus width access, which we currently assume, so there's
no extant problem.

> > The rules for coping with it are well known and a relaxation of what we
> > currently do in the kernel, so I don't see what the actual problem is.
> > 
> > In the context of this thread, PA can't do atomic bit sets (no atomic
> > RMW except the ldcw operation) it can do atomic writes to fundamental
> > sizes (byte, short, int, long) provided gcc emits the correct primitive
> > (there are lots of gotchas in this, but that's not an architectural
> > problem).  These atomicity guarantees depend on the underlying storage
> > and are respected for main memory but not for any other type of bus.
> 
> So I'm not trying to push the "all the world is an x86"... certainly not
> given that x86 has abnormally strict ordering rules and so is itself an
> outlier.  What I *don't* really want to have to deal with is going
> through more than casual effort to accommodate outliers which no longer
> have any real value -- we have too much work to do.

What accommodation?  I was just explaining the architectural atomicity
rules on PA.  How they're bus dependent (which isn't a PA restriction)
and how the compiler can mess them up.  None of the current PA atomicity
rules conflicts with anything we do today ... the danger is that moving
to implicit atomicity gives us more scope for compiler problems ... but
that's not PA specific either; any architecture requiring alignment has
this problem.

James



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 19:09                                             ` James Bottomley
  (?)
  (?)
@ 2014-09-08 19:12                                               ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-08 19:12 UTC (permalink / raw)
  To: James Bottomley
  Cc: paulmck, Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On 09/08/2014 12:09 PM, James Bottomley wrote:
> 
> Um, I think you need to re-read the thread; that's not what I said at
> all. It's even written lower down: "PA can't do atomic bit sets (no
> atomic RMW except the ldcw operation) it can do atomic writes to
> fundamental sizes (byte, short, int, long) provided gcc emits the
> correct primitive".  The original question was whether atomicity
> required native bus width access, which we currently assume, so there's
> no extant problem.
> 

The issue at hand was whether or not partially overlapped (but naturally
aligned) writes can pass each other.  *This* is the aggressive
relaxation to which I am referring.

I would guess that that is a very unusual constraint.

	-hpa


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 17:59                       ` H. Peter Anvin
  (?)
@ 2014-09-08 19:17                         ` One Thousand Gnomes
  -1 siblings, 0 replies; 311+ messages in thread
From: One Thousand Gnomes @ 2014-09-08 19:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Hurley, Benjamin Herrenschmidt, David Laight,
	Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Oleg Nesterov,
	linux-kernel, Paul Mackerras, Paul E. McKenney, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

> > I think the whole "removing Alpha EV5" support is basically bonkers. Just
> > use set_bit in the tty layer. Alpha will continue to work as well as it
> > always has done and you won't design out support for any future processor
> > that turns out not to do byte aligned stores.
> > 
> > Alan
> > 
> 
> Is *that* what we are talking about?  I was added to this conversation
> in the middle where it had already generalized, so I had no idea.
> 
> 	-hpa

Yes there are some flags in the tty layer that are vulnerable to this
(although they've been vulnerable to it and missing a lock since last
century with no known ill effects).

It's not as if this is causing any work to anyone beyond using the
standard API; no API needs to change.

Alan

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08  5:50                                   ` James Bottomley
  (?)
  (?)
@ 2014-09-08 20:45                                     ` Chris Metcalf
  -1 siblings, 0 replies; 311+ messages in thread
From: Chris Metcalf @ 2014-09-08 20:45 UTC (permalink / raw)
  To: James Bottomley, Peter Hurley
  Cc: paulmck, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On 9/8/2014 1:50 AM, James Bottomley wrote:
> Actual alignment is pretty irrelevant.  That's why all architectures
> which require alignment also have to implement misaligned traps ... this
> is a fundamental requirement of the networking code, for instance.

Can you clarify what you think the requirement is?  The tile architecture
doesn't support misaligned load/store in general, but we do support it for
userspace (using a nifty JIT approach with a direct-map hash table kept
in userspace), and also for get_user/put_user.  But that's it, and
the networking subsystem works fine for us.

Occasionally we report bugs for driver code that doesn't use the
get_unaligned_xxx() macros and friends, and our fixes are generally taken
upstream.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 19:12                                               ` H. Peter Anvin
  (?)
@ 2014-09-08 22:39                                                 ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-08 22:39 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: paulmck, Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Mon, 2014-09-08 at 12:12 -0700, H. Peter Anvin wrote:
> On 09/08/2014 12:09 PM, James Bottomley wrote:
> > 
> > Um, I think you need to re-read the thread; that's not what I said at
> > all. It's even written lower down: "PA can't do atomic bit sets (no
> > atomic RMW except the ldcw operation) it can do atomic writes to
> > fundamental sizes (byte, short, int, long) provided gcc emits the
> > correct primitive".  The original question was whether atomicity
> > required native bus width access, which we currently assume, so there's
> > no extant problem.
> > 
> 
> The issue at hand was whether or not partially overlapped (but naturally
> aligned) writes can pass each other.  *This* is the aggressive
> relaxation to which I am referring.

I don't understand what you mean by "pass each other".  Atomicity
guarantees are not ordering guarantees in a SMP environment.  The
guarantee is that if you follow the rules when two CPUs update the same
natural width aligned object simultaneously using the same primitive,
the result is either one or the other of their updates.  Which one wins
(the ordering) isn't defined.

James


> I would guess that that is a very unusual constraint.



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 20:45                                     ` Chris Metcalf
  (?)
@ 2014-09-08 22:43                                       ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-08 22:43 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Peter Hurley, paulmck, H. Peter Anvin, One Thousand Gnomes,
	Jakub Jelinek, Mikael Pettersson, Benjamin Herrenschmidt,
	Richard Henderson, Oleg Nesterov, Miroslav Franc, Paul Mackerras,
	linuxppc-dev, linux-kernel, linux-arch, Tony Luck, linux-ia64

On Mon, 2014-09-08 at 16:45 -0400, Chris Metcalf wrote:
> On 9/8/2014 1:50 AM, James Bottomley wrote:
> > Actual alignment is pretty irrelevant.  That's why all architectures
> > which require alignment also have to implement misaligned traps ... this
> > is a fundamental requirement of the networking code, for instance.
> 
> Can you clarify what you think the requirement is?  The tile architecture
> doesn't support misaligned load/store in general, but we do support it for
> userspace (using a nifty JIT approach with a direct-map hash table kept
> in userspace), and also for get_user/put_user.  But that's it, and,
> the networking subsystem works fine for us.

This was years ago (possibly decades).  We had to implement in-kernel
unaligned traps for the networking layer because it could access short
and int fields that weren't of the correct alignment when processing
packets.  If that's all corrected now, we wouldn't really notice (except
a bit of a speed up since an unaligned trap effectively places the
broken-out instructions into the bit stream).

James


> Occasionally we report bugs for driver code that doesn't use the
> get_unaligned_xxx() macros and friends, and our fixes are generally taken
> upstream.



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 17:59                       ` H. Peter Anvin
  (?)
@ 2014-09-08 22:47                         ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-08 22:47 UTC (permalink / raw)
  To: H. Peter Anvin, One Thousand Gnomes
  Cc: David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, Paul E. McKenney,
	linuxppc-dev, Miroslav Franc, Richard Henderson, linux-alpha

On 09/08/2014 01:59 PM, H. Peter Anvin wrote:
> On 09/08/2014 10:52 AM, One Thousand Gnomes wrote:
>> On Fri, 05 Sep 2014 08:41:52 -0700
>> "H. Peter Anvin" <hpa@zytor.com> wrote:
>>
>>> On 09/05/2014 08:31 AM, Peter Hurley wrote:
>>>>
>>>> Which is a bit ironic because I remember when Digital had a team
>>>> working on emulating native x86 apps on Alpha/NT.
>>>>
>>>
>>> Right, because the x86 architecture was obsolete and would never scale...
>>
>> Talking about "not scaling" can anyone explain how a "you need to use
>> set_bit() and friends" bug report scaled into a hundred message plus
>> discussion about ambiguous properties of processors (and nobody has
>> audited all the embedded platforms we support yet, or the weirder ARMs)
>> and a proposal to remove Alpha support.
>>
>> Wouldn't it be *much* simpler to do what I suggested in the first place
>> and use the existing intended for purpose, deliberately put there,
>> functions for atomic bitops, because they are fast on sane processors and
>> they work on everything else.
>>
>> I think the whole "removing Alpha EV5" support is basically bonkers. Just
>> use set_bit in the tty layer. Alpha will continue to work as well as it
>> always has done and you won't design out support for any future processor
>> that turns out not to do byte aligned stores.
>>
>> Alan
>>
> 
> Is *that* what we are talking about?  I was added to this conversation
> in the middle where it had already generalized, so I had no idea.

No, this is just what brought this craziness to my attention.

For example, byte- and short-sized circular buffers could not possibly
be safe either, when the head nears the tail.

Who has audited global storage and ensured that _every_ byte-sized write
doesn't happen to be adjacent to some other storage that may not happen
to be protected by the same (or any) lock?

Regards,
Peter Hurley




^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08  5:50                                   ` James Bottomley
  (?)
@ 2014-09-08 23:30                                     ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-08 23:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: paulmck, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On 09/08/2014 01:50 AM, James Bottomley wrote:
> On Sun, 2014-09-07 at 16:41 -0400, Peter Hurley wrote:
>> On 09/07/2014 03:04 PM, James Bottomley wrote:
>>> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
>>>> On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
>>>>> On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
>>>>>> On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
>>>>>>> Hi James,
>>>>>>>
>>>>>>> On 09/04/2014 10:11 PM, James Bottomley wrote:
>>>>>>>> On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
>>>>>>>>> +And there are anti-guarantees:
>>>>>>>>> +
>>>>>>>>> + (*) These guarantees do not apply to bitfields, because compilers often
>>>>>>>>> +     generate code to modify these using non-atomic read-modify-write
>>>>>>>>> +     sequences.  Do not attempt to use bitfields to synchronize parallel
>>>>>>>>> +     algorithms.
>>>>>>>>> +
>>>>>>>>> + (*) Even in cases where bitfields are protected by locks, all fields
>>>>>>>>> +     in a given bitfield must be protected by one lock.  If two fields
>>>>>>>>> +     in a given bitfield are protected by different locks, the compiler's
>>>>>>>>> +     non-atomic read-modify-write sequences can cause an update to one
>>>>>>>>> +     field to corrupt the value of an adjacent field.
>>>>>>>>> +
>>>>>>>>> + (*) These guarantees apply only to properly aligned and sized scalar
>>>>>>>>> +     variables.  "Properly sized" currently means "int" and "long",
>>>>>>>>> +     because some CPU families do not support loads and stores of
>>>>>>>>> +     other sizes.  ("Some CPU families" is currently believed to
>>>>>>>>> +     be only Alpha 21064.  If this is actually the case, a different
>>>>>>>>> +     non-guarantee is likely to be formulated.)
>>>>>>>>
>>>>>>>> This is a bit unclear.  Presumably you're talking about definiteness of
>>>>>>>> the outcome (as in what's seen after multiple stores to the same
>>>>>>>> variable).
>>>>>>>
>>>>>>> No, the last conditions refers to adjacent byte stores from different
>>>>>>> cpu contexts (either interrupt or SMP).
>>>>>>>
>>>>>>>> The guarantees are only for natural width on Parisc as well,
>>>>>>>> so you would get a mess if you did byte stores to adjacent memory
>>>>>>>> locations.
>>>>>>>
>>>>>>> For a simple test like:
>>>>>>>
>>>>>>> struct x {
>>>>>>> 	long a;
>>>>>>> 	char b;
>>>>>>> 	char c;
>>>>>>> 	char d;
>>>>>>> 	char e;
>>>>>>> };
>>>>>>>
>>>>>>> void store_bc(struct x *p) {
>>>>>>> 	p->b = 1;
>>>>>>> 	p->c = 2;
>>>>>>> }
>>>>>>>
>>>>>>> on parisc, gcc generates separate byte stores
>>>>>>>
>>>>>>> void store_bc(struct x *p) {
>>>>>>>    0:	34 1c 00 02 	ldi 1,ret0
>>>>>>>    4:	0f 5c 12 08 	stb ret0,4(r26)
>>>>>>>    8:	34 1c 00 04 	ldi 2,ret0
>>>>>>>    c:	e8 40 c0 00 	bv r0(rp)
>>>>>>>   10:	0f 5c 12 0a 	stb ret0,5(r26)
>>>>>>>
>>>>>>> which appears to confirm that on parisc adjacent byte data
>>>>>>> is safe from corruption by concurrent cpu updates; that is,
>>>>>>>
>>>>>>> CPU 0                | CPU 1
>>>>>>>                      |
>>>>>>> p->b = 1             | p->c = 2
>>>>>>>                      |
>>>>>>>
>>>>>>> will result in p->b == 1 && p->c == 2 (assume both values
>>>>>>> were 0 before the call to store_bc()).
>>>>>>
>>>>>> What Peter said.  I would ask for suggestions for better wording, but
>>>>>> I would much rather be able to say that single-byte reads and writes
>>>>>> are atomic and that aligned-short reads and writes are also atomic.
>>>>>>
>>>>>> Thus far, it looks like we lose only very old Alpha systems, so unless
>>>>>> I hear otherwise, I update my patch to outlaw these very old systems.
>>>>>
>>>>> This isn't universally true according to the architecture manual.  The
>>>>> PARISC CPU can make byte to long word stores atomic against the memory
>>>>> bus but not against the I/O bus for instance.  Atomicity is a property
>>>>> of the underlying substrate, not of the CPU.  Implying that atomicity is
>>>>> a CPU property is incorrect.
>>
>> To go back to this briefly, while it's true that atomicity is not
>> a CPU property, atomicity is not possible if the CPU is not cooperating.
> 
> You mean if it doesn't have the bus logic to emit?  Like ia32 mostly
> can't do 64 bit writes ... sure.

Yeah, or Alphas that do rmw for byte storage.

>>>> OK, fair point.
>>>>
>>>> But are there in-use-for-Linux PARISC memory fabrics (for normal memory,
>>>> not I/O) that do not support single-byte and double-byte stores?
>>>
>>> For aligned access, I believe that's always the case for the memory bus
>>> (on both 32 and 64 bit systems).  However, it only applies to machine
>>> instruction loads and stores of the same width.  If you mix the widths
>>> on the loads and stores, all bets are off.  That means you have to
>>> beware of the gcc penchant for coalescing loads and stores: if it sees
>>> two adjacent byte stores it can coalesce them into a short store
>>> instead ... that screws up the atomicity guarantees.
>>
>> Two things: I think that gcc has given up on combining adjacent writes,
>> perhaps because unaligned writes on some arches are prohibitive, so
>> whatever minor optimization was believed to be gained was quickly lost,
>> multi-fold. (Although this might not be the reason since one would
>> expect that gcc would know the alignment of the promoted store).
> 
> Um, gcc assumes architecturally correct alignment; that's why it pads
> structures.  Only when accessing packed structures will it use the
> lowest unit load/store.
> 
> if you have a struct { char a, char b }; and load first a then b with a
> constant gcc will obligingly optimise to a short store.

Um, please re-look at the test above. The exact test you describe is
coded above and compiled with gcc 4.6.3 cross-compiler for parisc using
the kernel compiler options.

In the generated code, please note the _absence_ of a combined write
to two adjacent byte stores.


>> But additionally, even if gcc combines adjacent writes _that are part
>> of the program flow_ then I believe the situation is no worse than
>> would otherwise exist.
>>
>> For instance, given the following:
>>
>> struct x {
>> 	spinlock_t lock;
>> 	long a;
>> 	char b;
>> 	char c;
>> };
>>
>> void locked_store_b(struct x *p)
>> {
>> 	spin_lock(&p->lock);
>> 	p->b = 1;
>> 	spin_unlock(&p->lock);
>> 	p->c = 2;
>> }
>>
>> Granted, the author probably expects ordered writes of
>> 	STORE B
>> 	STORE C
>> but that's not guaranteed because there is no memory barrier
>> ordering B before C.
> 
> Yes, there is: loads and stores may not migrate into or out of critical
> sections.

That's a common misconception.

The processor is free to re-order this to:

	STORE C
	STORE B
	UNLOCK

That's because the unlock() only guarantees that:

Stores before the unlock in program order are guaranteed to complete
before the unlock completes. Stores after the unlock _may_ complete
before the unlock completes.

My point was that even if compiler barriers had the same semantics
as memory barriers, the situation would be no worse. That is, code
that is sensitive to memory barriers (like the example I gave above)
would merely have the same fragility with one-way compiler barriers
(with respect to the compiler combining writes).

That's what I meant by "no worse than would otherwise exist".

>> I see no problem with gcc write-combining in the absence of
>> memory barriers (as long as alignment is still respected,
>> ie., the size-promoted write is still naturally aligned). The
>> combined write is still atomic.
> 
> Actual alignment is pretty irrelevant.  That's why all architectures
> which require alignment also have to implement misaligned traps ... this
> is a fundamental requirement of the networking code, for instance.
> 
> The main problem is gcc thinking there's a misalignment (or a packed
> structure).  That causes splitting of loads and stores and that destroys
> atomicity.

Yeah, the extra requirement I added is basically nonsense, since the
only issue is what instructions the compiler is emitting. So if compiler
thinks the alignment is natural and combines the writes -- ok. If the
compiler thinks the alignment is off and doesn't combine the writes --
also ok.

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-08 23:30                                     ` Peter Hurley
  0 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-08 23:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jakub Jelinek, One Thousand Gnomes, linux-arch, linux-ia64,
	Mikael Pettersson, Oleg Nesterov, linux-kernel, Tony Luck,
	Paul Mackerras, H. Peter Anvin, paulmck, linuxppc-dev,
	Miroslav Franc, Richard Henderson

On 09/08/2014 01:50 AM, James Bottomley wrote:
> On Sun, 2014-09-07 at 16:41 -0400, Peter Hurley wrote:
>> On 09/07/2014 03:04 PM, James Bottomley wrote:
>>> On Sun, 2014-09-07 at 09:21 -0700, Paul E. McKenney wrote:
>>>> On Sat, Sep 06, 2014 at 10:07:22PM -0700, James Bottomley wrote:
>>>>> On Thu, 2014-09-04 at 21:06 -0700, Paul E. McKenney wrote:
>>>>>> On Thu, Sep 04, 2014 at 10:47:24PM -0400, Peter Hurley wrote:
>>>>>>> Hi James,
>>>>>>>
>>>>>>> On 09/04/2014 10:11 PM, James Bottomley wrote:
>>>>>>>> On Thu, 2014-09-04 at 17:17 -0700, Paul E. McKenney wrote:
>>>>>>>>> +And there are anti-guarantees:
>>>>>>>>> +
>>>>>>>>> + (*) These guarantees do not apply to bitfields, because compilers often
>>>>>>>>> +     generate code to modify these using non-atomic read-modify-write
>>>>>>>>> +     sequences.  Do not attempt to use bitfields to synchronize parallel
>>>>>>>>> +     algorithms.
>>>>>>>>> +
>>>>>>>>> + (*) Even in cases where bitfields are protected by locks, all fields
>>>>>>>>> +     in a given bitfield must be protected by one lock.  If two fields
>>>>>>>>> +     in a given bitfield are protected by different locks, the compiler's
>>>>>>>>> +     non-atomic read-modify-write sequences can cause an update to one
>>>>>>>>> +     field to corrupt the value of an adjacent field.
>>>>>>>>> +
>>>>>>>>> + (*) These guarantees apply only to properly aligned and sized scalar
>>>>>>>>> +     variables.  "Properly sized" currently means "int" and "long",
>>>>>>>>> +     because some CPU families do not support loads and stores of
>>>>>>>>> +     other sizes.  ("Some CPU families" is currently believed to
>>>>>>>>> +     be only Alpha 21064.  If this is actually the case, a different
>>>>>>>>> +     non-guarantee is likely to be formulated.)
>>>>>>>>
>>>>>>>> This is a bit unclear.  Presumably you're talking about definiteness of
>>>>>>>> the outcome (as in what's seen after multiple stores to the same
>>>>>>>> variable).
>>>>>>>
>>>>>>> No, the last conditions refers to adjacent byte stores from different
>>>>>>> cpu contexts (either interrupt or SMP).
>>>>>>>
>>>>>>>> The guarantees are only for natural width on Parisc as well,
>>>>>>>> so you would get a mess if you did byte stores to adjacent memory
>>>>>>>> locations.
>>>>>>>
>>>>>>> For a simple test like:
>>>>>>>
>>>>>>> struct x {
>>>>>>> 	long a;
>>>>>>> 	char b;
>>>>>>> 	char c;
>>>>>>> 	char d;
>>>>>>> 	char e;
>>>>>>> };
>>>>>>>
>>>>>>> void store_bc(struct x *p) {
>>>>>>> 	p->b = 1;
>>>>>>> 	p->c = 2;
>>>>>>> }
>>>>>>>
>>>>>>> on parisc, gcc generates separate byte stores
>>>>>>>
>>>>>>> void store_bc(struct x *p) {
>>>>>>>    0:	34 1c 00 02 	ldi 1,ret0
>>>>>>>    4:	0f 5c 12 08 	stb ret0,4(r26)
>>>>>>>    8:	34 1c 00 04 	ldi 2,ret0
>>>>>>>    c:	e8 40 c0 00 	bv r0(rp)
>>>>>>>   10:	0f 5c 12 0a 	stb ret0,5(r26)
>>>>>>>
>>>>>>> which appears to confirm that on parisc adjacent byte data
>>>>>>> is safe from corruption by concurrent cpu updates; that is,
>>>>>>>
>>>>>>> CPU 0                | CPU 1
>>>>>>>                      |
>>>>>>> p->b = 1             | p->c = 2
>>>>>>>                      |
>>>>>>>
>>>>>>> will result in p->b == 1 && p->c == 2 (assume both values
>>>>>>> were 0 before the call to store_bc()).
>>>>>>
>>>>>> What Peter said.  I would ask for suggestions for better wording, but
>>>>>> I would much rather be able to say that single-byte reads and writes
>>>>>> are atomic and that aligned-short reads and writes are also atomic.
>>>>>>
>>>>>> Thus far, it looks like we lose only very old Alpha systems, so unless
>>>>>> I hear otherwise, I update my patch to outlaw these very old systems.
>>>>>
>>>>> This isn't universally true according to the architecture manual.  The
>>>>> PARISC CPU can make byte to long word stores atomic against the memory
>>>>> bus but not against the I/O bus for instance.  Atomicity is a property
>>>>> of the underlying substrate, not of the CPU.  Implying that atomicity is
>>>>> a CPU property is incorrect.
>>
>> To go back to this briefly, while it's true that atomicity is not
>> a CPU property, atomicity is not possible if the CPU is not cooperating.
> 
> You mean if it doesn't have the bus logic to emit?  Like ia32 mostly
> can't do 64 bit writes ... sure.

Yeah, or Alphas that do rmw for byte storage.

>>>> OK, fair point.
>>>>
>>>> But are there in-use-for-Linux PARISC memory fabrics (for normal memory,
>>>> not I/O) that do not support single-byte and double-byte stores?
>>>
>>> For aligned access, I believe that's always the case for the memory bus
>>> (on both 32 and 64 bit systems).  However, it only applies to machine
>>> instruction loads and stores of the same width.  If you mix the widths
>>> on the loads and stores, all bets are off.  That means you have to
>>> beware of the gcc penchant for coalescing loads and stores: if it sees
>>> two adjacent byte stores it can coalesce them into a short store
>>> instead ... that screws up the atomicity guarantees.
>>
>> Two things: I think that gcc has given up on combining adjacent writes,
>> perhaps because unaligned writes on some arches are prohibitive, so
>> whatever minor optimization was believed to be gained was quickly lost,
>> multi-fold. (Although this might not be the reason since one would
>> expect that gcc would know the alignment of the promoted store).
> 
> Um, gcc assumes architecturally correct alignment; that's why it pads
> structures.  Only when accessing packed structures will it use the
> lowest unit load/store.
> 
> if you have a struct { char a; char b; }; and load first a then b with a
> constant, gcc will obligingly optimise to a short store.

Um, please re-look at the test above. The exact test you describe is
coded above and compiled with gcc 4.6.3 cross-compiler for parisc using
the kernel compiler options.

In the generated code, please note the _absence_ of a combined write
to two adjacent byte stores.


>> But additionally, even if gcc combines adjacent writes _that are part
>> of the program flow_ then I believe the situation is no worse than
>> would otherwise exist.
>>
>> For instance, given the following:
>>
>> struct x {
>> 	spinlock_t lock;
>> 	long a;
>> 	byte b;
>> 	byte c;
>> };
>>
>> void locked_store_b(struct x *p)
>> {
>> 	spin_lock(&p->lock);
>> 	p->b = 1;
>> 	spin_unlock(&p->lock);
>> 	p->c = 2;
>> }
>>
>> Granted, the author probably expects ordered writes of
>> 	STORE B
>> 	STORE C
>> but that's not guaranteed because there is no memory barrier
>> ordering B before C.
> 
> Yes, there is: loads and stores may not migrate into or out of critical
> sections.

That's a common misconception.

The processor is free to re-order this to:

	STORE C
	STORE B
	UNLOCK

That's because the unlock() only guarantees that:

Stores before the unlock in program order are guaranteed to complete
before the unlock completes. Stores after the unlock _may_ complete
before the unlock completes.

My point was that even if compiler barriers had the same semantics
as memory barriers, the situation would be no worse. That is, code
that is sensitive to memory barriers (like the example I gave above)
would merely have the same fragility with one-way compiler barriers
(with respect to the compiler combining writes).

That's what I meant by "no worse than would otherwise exist".

>> I see no problem with gcc write-combining in the absence of
>> memory barriers (as long as alignment is still respected,
>> i.e., the size-promoted write is still naturally aligned). The
>> combined write is still atomic.
> 
> Actual alignment is pretty irrelevant.  That's why all architectures
> which require alignment also have to implement misaligned traps ... this
> is a fundamental requirement of the networking code, for instance.
> 
> The main problem is gcc thinking there's a misalignment (or a packed
> structure).  That causes splitting of loads and stores and that destroys
> atomicity.

Yeah, the extra requirement I added is basically nonsense, since the
only issue is what instructions the compiler is emitting. So if compiler
thinks the alignment is natural and combines the writes -- ok. If the
compiler thinks the alignment is off and doesn't combine the writes --
also ok.

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 22:47                         ` Peter Hurley
  (?)
@ 2014-09-09  1:59                           ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-09  1:59 UTC (permalink / raw)
  To: Peter Hurley
  Cc: H. Peter Anvin, One Thousand Gnomes, David Laight, Jakub Jelinek,
	linux-arch, Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On Mon, Sep 08, 2014 at 06:47:35PM -0400, Peter Hurley wrote:
> On 09/08/2014 01:59 PM, H. Peter Anvin wrote:
> > On 09/08/2014 10:52 AM, One Thousand Gnomes wrote:
> >> On Fri, 05 Sep 2014 08:41:52 -0700
> >> "H. Peter Anvin" <hpa@zytor.com> wrote:
> >>
> >>> On 09/05/2014 08:31 AM, Peter Hurley wrote:
> >>>>
> >>>> Which is a bit ironic because I remember when Digital had a team
> >>>> working on emulating native x86 apps on Alpha/NT.
> >>>>
> >>>
> >>> Right, because the x86 architecture was obsolete and would never scale...
> >>
> >> Talking about "not scaling" can anyone explain how a "you need to use
> >> set_bit() and friends" bug report scaled into a hundred message plus
> >> discussion about ambiguous properties of processors (and nobody has
> >> audited all the embedded platforms we support yet, or the weirder ARMs)
> >> and a proposal to remove Alpha support.
> >>
> >> Wouldn't it be *much* simpler to do what I suggested in the first place
> >> and use the existing intended for purpose, deliberately put there,
> >> functions for atomic bitops, because they are fast on sane processors and
> >> they work on everything else.
> >>
> >> I think the whole "removing Alpha EV5" support is basically bonkers. Just
> >> use set_bit in the tty layer. Alpha will continue to work as well as it
> >> always has done and you won't design out support for any future processor
> >> that turns out not to do byte aligned stores.
> >>
> >> Alan
> >>
> > 
> > Is *that* what we are talking about?  I was added to this conversation
> > in the middle where it had already generalized, so I had no idea.
> 
> No, this is just what brought this craziness to my attention.
> 
> For example, byte- and short-sized circular buffers could not possibly
> be safe either, when the head nears the tail.
> 
> Who has audited global storage and ensured that _every_ byte-sized write
> doesn't happen to be adjacent to some other storage that may not happen
> to be protected by the same (or any) lock?

This was my concern as well.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 22:43                                       ` James Bottomley
  (?)
  (?)
@ 2014-09-09  2:27                                         ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-09  2:27 UTC (permalink / raw)
  To: James Bottomley, Chris Metcalf
  Cc: Peter Hurley, paulmck, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On 09/08/2014 03:43 PM, James Bottomley wrote:
> 
> This was years ago (possibly decades).  We had to implement in-kernel
> unaligned traps for the networking layer because it could access short
> and int fields that weren't of the correct alignment when processing
> packets.  If that's all corrected now, we wouldn't really notice (except
> a bit of a speed up since an unaligned trap effectively places the
> broken out instructions into the bit stream).
> 
> James
> 

Well, ARM doesn't trap, it just silently gives garbage on unaligned
memory references.

	-hpa



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 22:39                                                 ` James Bottomley
  (?)
@ 2014-09-09  2:30                                                   ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-09  2:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: paulmck, Peter Hurley, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On 09/08/2014 03:39 PM, James Bottomley wrote:
> 
> I don't understand what you mean by "pass each other".  Atomicity
> guarantees are not ordering guarantees in a SMP environment.  The
> guarantee is that if you follow the rules when two CPUs update the same
> natural width aligned object simultaneously using the same primitive,
> the result is either one or the other of their updates.  Which one wins
> (the ordering) isn't defined.
> 

I'm trying to figure out why it would possibly make a difference in any
kind of sane system if gcc fuses accesses.

Assuming bigendian for the moment, I would expect that if CPU 1 does a
write of 0x01020304 to address 0 and CPU 2 does a write of 0x0506 to
address 2, that the end result would be either 0x01020304 or 0x01020506.
 Similarly, I would expect that if these operations are both done on the
same CPU in that order, that the result would unambiguously be 0x01020506.

I would strongly suspect an architecture which does not provide those
guarantees is an outlier.

	-hpa



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 23:30                                     ` Peter Hurley
  (?)
@ 2014-09-09  2:56                                       ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-09  2:56 UTC (permalink / raw)
  To: Peter Hurley
  Cc: paulmck, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Mon, 2014-09-08 at 19:30 -0400, Peter Hurley wrote:
> On 09/08/2014 01:50 AM, James Bottomley wrote:
> >> Two things: I think that gcc has given up on combining adjacent writes,
> >> perhaps because unaligned writes on some arches are prohibitive, so
> >> whatever minor optimization was believed to be gained was quickly lost,
> >> multi-fold. (Although this might not be the reason since one would
> >> expect that gcc would know the alignment of the promoted store).
> > 
> > Um, gcc assumes architecturally correct alignment; that's why it pads
> > structures.  Only when accessing packed structures will it use the
> > lowest unit load/store.
> > 
> > if you have a struct { char a; char b; }; and load first a then b with a
> > constant, gcc will obligingly optimise to a short store.
> 
> Um, please re-look at the test above. The exact test you describe is
> coded above and compiled with gcc 4.6.3 cross-compiler for parisc using
> the kernel compiler options.
> 
> In the generated code, please note the _absence_ of a combined write
> to two adjacent byte stores.

So one version of gcc doesn't do it.  Others do because I've been
surprised seeing it in assembly.

> >> But additionally, even if gcc combines adjacent writes _that are part
> >> of the program flow_ then I believe the situation is no worse than
> >> would otherwise exist.
> >>
> >> For instance, given the following:
> >>
> >> struct x {
> >> 	spinlock_t lock;
> >> 	long a;
> >> 	byte b;
> >> 	byte c;
> >> };
> >>
> >> void locked_store_b(struct x *p)
> >> {
> >> 	spin_lock(&p->lock);
> >> 	p->b = 1;
> >> 	spin_unlock(&p->lock);
> >> 	p->c = 2;
> >> }
> >>
> >> Granted, the author probably expects ordered writes of
> >> 	STORE B
> >> 	STORE C
> >> but that's not guaranteed because there is no memory barrier
> >> ordering B before C.
> > 
> > Yes, there is: loads and stores may not migrate into or out of critical
> > sections.
> 
> That's a common misconception.
> 
> The processor is free to re-order this to:
> 
> 	STORE C
> 	STORE B
> 	UNLOCK
> 
> That's because the unlock() only guarantees that:
> 
> Stores before the unlock in program order are guaranteed to complete
> before the unlock completes. Stores after the unlock _may_ complete
> before the unlock completes.
> 
> My point was that even if compiler barriers had the same semantics
> as memory barriers, the situation would be no worse. That is, code
> that is sensitive to memory barriers (like the example I gave above)
> would merely have the same fragility with one-way compiler barriers
> (with respect to the compiler combining writes).
> 
> That's what I meant by "no worse than would otherwise exist".

Actually, that's not correct.  This is actually deja vu with me on the
other side of the argument.  When we first did spinlocks on PA, I argued
as you did: that lock is only a barrier for code after it, and unlock only
for code before it.  The failing case is that you can have a critical section which
performs an atomically required operation and a following unit which
depends on it being performed.  If you begin the following unit before
the atomic requirement, you may end up losing.  It turns out this kind
of pattern is inherent in a lot of mail box device drivers: you need to
set up the mailbox atomically then poke it.  Setup is usually atomic,
deciding which mailbox to prime and actually poking it is in the
following unit.  Priming often involves an I/O bus transaction and if
you poke before priming, you get a misfire.

> >> I see no problem with gcc write-combining in the absence of
> >> memory barriers (as long as alignment is still respected,
> >> ie., the size-promoted write is still naturally aligned). The
> >> combined write is still atomic.
> > 
> > Actual alignment is pretty irrelevant.  That's why all architectures
> > which require alignment also have to implement misaligned traps ... this
> > is a fundamental requirement of the networking code, for instance.
> > 
> > The main problem is gcc thinking there's a misalignment (or a packed
> > structure).  That causes splitting of loads and stores and that destroys
> > atomicity.
> 
> Yeah, the extra requirement I added is basically nonsense, since the
> only issue is what instructions the compiler is emitting. So if compiler
> thinks the alignment is natural and combines the writes -- ok. If the
> compiler thinks the alignment is off and doesn't combine the writes --
> also ok.

Yes, I think I can agree that the only real problem is gcc thinking the
store or load needs splitting.

James



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-09  2:56                                       ` James Bottomley
  (?)
@ 2014-09-09  3:20                                         ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-09  3:20 UTC (permalink / raw)
  To: James Bottomley, Peter Hurley
  Cc: paulmck, One Thousand Gnomes, Jakub Jelinek, Mikael Pettersson,
	Benjamin Herrenschmidt, Richard Henderson, Oleg Nesterov,
	Miroslav Franc, Paul Mackerras, linuxppc-dev, linux-kernel,
	linux-arch, Tony Luck, linux-ia64

On 09/08/2014 07:56 PM, James Bottomley wrote:
>>
>> Yeah, the extra requirement I added is basically nonsense, since the
>> only issue is what instructions the compiler is emitting. So if compiler
>> thinks the alignment is natural and combines the writes -- ok. If the
>> compiler thinks the alignment is off and doesn't combine the writes --
>> also ok.
> 
> Yes, I think I can agree that the only real problem is gcc thinking the
> store or load needs splitting.
> 

That seems much saner, and yes, that applies to any architecture which
needs unaligned references.  Now, if the references are *actually*
unaligned they can end up being split even on x86, especially if they
cross cache line boundaries.

	-hpa



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-09  2:56                                       ` James Bottomley
  (?)
  (?)
@ 2014-09-09  4:30                                         ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-09  4:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Peter Hurley, paulmck, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

Add a short member for proper alignment and one will probably pop out.

Sent from my tablet, pardon any formatting problems.

> On Sep 8, 2014, at 19:56, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
>> On Mon, 2014-09-08 at 19:30 -0400, Peter Hurley wrote:
>> On 09/08/2014 01:50 AM, James Bottomley wrote:
>>>> Two things: I think that gcc has given up on combining adjacent writes,
>>>> perhaps because unaligned writes on some arches are prohibitive, so
>>>> whatever minor optimization was believed to be gained was quickly lost,
>>>> multi-fold. (Although this might not be the reason since one would
>>>> expect that gcc would know the alignment of the promoted store).
>>> 
>>> Um, gcc assumes architecturally correct alignment; that's why it pads
>>> structures.  Only when accessing packed structures will it use the
>>> lowest unit load/store.
>>> 
>>> If you have a struct { char a; char b; }; and store first a then b with a
>>> constant, gcc will obligingly optimise to a single short store.
>> 
>> Um, please re-look at the test above. The exact test you describe is
>> coded above and compiled with gcc 4.6.3 cross-compiler for parisc using
>> the kernel compiler options.
>> 
>> In the generated code, please note the _absence_ of a combined write
>> to two adjacent byte stores.
> 
> So one version of gcc doesn't do it.  Others do because I've been
> surprised seeing it in assembly.
> 
>>>> But additionally, even if gcc combines adjacent writes _that are part
>>>> of the program flow_ then I believe the situation is no worse than
>>>> would otherwise exist.
>>>> 
>>>> For instance, given the following:
>>>> 
>>>> struct x {
>>>>    spinlock_t lock;
>>>>    long a;
>>>>    byte b;
>>>>    byte c;
>>>> };
>>>> 
>>>> void locked_store_b(struct x *p)
>>>> {
>>>>    spin_lock(&p->lock);
>>>>    p->b = 1;
>>>>    spin_unlock(&p->lock);
>>>>    p->c = 2;
>>>> }
>>>> 
>>>> Granted, the author probably expects ordered writes of
>>>>    STORE B
>>>>    STORE C
>>>> but that's not guaranteed because there is no memory barrier
>>>> ordering B before C.
>>> 
>>> Yes, there is: loads and stores may not migrate into or out of critical
>>> sections.
>> 
>> That's a common misconception.
>> 
>> The processor is free to re-order this to:
>> 
>>    STORE C
>>    STORE B
>>    UNLOCK
>> 
>> That's because the unlock() only guarantees that:
>> 
>> Stores before the unlock in program order are guaranteed to complete
>> before the unlock completes. Stores after the unlock _may_ complete
>> before the unlock completes.
>> 
>> My point was that even if compiler barriers had the same semantics
>> as memory barriers, the situation would be no worse. That is, code
>> that is sensitive to memory barriers (like the example I gave above)
>> would merely have the same fragility with one-way compiler barriers
>> (with respect to the compiler combining writes).
>> 
>> That's what I meant by "no worse than would otherwise exist".
> 
> Actually, that's not correct.  This is actually deja vu with me on the
> other side of the argument.  When we first did spinlocks on PA, I argued
> as you did: that lock is only a barrier for code after it, and unlock only
> for code before it.  The failing case is that you can have a critical section which
> performs an atomically required operation and a following unit which
> depends on it being performed.  If you begin the following unit before
> the atomic requirement, you may end up losing.  It turns out this kind
> of pattern is inherent in a lot of mail box device drivers: you need to
> set up the mailbox atomically then poke it.  Setup is usually atomic,
> deciding which mailbox to prime and actually poking it is in the
> following unit.  Priming often involves an I/O bus transaction and if
> you poke before priming, you get a misfire.
> 
>>>> I see no problem with gcc write-combining in the absence of
>>>> memory barriers (as long as alignment is still respected,
>>>> ie., the size-promoted write is still naturally aligned). The
>>>> combined write is still atomic.
>>> 
>>> Actual alignment is pretty irrelevant.  That's why all architectures
>>> which require alignment also have to implement misaligned traps ... this
>>> is a fundamental requirement of the networking code, for instance.
>>> 
>>> The main problem is gcc thinking there's a misalignment (or a packed
>>> structure).  That causes splitting of loads and stores and that destroys
>>> atomicity.
>> 
>> Yeah, the extra requirement I added is basically nonsense, since the
>> only issue is what instructions the compiler is emitting. So if compiler
>> thinks the alignment is natural and combines the writes -- ok. If the
>> compiler thinks the alignment is off and doesn't combine the writes --
>> also ok.
> 
> Yes, I think I can agree that the only real problem is gcc thinking the
> store or load needs splitting.
> 
> James
> 
> 

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-09  4:30                                         ` H. Peter Anvin
  0 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-09  4:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jakub Jelinek, One Thousand Gnomes, Tony Luck, linux-ia64,
	Peter Hurley, Mikael Pettersson, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linux-arch, paulmck, linuxppc-dev,
	Miroslav Franc, Richard Henderson

Add a short member for proper alignment and one will probably pop out.

Sent from my tablet, pardon any formatting problems.

> On Sep 8, 2014, at 19:56, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
>> On Mon, 2014-09-08 at 19:30 -0400, Peter Hurley wrote:
>> On 09/08/2014 01:50 AM, James Bottomley wrote:
>>>> Two things: I think that gcc has given up on combining adjacent writes,
>>>> perhaps because unaligned writes on some arches are prohibitive, so
>>>> whatever minor optimization was believed to be gained was quickly lost,
>>>> multi-fold. (Although this might not be the reason since one would
>>>> expect that gcc would know the alignment of the promoted store).
>>> 
>>> Um, gcc assumes architecturally correct alignment; that's why it pads
>>> structures.  Only when accessing packed structures will it use the
>>> lowest unit load/store.
>>> 
>>> if you have a struct { char a, char b }; and load first a then b with a
>>> constant gcc will obligingly optimise to a short store.
>> 
>> Um, please re-look at the test above. The exact test you describe is
>> coded above and compiled with gcc 4.6.3 cross-compiler for parisc using
>> the kernel compiler options.
>> 
>> In the generated code, please note the _absence_ of a combined write
>> to two adjacent byte stores.
> 
> So one version of gcc doesn't do it.  Others do because I've been
> surprised seeing it in assembly.
> 
>>>> But additionally, even if gcc combines adjacent writes _that are part
>>>> of the program flow_ then I believe the situation is no worse than
>>>> would otherwise exist.
>>>> 
>>>> For instance, given the following:
>>>> 
>>>> struct x {
>>>>    spinlock_t lock;
>>>>    long a;
>>>>    byte b;
>>>>    byte c;
>>>> };
>>>> 
>>>> void locked_store_b(struct x *p)
>>>> {
>>>>    spin_lock(&p->lock);
>>>>    p->b = 1;
>>>>    spin_unlock(&p->lock);
>>>>    p->c = 2;
>>>> }
>>>> 
>>>> Granted, the author probably expects ordered writes of
>>>>    STORE B
>>>>    STORE C
>>>> but that's not guaranteed because there is no memory barrier
>>>> ordering B before C.
>>> 
>>> Yes, there is: loads and stores may not migrate into or out of critical
>>> sections.
>> 
>> That's a common misconception.
>> 
>> The processor is free to re-order this to:
>> 
>>    STORE C
>>    STORE B
>>    UNLOCK
>> 
>> That's because the unlock() only guarantees that:
>> 
>> Stores before the unlock in program order are guaranteed to complete
>> before the unlock completes. Stores after the unlock _may_ complete
>> before the unlock completes.
>> 
>> My point was that even if compiler barriers had the same semantics
>> as memory barriers, the situation would be no worse. That is, code
>> that is sensitive to memory barriers (like the example I gave above)
>> would merely have the same fragility with one-way compiler barriers
>> (with respect to the compiler combining writes).
>> 
>> That's what I meant by "no worse than would otherwise exist".
> 
> Actually, that's not correct.  This is actually deja vu with me on the
> other side of the argument.  When we first did spinlocks on PA, I argued
> as you did: lock only a barrier for code after and unlock for code
> before.  The failing case is that you can have a critical section which
> performs an atomically required operation and a following unit which
> depends on it being performed.  If you begin the following unit before
> the atomic requirement, you may end up losing.  It turns out this kind
> of pattern is inherent in a lot of mail box device drivers: you need to
> set up the mailbox atomically then poke it.  Setup is usually atomic,
> deciding which mailbox to prime and actually poking it is in the
> following unit.  Priming often involves an I/O bus transaction and if
> you poke before priming, you get a misfire.
> 
>>>> I see no problem with gcc write-combining in the absence of
>>>> memory barriers (as long as alignment is still respected,
>>>> ie., the size-promoted write is still naturally aligned). The
>>>> combined write is still atomic.
>>> 
>>> Actual alignment is pretty irrelevant.  That's why all architectures
>>> which require alignment also have to implement misaligned traps ... this
>>> is a fundamental requirement of the networking code, for instance.
>>> 
>>> The main problem is gcc thinking there's a misalignment (or a packed
>>> structure).  That causes splitting of loads and stores and that destroys
>>> atomicity.
>> 
>> Yeah, the extra requirement I added is basically nonsense, since the
>> only issue is what instructions the compiler is emitting. So if compiler
>> thinks the alignment is natural and combines the writes -- ok. If the
>> compiler thinks the alignment is off and doesn't combine the writes --
>> also ok.
> 
> Yes, I think I can agree that the only real problem is gcc thinking the
> store or load needs splitting.
> 
> James
> 
> 
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-09  4:30                                         ` H. Peter Anvin
  0 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-09  4:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jakub Jelinek, One Thousand Gnomes, Tony Luck, linux-ia64,
	Peter Hurley, Mikael Pettersson, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linux-arch, paulmck, linuxppc-dev,
	Miroslav Franc, Richard Henderson

Add a short member for proper alignment and one will probably pop out.

Sent from my tablet, pardon any formatting problems.

> On Sep 8, 2014, at 19:56, James Bottomley <James.Bottomley@HansenPartnersh=
ip.com> wrote:
>=20
>> On Mon, 2014-09-08 at 19:30 -0400, Peter Hurley wrote:
>> On 09/08/2014 01:50 AM, James Bottomley wrote:
>>>> Two things: I think that gcc has given up on combining adjacent writes,=

>>>> perhaps because unaligned writes on some arches are prohibitive, so
>>>> whatever minor optimization was believed to be gained was quickly lost,=

>>>> multi-fold. (Although this might not be the reason since one would
>>>> expect that gcc would know the alignment of the promoted store).
>>>=20
>>> Um, gcc assumes architecturally correct alignment; that's why it pads
>>> structures.  Only when accessing packed structures will it use the
>>> lowest unit load/store.
>>>=20
>>> if you have a struct { char a, char b }; and load first a then b with a
>>> constant gcc will obligingly optimise to a short store.
>>=20
>> Um, please re-look at the test above. The exact test you describe is
>> coded above and compiled with gcc 4.6.3 cross-compiler for parisc using
>> the kernel compiler options.
>>
>> In the generated code, please note the _absence_ of a combined write
>> to two adjacent byte stores.
>
> So one version of gcc doesn't do it.  Others do because I've been
> surprised seeing it in assembly.
>
>>>> But additionally, even if gcc combines adjacent writes _that are part
>>>> of the program flow_ then I believe the situation is no worse than
>>>> would otherwise exist.
>>>>
>>>> For instance, given the following:
>>>>
>>>> struct x {
>>>>    spinlock_t lock;
>>>>    long a;
>>>>    byte b;
>>>>    byte c;
>>>> };
>>>>
>>>> void locked_store_b(struct x *p)
>>>> {
>>>>    spin_lock(&p->lock);
>>>>    p->b = 1;
>>>>    spin_unlock(&p->lock);
>>>>    p->c = 2;
>>>> }
>>>>
>>>> Granted, the author probably expects ordered writes of
>>>>    STORE B
>>>>    STORE C
>>>> but that's not guaranteed because there is no memory barrier
>>>> ordering B before C.
>>>
>>> Yes, there is: loads and stores may not migrate into or out of critical
>>> sections.
>>
>> That's a common misconception.
>>
>> The processor is free to re-order this to:
>>
>>    STORE C
>>    STORE B
>>    UNLOCK
>>
>> That's because the unlock() only guarantees that:
>>
>> Stores before the unlock in program order are guaranteed to complete
>> before the unlock completes. Stores after the unlock _may_ complete
>> before the unlock completes.
>>
>> My point was that even if compiler barriers had the same semantics
>> as memory barriers, the situation would be no worse. That is, code
>> that is sensitive to memory barriers (like the example I gave above)
>> would merely have the same fragility with one-way compiler barriers
>> (with respect to the compiler combining writes).
>>
>> That's what I meant by "no worse than would otherwise exist".
>
> Actually, that's not correct.  This is actually deja vu with me on the
> other side of the argument.  When we first did spinlocks on PA, I argued
> as you did: lock only a barrier for code after and unlock for code
> before.  The failing case is that you can have a critical section which
> performs an atomically required operation and a following unit which
> depends on it being performed.  If you begin the following unit before
> the atomic requirement, you may end up losing.  It turns out this kind
> of pattern is inherent in a lot of mail box device drivers: you need to
> set up the mailbox atomically then poke it.  Setup is usually atomic,
> deciding which mailbox to prime and actually poking it is in the
> following unit.  Priming often involves an I/O bus transaction and if
> you poke before priming, you get a misfire.
>
>>>> I see no problem with gcc write-combining in the absence of
>>>> memory barriers (as long as alignment is still respected,
>>>> ie., the size-promoted write is still naturally aligned). The
>>>> combined write is still atomic.
>>>
>>> Actual alignment is pretty irrelevant.  That's why all architectures
>>> which require alignment also have to implement misaligned traps ... this
>>> is a fundamental requirement of the networking code, for instance.
>>>
>>> The main problem is gcc thinking there's a misalignment (or a packed
>>> structure).  That causes splitting of loads and stores and that destroys
>>> atomicity.
>>
>> Yeah, the extra requirement I added is basically nonsense, since the
>> only issue is what instructions the compiler is emitting. So if compiler
>> thinks the alignment is natural and combines the writes -- ok. If the
>> compiler thinks the alignment is off and doesn't combine the writes --
>> also ok.
>
> Yes, I think I can agree that the only real problem is gcc thinking the
> store or load needs splitting.
>
> James
>
>

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-09  4:30                                         ` H. Peter Anvin
  0 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-09  4:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Peter Hurley, paulmck, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

Add a short member for proper alignment and one will probably pop out.

Sent from my tablet, pardon any formatting problems.

> On Sep 8, 2014, at 19:56, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
>> On Mon, 2014-09-08 at 19:30 -0400, Peter Hurley wrote:
>> On 09/08/2014 01:50 AM, James Bottomley wrote:
>>>> Two things: I think that gcc has given up on combining adjacent writes,
>>>> perhaps because unaligned writes on some arches are prohibitive, so
>>>> whatever minor optimization was believed to be gained was quickly lost,
>>>> multi-fold. (Although this might not be the reason since one would
>>>> expect that gcc would know the alignment of the promoted store).
>>> 
>>> Um, gcc assumes architecturally correct alignment; that's why it pads
>>> structures.  Only when accessing packed structures will it use the
>>> lowest unit load/store.
>>> 
>>> if you have a struct { char a, char b }; and load first a then b with a
>>> constant gcc will obligingly optimise to a short store.
>> 
>> Um, please re-look at the test above. The exact test you describe is
>> coded above and compiled with gcc 4.6.3 cross-compiler for parisc using
>> the kernel compiler options.
>> 
>> In the generated code, please note the _absence_ of a combined write
>> to two adjacent byte stores.
> 
> So one version of gcc doesn't do it.  Others do because I've been
> surprised seeing it in assembly.
> 
>>>> But additionally, even if gcc combines adjacent writes _that are part
>>>> of the program flow_ then I believe the situation is no worse than
>>>> would otherwise exist.
>>>> 
>>>> For instance, given the following:
>>>> 
>>>> struct x {
>>>>    spinlock_t lock;
>>>>    long a;
>>>>    byte b;
>>>>    byte c;
>>>> };
>>>> 
>>>> void locked_store_b(struct x *p)
>>>> {
>>>>    spin_lock(&p->lock);
>>>>    p->b = 1;
>>>>    spin_unlock(&p->lock);
>>>>    p->c = 2;
>>>> }
>>>> 
>>>> Granted, the author probably expects ordered writes of
>>>>    STORE B
>>>>    STORE C
>>>> but that's not guaranteed because there is no memory barrier
>>>> ordering B before C.
>>> 
>>> Yes, there is: loads and stores may not migrate into or out of critical
>>> sections.
>> 
>> That's a common misconception.
>> 
>> The processor is free to re-order this to:
>> 
>>    STORE C
>>    STORE B
>>    UNLOCK
>> 
>> That's because the unlock() only guarantees that:
>> 
>> Stores before the unlock in program order are guaranteed to complete
>> before the unlock completes. Stores after the unlock _may_ complete
>> before the unlock completes.
>> 
>> My point was that even if compiler barriers had the same semantics
>> as memory barriers, the situation would be no worse. That is, code
>> that is sensitive to memory barriers (like the example I gave above)
>> would merely have the same fragility with one-way compiler barriers
>> (with respect to the compiler combining writes).
>> 
>> That's what I meant by "no worse than would otherwise exist".
> 
> Actually, that's not correct.  This is actually deja vu with me on the
> other side of the argument.  When we first did spinlocks on PA, I argued
> as you did: lock only a barrier for code after and unlock for code
> before.  The failing case is that you can have a critical section which
> performs an atomically required operation and a following unit which
> depends on it being performed.  If you begin the following unit before
> the atomic requirement, you may end up losing.  It turns out this kind
> of pattern is inherent in a lot of mail box device drivers: you need to
> set up the mailbox atomically then poke it.  Setup is usually atomic,
> deciding which mailbox to prime and actually poking it is in the
> following unit.  Priming often involves an I/O bus transaction and if
> you poke before priming, you get a misfire.
> 
>>>> I see no problem with gcc write-combining in the absence of
>>>> memory barriers (as long as alignment is still respected,
>>>> ie., the size-promoted write is still naturally aligned). The
>>>> combined write is still atomic.
>>> 
>>> Actual alignment is pretty irrelevant.  That's why all architectures
>>> which require alignment also have to implement misaligned traps ... this
>>> is a fundamental requirement of the networking code, for instance.
>>> 
>>> The main problem is gcc thinking there's a misalignment (or a packed
>>> structure).  That causes splitting of loads and stores and that destroys
>>> atomicity.
>> 
>> Yeah, the extra requirement I added is basically nonsense, since the
>> only issue is what instructions the compiler is emitting. So if compiler
>> thinks the alignment is natural and combines the writes -- ok. If the
>> compiler thinks the alignment is off and doesn't combine the writes --
>> also ok.
> 
> Yes, I think I can agree that the only real problem is gcc thinking the
> store or load needs splitting.
> 
> James
> 
> 

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-09  2:27                                         ` H. Peter Anvin
  (?)
@ 2014-09-09  8:11                                           ` Arnd Bergmann
  -1 siblings, 0 replies; 311+ messages in thread
From: Arnd Bergmann @ 2014-09-09  8:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: James Bottomley, Chris Metcalf, Peter Hurley, paulmck,
	One Thousand Gnomes, Jakub Jelinek, Mikael Pettersson,
	Benjamin Herrenschmidt, Richard Henderson, Oleg Nesterov,
	Miroslav Franc, Paul Mackerras, linuxppc-dev, linux-kernel,
	linux-arch, Tony Luck, linux-ia64

On Monday 08 September 2014 19:27:14 H. Peter Anvin wrote:
> On 09/08/2014 03:43 PM, James Bottomley wrote:
> > 
> > This was years ago (possibly decades).  We had to implement in-kernel
> > unaligned traps for the networking layer because it could access short
> > and int fields that weren't of the correct alignment when processing
> > packets.  If that's all corrected now, we wouldn't really notice (except
> > a bit of a speed up since an unaligned trap effectively places the
> > broken out instructions into the bit stream).

I believe the current line of thinking is to consider it a bug in
the device driver. Those bugs may still exist in some rare legacy
drivers, but anything you'd see in practice should not require unaligned
access any more.

> Well, ARM doesn't trap, it just silently gives garbage on unaligned
> memory references.

ARMv6/v7 (anything that really matters these days) has efficient unaligned
accesses that work. Older CPUs are normally set to trap on unaligned
access, originally for the reason James mentions above, but there are
some even older machines that rely on abusing unaligned accesses to do
byte swapping.

	Arnd

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-09  2:56                                       ` James Bottomley
  (?)
@ 2014-09-09 10:40                                         ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-09 10:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: paulmck, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On 09/08/2014 10:56 PM, James Bottomley wrote:
> On Mon, 2014-09-08 at 19:30 -0400, Peter Hurley wrote:
>> On 09/08/2014 01:50 AM, James Bottomley wrote:
>>>> But additionally, even if gcc combines adjacent writes _that are part
>>>> of the program flow_ then I believe the situation is no worse than
>>>> would otherwise exist.
>>>>
>>>> For instance, given the following:
>>>>
>>>> struct x {
>>>> 	spinlock_t lock;
>>>> 	long a;
>>>> 	byte b;
>>>> 	byte c;
>>>> };
>>>>
>>>> void locked_store_b(struct x *p)
>>>> {
>>>> 	spin_lock(&p->lock);
>>>> 	p->b = 1;
>>>> 	spin_unlock(&p->lock);
>>>> 	p->c = 2;
>>>> }
>>>>
>>>> Granted, the author probably expects ordered writes of
>>>> 	STORE B
>>>> 	STORE C
>>>> but that's not guaranteed because there is no memory barrier
>>>> ordering B before C.
>>>
>>> Yes, there is: loads and stores may not migrate into or out of critical
>>> sections.
>>
>> That's a common misconception.
>>
>> The processor is free to re-order this to:
>>
>> 	STORE C
>> 	STORE B
>> 	UNLOCK
>>
>> That's because the unlock() only guarantees that:
>>
>> Stores before the unlock in program order are guaranteed to complete
>> before the unlock completes. Stores after the unlock _may_ complete
>> before the unlock completes.
>>
>> My point was that even if compiler barriers had the same semantics
>> as memory barriers, the situation would be no worse. That is, code
>> that is sensitive to memory barriers (like the example I gave above)
>> would merely have the same fragility with one-way compiler barriers
>> (with respect to the compiler combining writes).
>>
>> That's what I meant by "no worse than would otherwise exist".
> 
> Actually, that's not correct.  This is actually deja vu with me on the
> other side of the argument.  When we first did spinlocks on PA, I argued
> as you did: lock only a barrier for code after and unlock for code
> before.  The failing case is that you can have a critical section which
> performs an atomically required operation and a following unit which
> depends on it being performed.  If you begin the following unit before
> the atomic requirement, you may end up losing.  It turns out this kind
> of pattern is inherent in a lot of mail box device drivers: you need to
> set up the mailbox atomically then poke it.  Setup is usually atomic,
> deciding which mailbox to prime and actually poking it is in the
> following unit.  Priming often involves an I/O bus transaction and if
> you poke before priming, you get a misfire.

Take it up with the man because this was discussed extensively last
year and it was decided that unlocks would not be full barriers.
Thus the changes to memory-barriers.txt that explicitly note this
and the addition of smp_mb__after_unlock_lock() (for two different
locks; an unlock followed by a lock on the same lock is a full barrier).

Code that expects ordered writes after an unlock needs to explicitly
add the memory barrier.

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 22:47                         ` Peter Hurley
  (?)
@ 2014-09-09 11:14                           ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-09 11:14 UTC (permalink / raw)
  To: H. Peter Anvin, One Thousand Gnomes
  Cc: David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, Paul E. McKenney,
	linuxppc-dev, Miroslav Franc, Richard Henderson, linux-alpha

On 09/08/2014 06:47 PM, Peter Hurley wrote:
> On 09/08/2014 01:59 PM, H. Peter Anvin wrote:
>> On 09/08/2014 10:52 AM, One Thousand Gnomes wrote:
>>> On Fri, 05 Sep 2014 08:41:52 -0700
>>> "H. Peter Anvin" <hpa@zytor.com> wrote:
>>>
>>>> On 09/05/2014 08:31 AM, Peter Hurley wrote:
>>>>>
>>>>> Which is a bit ironic because I remember when Digital had a team
>>>>> working on emulating native x86 apps on Alpha/NT.
>>>>>
>>>>
>>>> Right, because the x86 architecture was obsolete and would never scale...
>>>
>>> Talking about "not scaling" can anyone explain how a "you need to use
>>> set_bit() and friends" bug report scaled into a hundred message plus
>>> discussion about ambiguous properties of processors (and nobody has
>>> audited all the embedded platforms we support yet, or the weirder ARMs)
>>> and a proposal to remove Alpha support.
>>>
>>> Wouldn't it be *much* simpler to do what I suggested in the first place
>>> and use the existing intended for purpose, deliberately put there,
>>> functions for atomic bitops, because they are fast on sane processors and
>>> they work on everything else.

And much simpler how?

By turning a 4-line patch into a 400-line patch?

And what about multi-value assignments for which set_bit() doesn't work, like
the byte-sized ->ctrl_status field? Is the expectation to submit a patch
which fixes that for a system from 1995, but already works for everything else?

The extra complexity comes with real cost: reduced reliability for every other
arch.

>>> I think the whole "removing Alpha EV5" support is basically bonkers. Just
>>> use set_bit in the tty layer. Alpha will continue to work as well as it
>>> always has done and you won't design out support for any future processor
>>> that turns out not to do byte aligned stores.

I thought the overriding design principle of this kernel was to support
what exists now, not design-in support for the future which may have
different requirements than expected anyway.


>> Is *that* what we are talking about?  I was added to this conversation
>> in the middle where it had already generalized, so I had no idea.
> 
> No, this is just what brought this craziness to my attention.
> 
> For example, byte- and short-sized circular buffers could not possibly
> be safe either, when the head nears the tail.
> 
> Who has audited global storage and ensured that _every_ byte-sized write
> doesn't happen to be adjacent to some other storage that may not happen
> to be protected by the same (or any) lock?

And add to this list all bitfields because gcc ignores the type
specifier and only allocates the minimum number of bytes to contain the
declared fields.

Regards,
Peter Hurley



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 19:17                         ` One Thousand Gnomes
  (?)
@ 2014-09-09 11:18                           ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-09 11:18 UTC (permalink / raw)
  To: One Thousand Gnomes, H. Peter Anvin
  Cc: Benjamin Herrenschmidt, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson, linux-alpha

On 09/08/2014 03:17 PM, One Thousand Gnomes wrote:
>>> I think the whole "removing Alpha EV5" support is basically bonkers. Just
>>> use set_bit in the tty layer. Alpha will continue to work as well as it
>>> always has done and you won't design out support for any future processor
>>> that turns out not to do byte aligned stores.
>>>
>>> Alan
>>>
>>
>> Is *that* what we are talking about?  I was added to this conversation
>> in the middle where it had already generalized, so I had no idea.
>>
>> 	-hpa
> 
> Yes there are some flags in the tty layer that are vulnerable to this
> (although they've been vulnerable to it and missing a lock since last
> century with no known ill effects).

That observation cuts both ways; I'm willing to leave it vulnerable with
'no known ill effects' on the Alpha.
 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 17:52                     ` One Thousand Gnomes
  (?)
@ 2014-09-10 20:18                       ` H. Peter Anvin
  -1 siblings, 0 replies; 311+ messages in thread
From: H. Peter Anvin @ 2014-09-10 20:18 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Peter Hurley, Benjamin Herrenschmidt, David Laight,
	Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Oleg Nesterov,
	linux-kernel, Paul Mackerras, Paul E. McKenney, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-alpha

On 09/08/2014 10:52 AM, One Thousand Gnomes wrote:
> 
> I think the whole "removing Alpha EV5" support is basically bonkers. Just
> use set_bit in the tty layer. Alpha will continue to work as well as it
> always has done and you won't design out support for any future processor
> that turns out not to do byte aligned stores.
> 

I think it's pretty safe to say such a processor is unlikely to ever be
created, for the same reason there weren't any 48-bit computers when
32-bit computers ran out of steam: it caused more problems than it
solved, and Alpha pretty much proved that.  The engineering advantages
would have to be so overwhelmingly in favor for someone to want to walk
down that road again.

	-hpa



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-10 20:18                       ` H. Peter Anvin
  (?)
@ 2014-09-10 21:10                         ` Rob Landley
  -1 siblings, 0 replies; 311+ messages in thread
From: Rob Landley @ 2014-09-10 21:10 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: One Thousand Gnomes, Peter Hurley, Benjamin Herrenschmidt,
	David Laight, Jakub Jelinek, linux-arch, Tony Luck, linux-ia64,
	Oleg Nesterov, linux-kernel, Paul Mackerras, Paul E. McKenney,
	linuxppc-dev, Miroslav Franc, Richard Henderson, linux-alpha

On Wed, Sep 10, 2014 at 3:18 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 09/08/2014 10:52 AM, One Thousand Gnomes wrote:
>>
>> I think the whole "removing Alpha EV5" support is basically bonkers. Just
>> use set_bit in the tty layer. Alpha will continue to work as well as it
>> always has done and you won't design out support for any future processor
>> that turns out not to do byte aligned stores.
>>
>
> I think it's pretty safe to say such a processor is unlikely to ever be
> created, for the same reason there weren't any 48-bit computers when
> 32-bit computers ran out of steam: it caused more problems than it
> solved, and Alpha pretty much proved that.  The engineering advantages
> would have to be so overwhelmingly in favor for someone to want to walk
> down that road again.
>
>         -hpa

I note for the record I'm trying to get basic Alpha support working in
http://landley.net/aboriginal/about.html now that qemu at least offers
a qemu-system-alpha, and if I do get it working I may take a stab at
adding alpha support to musl-libc.

It's not just that I want to get cross-platform testing of package
builds working in a convenient/reproducible/portable way on as many
different targets as I can. My project is designed the way it is
because I think losing the ability to reproduce things from first
principles (and thus understand why it was done that way in the first
place) is a _bad_idea_. And an increasingly real-world problem, see
http://wrttn.in/04af1a for example. (And the way NASA lost the plans
for the Saturn V rocket... In fact scientists' general disinclination
to reproduce old experimental results to see if they were actually
true or not comes up on a regular basis, most recent I saw was
http://www.theglobeandmail.com/life/health-and-fitness/health/the-curious-case-of-the-cyclists-unshaved-legs/article20370814/
)

Of course Peter has consistently dismissed my concerns as "academic":

http://lkml.iu.edu/hypermail/linux/kernel/0802.1/4400.html

Personally, I hope the projects I work on _do_ outlive me, and try to
make it easy for newbies to recapitulate phylogeny. That's why I
thought requiring perl dependencies to build was a bad idea, and why I
thought removing 386 support entirely (instead of restricting it to
UP) was a bad idea. The idea that Linux no longer needs to build on
the original machine Linus designed it for, and now it no longer needs
to work on the machine the "arch" directory was created for (and the
first 64 bit machine)... These positions would appear to have a single
consistent proponent, with whom I disagree.

Oh well, just wanted to get that out there,

Rob

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-09 10:40                                         ` Peter Hurley
  (?)
@ 2014-09-10 21:48                                           ` James Bottomley
  -1 siblings, 0 replies; 311+ messages in thread
From: James Bottomley @ 2014-09-10 21:48 UTC (permalink / raw)
  To: Peter Hurley
  Cc: paulmck, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On Tue, 2014-09-09 at 06:40 -0400, Peter Hurley wrote:
> On 09/08/2014 10:56 PM, James Bottomley wrote:
> > On Mon, 2014-09-08 at 19:30 -0400, Peter Hurley wrote:
> >> On 09/08/2014 01:50 AM, James Bottomley wrote:
> >>>> But additionally, even if gcc combines adjacent writes _that are part
> >>>> of the program flow_ then I believe the situation is no worse than
> >>>> would otherwise exist.
> >>>>
> >>>> For instance, given the following:
> >>>>
> >>>> struct x {
> >>>> 	spinlock_t lock;
> >>>> 	long a;
> >>>> 	byte b;
> >>>> 	byte c;
> >>>> };
> >>>>
> >>>> void locked_store_b(struct x *p)
> >>>> {
> >>>> 	spin_lock(&p->lock);
> >>>> 	p->b = 1;
> >>>> 	spin_unlock(&p->lock);
> >>>> 	p->c = 2;
> >>>> }
> >>>>
> >>>> Granted, the author probably expects ordered writes of
> >>>> 	STORE B
> >>>> 	STORE C
> >>>> but that's not guaranteed because there is no memory barrier
> >>>> ordering B before C.
> >>>
> >>> Yes, there is: loads and stores may not migrate into or out of critical
> >>> sections.
> >>
> >> That's a common misconception.
> >>
> >> The processor is free to re-order this to:
> >>
> >> 	STORE C
> >> 	STORE B
> >> 	UNLOCK
> >>
> >> That's because the unlock() only guarantees that:
> >>
> >> Stores before the unlock in program order are guaranteed to complete
> >> before the unlock completes. Stores after the unlock _may_ complete
> >> before the unlock completes.
> >>
> >> My point was that even if compiler barriers had the same semantics
> >> as memory barriers, the situation would be no worse. That is, code
> >> that is sensitive to memory barriers (like the example I gave above)
> >> would merely have the same fragility with one-way compiler barriers
> >> (with respect to the compiler combining writes).
> >>
> >> That's what I meant by "no worse than would otherwise exist".
> > 
> > Actually, that's not correct.  This is actually deja vu with me on the
> > other side of the argument.  When we first did spinlocks on PA, I argued
> > as you did: lock only a barrier for code after and unlock for code
> > before.  The failing case is that you can have a critical section which
> > performs an atomically required operation and a following unit which
> > depends on it being performed.  If you begin the following unit before
> > the atomic requirement, you may end up losing.  It turns out this kind
> > of pattern is inherent in a lot of mail box device drivers: you need to
> > set up the mailbox atomically then poke it.  Setup is usually atomic,
> > deciding which mailbox to prime and actually poking it is in the
> > following unit.  Priming often involves an I/O bus transaction and if
> > you poke before priming, you get a misfire.
> 
> Take it up with the man because this was discussed extensively last
> year and it was decided that unlocks would not be full barriers.
> Thus the changes to memory-barriers.txt that explicitly note this
> and the addition of smp_mb__after_unlock_lock() (for two different
> locks; an unlock followed by a lock on the same lock is a full barrier).
> 
> Code that expects ordered writes after an unlock needs to explicitly
> add the memory barrier.

I don't really care what ARM does; spin locks are full barriers on
architectures that need them.  The driver problem we had that detected
our semi-permeable spinlocks was an LSI 53c875, which is enterprise-class
PCI, so presumably not relevant to ARM anyway.

James



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-10 21:48                                           ` James Bottomley
  (?)
@ 2014-09-10 23:50                                             ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-10 23:50 UTC (permalink / raw)
  To: James Bottomley
  Cc: paulmck, H. Peter Anvin, One Thousand Gnomes, Jakub Jelinek,
	Mikael Pettersson, Benjamin Herrenschmidt, Richard Henderson,
	Oleg Nesterov, Miroslav Franc, Paul Mackerras, linuxppc-dev,
	linux-kernel, linux-arch, Tony Luck, linux-ia64

On 09/10/2014 05:48 PM, James Bottomley wrote:
> On Tue, 2014-09-09 at 06:40 -0400, Peter Hurley wrote:
>> On 09/08/2014 10:56 PM, James Bottomley wrote:
>>> On Mon, 2014-09-08 at 19:30 -0400, Peter Hurley wrote:
>>>> On 09/08/2014 01:50 AM, James Bottomley wrote:
>>>>>> But additionally, even if gcc combines adjacent writes _that are part
>>>>>> of the program flow_ then I believe the situation is no worse than
>>>>>> would otherwise exist.
>>>>>>
>>>>>> For instance, given the following:
>>>>>>
>>>>>> struct x {
>>>>>> 	spinlock_t lock;
>>>>>> 	long a;
>>>>>> 	byte b;
>>>>>> 	byte c;
>>>>>> };
>>>>>>
>>>>>> void locked_store_b(struct x *p)
>>>>>> {
>>>>>> 	spin_lock(&p->lock);
>>>>>> 	p->b = 1;
>>>>>> 	spin_unlock(&p->lock);
>>>>>> 	p->c = 2;
>>>>>> }
>>>>>>
>>>>>> Granted, the author probably expects ordered writes of
>>>>>> 	STORE B
>>>>>> 	STORE C
>>>>>> but that's not guaranteed because there is no memory barrier
>>>>>> ordering B before C.
>>>>>
>>>>> Yes, there is: loads and stores may not migrate into or out of critical
>>>>> sections.
>>>>
>>>> That's a common misconception.
>>>>
>>>> The processor is free to re-order this to:
>>>>
>>>> 	STORE C
>>>> 	STORE B
>>>> 	UNLOCK
>>>>
>>>> That's because the unlock() only guarantees that:
>>>>
>>>> Stores before the unlock in program order are guaranteed to complete
>>>> before the unlock completes. Stores after the unlock _may_ complete
>>>> before the unlock completes.
>>>>
>>>> My point was that even if compiler barriers had the same semantics
>>>> as memory barriers, the situation would be no worse. That is, code
>>>> that is sensitive to memory barriers (like the example I gave above)
>>>> would merely have the same fragility with one-way compiler barriers
>>>> (with respect to the compiler combining writes).
>>>>
>>>> That's what I meant by "no worse than would otherwise exist".
>>>
>>> Actually, that's not correct.  This is actually deja vu with me on the
>>> other side of the argument.  When we first did spinlocks on PA, I argued
>>> as you did: lock only a barrier for code after and unlock for code
>>> before.  The failing case is that you can have a critical section which
>>> performs an atomically required operation and a following unit which
>>> depends on it being performed.  If you begin the following unit before
>>> the atomic requirement, you may end up losing.  It turns out this kind
>>> of pattern is inherent in a lot of mail box device drivers: you need to
>>> set up the mailbox atomically then poke it.  Setup is usually atomic,
>>> deciding which mailbox to prime and actually poking it is in the
>>> following unit.  Priming often involves an I/O bus transaction and if
>>> you poke before priming, you get a misfire.
>>
>> Take it up with the man because this was discussed extensively last
>> year and it was decided that unlocks would not be full barriers.
>> Thus the changes to memory-barriers.txt that explicitly note this
>> and the addition of smp_mb__after_unlock_lock() (for two different
>> locks; an unlock followed by a lock on the same lock is a full barrier).
>>
>> Code that expects ordered writes after an unlock needs to explicitly
>> add the memory barrier.
> 
> I don't really care what ARM does; spin locks are full barriers on
> architectures that need them.  The driver problem we had that detected
> our semi-permeable spinlocks was an LSI 53c875, which is enterprise-class
> PCI, so presumably not relevant to ARM anyway.

Almost certainly ia64 arch_spin_unlock() is not a full barrier.



^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-08 22:47                         ` Peter Hurley
  (?)
@ 2014-09-11 10:04                           ` One Thousand Gnomes
  -1 siblings, 0 replies; 311+ messages in thread
From: One Thousand Gnomes @ 2014-09-11 10:04 UTC (permalink / raw)
  To: Peter Hurley
  Cc: H. Peter Anvin, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson, linux-alpha

> > Is *that* what we are talking about?  I was added to this conversation
> > in the middle where it had already generalized, so I had no idea.
> 
> No, this is just what brought this craziness to my attention.

None of it is craziness. It's the real world leaking into the crazy
delusional world of sequential programming. Machines are going to get
more, not less, parallel.

> For example, byte- and short-sized circular buffers could not possibly
> be safe either, when the head nears the tail.
> 
> Who has audited global storage and ensured that _every_ byte-sized write
> doesn't happen to be adjacent to some other storage that may not happen
> to be protected by the same (or any) lock?

That's a meaningless question. Have you audited it all for correctness in
any other form? Have you mathematically verified the functionality as a
set of formal proofs? If you can't prove it formally, mathematically,
functionally correct, why are you worried about this?

Alpha works; maybe it has a near-theoretical race on that point. It's not
any worse than it was 15 years ago, and nobody has really hit a problem
with it. So from that you can usefully infer that those buffer cases are
not proving to be a real problem.

The tty locks together, on the other hand, are asking to hit it, and the
problems you were trying to fix were the ones that need set_bit() to make
the guarantees.

Alan

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-10 21:48                                           ` James Bottomley
  (?)
@ 2014-09-11 10:23                                             ` Will Deacon
  -1 siblings, 0 replies; 311+ messages in thread
From: Will Deacon @ 2014-09-11 10:23 UTC (permalink / raw)
  To: James Bottomley
  Cc: Peter Hurley, paulmck, H. Peter Anvin, One Thousand Gnomes,
	Jakub Jelinek, Mikael Pettersson, Benjamin Herrenschmidt,
	Richard Henderson, Oleg Nesterov, Miroslav Franc, Paul Mackerras,
	linuxppc-dev, linux-kernel, linux-arch, Tony Luck, linux-ia64

On Wed, Sep 10, 2014 at 10:48:06PM +0100, James Bottomley wrote:
> On Tue, 2014-09-09 at 06:40 -0400, Peter Hurley wrote:
> > >> The processor is free to re-order this to:
> > >>
> > >> 	STORE C
> > >> 	STORE B
> > >> 	UNLOCK
> > >>
> > >> That's because the unlock() only guarantees that:
> > >>
> > >> Stores before the unlock in program order are guaranteed to complete
> > >> before the unlock completes. Stores after the unlock _may_ complete
> > >> before the unlock completes.
> > >>
> > >> My point was that even if compiler barriers had the same semantics
> > >> as memory barriers, the situation would be no worse. That is, code
> > >> that is sensitive to memory barriers (like the example I gave above)
> > >> would merely have the same fragility with one-way compiler barriers
> > >> (with respect to the compiler combining writes).
> > >>
> > >> That's what I meant by "no worse than would otherwise exist".
> > > 
> > > Actually, that's not correct.  This is actually deja vu with me on the
> > > other side of the argument.  When we first did spinlocks on PA, I argued
> > > as you did: lock only a barrier for code after and unlock for code
> > > before.  The failing case is that you can have a critical section which
> > > performs an atomically required operation and a following unit which
> > > depends on it being performed.  If you begin the following unit before
> > > the atomic requirement, you may end up losing.  It turns out this kind
> > > of pattern is inherent in a lot of mail box device drivers: you need to
> > > set up the mailbox atomically then poke it.  Setup is usually atomic,
> > > deciding which mailbox to prime and actually poking it is in the
> > > following unit.  Priming often involves an I/O bus transaction and if
> > > you poke before priming, you get a misfire.
> > 
> > Take it up with the man because this was discussed extensively last
> > year and it was decided that unlocks would not be full barriers.
> > Thus the changes to memory-barriers.txt that explicitly note this
> > and the addition of smp_mb__after_unlock_lock() (for two different
> > locks; an unlock followed by a lock on the same lock is a full barrier).
> > 
> > Code that expects ordered writes after an unlock needs to explicitly
> > add the memory barrier.
> 
> I don't really care what ARM does; spin locks are full barriers on
> architectures that need them.  The driver problem we had that detected
> our semi-permeable spinlocks was an LSI 53c875, which is enterprise-class
> PCI, so presumably not relevant to ARM anyway.

FWIW, unlock is always fully ordered against non-relaxed IO accesses. We
have pretty heavy barriers in readX/writeX to ensure this on ARM/arm64.

PPC do tricks in their unlock to avoid the overhead on each IO access.

Will

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-11 10:04                           ` One Thousand Gnomes
  (?)
@ 2014-09-11 16:16                             ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-11 16:16 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Peter Hurley, H. Peter Anvin, David Laight, Jakub Jelinek,
	linux-arch, Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha

On Thu, Sep 11, 2014 at 11:04:11AM +0100, One Thousand Gnomes wrote:
> > > Is *that* what we are talking about?  I was added to this conversation
> > > in the middle where it had already generalized, so I had no idea.
> > 
> > No, this is just what brought this craziness to my attention.
> 
> None of it is craziness. It's the real world leaking into the crazy
> delusional world of sequential programming. Machines are going to get
> more not less parallel.

Amen to that!!!

> > For example, byte- and short-sized circular buffers could not possibly
> > be safe either, when the head nears the tail.
> > 
> > Who has audited global storage and ensured that _every_ byte-sized write
> > doesn't happen to be adjacent to some other storage that may not happen
> > to be protected by the same (or any) lock?
> 
> Thats a meaningless question. Have you audited it all for correctness of
> any other form. Have you mathematically verified the functionality as a
> set of formal proofs ? If you can't prove its formally mathematically
> functionally correct why are you worried about this ?
> 
> Alpha works, maybe it has a near theoretical race on that point. It's not
> any worse than it was 15 years ago and nobody has really hit a problem
> with it. So from that you can usefully infer that those buffer cases are
> not proving a real problem.

Fair enough, I guess.

But Alpha's limitations were given as a reason to restrict
smp_store_release() and smp_load_acquire() from providing one-byte and
two-byte variants.  Of course, I am OK "probabilistically supporting"
pre-EV56 Alpha CPUs, but only if they don't get in the way of us doing
smp_store_release() and smp_load_acquire() on chars and shorts.  So if
pre-EV56 support has to go in order to allow smp_store_release() and
smp_load_acquire() on small data types, then pre-EV56 support simply
has to go.

Alternatively, one way to support this old hardware on a more
deterministic basis is to make the compiler use ll/sc sequences to do
byte and short accesses.  That would be fine as well.

						Thanx, Paul

> The tty locks together on the other hand are asking to hit it, and the
> problem you were trying to fix were the ones that need set_bit() to make
> the guarantees.
> 
> Alan
> 


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-11 16:16                             ` Paul E. McKenney
  0 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-11 16:16 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, Peter Hurley,
	linux-alpha, Oleg Nesterov, linux-kernel, David Laight,
	Paul Mackerras, H. Peter Anvin, linuxppc-dev, Miroslav Franc,
	Richard Henderson

On Thu, Sep 11, 2014 at 11:04:11AM +0100, One Thousand Gnomes wrote:
> > > Is *that* what we are talking about?  I was added to this conversation
> > > in the middle where it had already generalized, so I had no idea.
> > 
> > No, this is just what brought this craziness to my attention.
> 
> None of it is craziness. It's the real world leaking into the crazy
> delusional world of sequential programming. Machines are going to get
> more not less parallel.

Amen to that!!!

> > For example, byte- and short-sized circular buffers could not possibly
> > be safe either, when the head nears the tail.
> > 
> > Who has audited global storage and ensured that _every_ byte-sized write
> > doesn't happen to be adjacent to some other storage that may not happen
> > to be protected by the same (or any) lock?
> 
> Thats a meaningless question. Have you audited it all for correctness of
> any other form. Have you mathematically verified the functionality as a
> set of formal proofs ? If you can't prove its formally mathematically
> functionally correct why are you worried about this ?
> 
> Alpha works, maybe it has a near theoretical race on that point. It's not
> any worse than it was 15 years ago and nobody has really hit a problem
> with it. So from that you can usefully infer that those buffer cases are
> not proving a real problem.

Fair enough, I guess.

But Alpha's limitations were given as a reason to restrict
smp_store_release() and smp_load_acquire() from providing one-byte and
two-byte variants.  Of course, I am OK "probabilistically supporting"
pre-EV56 Alpha CPUs, but only if they don't get in the way of us doing
smp_store_release() and smp_load_acquire() on chars and shorts.  So if
pre-EV56 support has to go in order to allow smp_store_release() and
smp_load_acquire() on small data types, then pre-EV56 support simply
has to go.

Alternatively, one way to support this old hardware on a more
deterministic basis is to make the compiler use ll/sc sequences to do
byte and short accesses.  That would be fine as well.

						Thanx, Paul

> The tty locks together on the other hand are asking to hit it, and the
> problem you were trying to fix were the ones that need set_bit() to make
> the guarantees.
> 
> Alan
> 

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-11 10:04                           ` One Thousand Gnomes
@ 2014-09-11 20:01                             ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-11 20:01 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: H. Peter Anvin, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson, linux-alpha

On 09/11/2014 06:04 AM, One Thousand Gnomes wrote:
>>> Is *that* what we are talking about?  I was added to this conversation
>>> in the middle where it had already generalized, so I had no idea.
>>
>> No, this is just what brought this craziness to my attention.
> 
> None of it is craziness. It's the real world leaking into the crazy
> delusional world of sequential programming. Machines are going to get
> more not less parallel.
> 
>> For example, byte- and short-sized circular buffers could not possibly
>> be safe either, when the head nears the tail.
>>
>> Who has audited global storage and ensured that _every_ byte-sized write
>> doesn't happen to be adjacent to some other storage that may not happen
>> to be protected by the same (or any) lock?
> 
> That's a meaningless question. Have you audited it all for correctness of
> any other form? Have you mathematically verified the functionality as a
> set of formal proofs? If you can't prove it's formally, mathematically,
> functionally correct, why are you worried about this?
> 
> Alpha works, maybe it has a near theoretical race on that point. It's not
> any worse than it was 15 years ago and nobody has really hit a problem
> with it. So from that you can usefully infer that those buffer cases are
> not proving to be a real problem.
>
> The tty locks together on the other hand are asking to hit it, and the
> problems you were trying to fix were the ones that need set_bit() to make
> the guarantees.

So a problem that no one has ever complained about on _any_ arch is suddenly
a problem on a subset of Alpha cpus, but a problem I know exists on Alpha
isn't important because no one's filed a bug about it?

The only Alpha person in this discussion has come out clearly in favor
of dropping EV4/5 support.

The fact is that the kernel itself is much more parallel than it was
15 years ago, and that trend is going to continue. Paired with the fact
that the Alpha is the least-parallel-friendly arch, this makes improving
parallelism and correctness even harder within kernel subsystems; harder
than it has to be and harder than it should be.

Linus has repeatedly stated that non-arch code should be as
arch-independent as possible, so I believe that working around problems
created by a cpu from 1995 _which no other arch exhibits_ is ludicrous.
Especially in generic kernel code.

That said, if the Alpha community wants to keep _actively_ supporting
the Alpha arch, fine. They could be working toward solutions for
making Alpha workarounds in generic kernel code unnecessary. If that means
compiler changes, ok. If that means arch-independent macros, well, they
can float that idea.

Or if they're comfortable with the status quo, also fine. By that, I mean
the Alpha arch gets no workarounds in generic kernel code, and if something
goes a little sideways only on Alpha, that's to be expected.

As Paul pointed out, a good first step would be for the Alpha community
to contribute byte and short versions of smp_load_acquire() and
smp_store_release() so that the rest of the kernel community can make
forward progress on more parallelism without Alpha-only limitations.
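[Editorial sketch: the adjacent-data hazard that motivates byte-granular accessors can be illustrated in plain C, with hypothetical names, by simulating in a single thread the cross-CPU interleaving described earlier in the thread: a byte store emulated as a non-atomic read-modify-write of the containing word can silently discard a concurrent update to a neighbouring byte.]

```c
#include <stdint.h>
#include <string.h>

/* Two adjacent flags, nominally protected by two different locks. */
struct two_flags { uint8_t a; uint8_t b; };

/*
 * Simulate the interleaving possible on a CPU without byte stores:
 * "CPU 0" writes f.a by loading the containing word, and while it holds
 * that stale copy, "CPU 1" sets f.b; when CPU 0 writes its modified word
 * back, CPU 1's update is lost.  Returns the final value of b, which a
 * correct byte store would leave at 1.
 */
static uint8_t demo_torn_write(void)
{
	struct two_flags f = { 0, 0 };
	struct two_flags copy;

	memcpy(&copy, &f, sizeof(f));	/* CPU 0: load containing word  */
	f.b = 1;			/* CPU 1: set the adjacent byte */
	copy.a = 1;			/* CPU 0: modify its own byte   */
	memcpy(&f, &copy, sizeof(f));	/* CPU 0: write word back, torn */

	return f.b;			/* 0: CPU 1's store was lost    */
}
```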

Regards,
Peter Hurley


* Re: bit fields && data tearing
  2014-09-11 20:01                             ` Peter Hurley
@ 2014-09-14 23:24                               ` One Thousand Gnomes
  -1 siblings, 0 replies; 311+ messages in thread
From: One Thousand Gnomes @ 2014-09-14 23:24 UTC (permalink / raw)
  To: Peter Hurley
  Cc: H. Peter Anvin, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson, linux-alpha

> So a problem that no one has ever complained about on _any_ arch is suddenly
> a problem on a subset of Alpha cpus, but a problem I know exists on Alpha
> isn't important because no one's filed a bug about it?

Yes - because if you think about it, that tells you that nobody is hitting
it with the old code and it probably doesn't matter.
 
> The only Alpha person in this discussion has come out clearly in favor
> of dropping EV4/5 support.

That's not a statistically valid sample size, btw.

Plus as I pointed out (and you ignored) you are shutting out any future
processors with this kind of oddity, and you have not audited all the
embedded devices we support or may support in future.
 
> The fact is that the kernel itself is much more parallel than it was
> 15 years ago, and that trend is going to continue. Paired with the fact
> > that the Alpha is the least-parallel-friendly arch, this makes improving
> parallelism and correctness even harder within kernel subsystems; harder
> than it has to be and harder than it should be.
> 
> Linus has repeatedly stated that non-arch code should be as
> arch-independent as possible

That's why many many years ago we added set_bit() and the other bit
functions. On sane processors they are very fast. On insane ones they
work. They understand byte tearing, they understand store ordering (which
a simple variable does not so you've got to audit all your memory
barriers too). In many cases they are faster than using memory barriers
to guide the compiler because they invalidate less and allow the compiler
more freedom.
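[Editorial sketch: the guarantee Alan describes boils down to an atomic OR on the word containing the bit, so neighbouring bits can never be torn or lost. A userspace illustration with hypothetical sketch_* names, using GCC's __atomic builtins in place of the kernel's per-arch implementations:]

```c
#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/*
 * Userspace sketch of set_bit()/test_bit() semantics.  The atomic OR on
 * the containing word is what makes this safe even on CPUs that cannot
 * write a single byte: concurrent updates to other bits in the same word
 * are never clobbered.
 */
static inline void sketch_set_bit(unsigned int nr, volatile unsigned long *addr)
{
	__atomic_fetch_or(&addr[nr / BITS_PER_LONG],
			  1UL << (nr % BITS_PER_LONG), __ATOMIC_RELAXED);
}

static inline int sketch_test_bit(unsigned int nr,
				  const volatile unsigned long *addr)
{
	return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}
```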

All this started because I suggested you use set_bit and friends and for
some reason you've decided to try and find another way to do it. We have
the bit operations for a reason. On x86 they are very very fast, on
uniprocessor anything they are very fast, on multiprocessor in general
they are very fast, or you are dealing with boxes that have sanity
problems of other kinds.

Alan


* Re: bit fields && data tearing
  2014-09-14 23:24                               ` One Thousand Gnomes
@ 2014-09-22 19:51                                 ` Paul E. McKenney
  -1 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-22 19:51 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Peter Hurley, H. Peter Anvin, David Laight, Jakub Jelinek,
	linux-arch, Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha, ink, mattst88

On Mon, Sep 15, 2014 at 12:24:27AM +0100, One Thousand Gnomes wrote:
> > So a problem that no one has ever complained about on _any_ arch is suddenly
> > a problem on a subset of Alpha cpus, but a problem I know exists on Alpha
> > isn't important because no one's filed a bug about it?
> 
> Yes - because if you think about it that tells you that nobody is hitting
> it with the old code and it probably doesn't matter.
> 
> > The only Alpha person in this discussion has come out clearly in favor
> > of dropping EV4/5 support.
> 
> That's not a statistically valid sample size btw

OK, adding the other two Alpha Port maintainers on CC.

Attempted summary for their benefit:

o	There was a bug involving older Alpha CPUs using 32-bit
	memory-reference operations to do smaller memory accesses.
	The suggested resolution was to use set_bit().

o	Peter Hurley called out a number of theoretical issues with
	CPUs lacking 8-bit and 16-bit memory-reference instructions,
	for example, adjacent 8-bit variables protected by different
	locks not being safe on such CPUs.

o	Michael Cree pointed out that some of these issues actually
	happen in the X server ever since the libpciaccess change.
	Michael would like to compile for Alpha with BWX (thus allowing
	8-bit and 16-bit memory references, but disallowing pre-EV56
	CPUs) in order to make the X server (and thus Debian) work
	better on newer Alpha CPUs.

	Given that Michael Cree maintains the list of Alpha systems
	running Linux, I took this as my cue to provide a couple of
	patches to that effect.

o	Michael Cree also noted that pre-EV56 Alpha CPUs really can
	do 8-bit and 16-bit accesses in an SMP-safe manner via LL/SC,
	but that this requires some hacking on the compilers.

o	Alan Cox argued that we should support pre-EV56 Alpha CPUs
	without any special defense against issues that might arise
	from their lack of 8-bit and 16-bit memory-reference
	instructions, as you can see above.

Richard, Ivan, Matt, thoughts from your perspectives as Alpha Port
maintainers?

> Plus as I pointed out (and you ignored) you are shutting out any future
> processors with this kind of oddity, and you have not audited all the
> embedded devices we support or may support in future.

True enough, but then again, the Alpha architects did feel the need to
add 8-bit and 16-bit memory reference instructions in EV56.  In addition,
if there are future processors that don't provide 8-bit and 16-bit memory
reference instructions, atomic instructions can be used as a fallback.
This fallback is in fact similar to the set_bit approach.
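[Editorial sketch of that fallback, not a kernel API: a compare-and-swap loop plays the role of Alpha's LL/SC pair, and little-endian byte placement within the word is assumed. Unlike a plain load/modify/store of the containing word, a concurrent update to a neighbouring byte makes the CAS fail and the loop retry, so no adjacent data is torn.]

```c
#include <stdint.h>

/*
 * Emulate a single-byte store with an atomic read-modify-write of the
 * aligned 32-bit word containing it (hypothetical helper; assumes
 * little-endian byte order for the shift computation).
 */
static void byte_store_via_cas(uint8_t *p, uint8_t v)
{
	uintptr_t addr = (uintptr_t)p;
	uint32_t *word = (uint32_t *)(addr & ~(uintptr_t)3); /* aligned word */
	unsigned int shift = (unsigned int)(addr & 3) * 8;   /* little-endian */
	uint32_t old, updated;

	old = __atomic_load_n(word, __ATOMIC_RELAXED);
	do {
		/* Replace only our byte within the cached copy of the word. */
		updated = (old & ~((uint32_t)0xff << shift)) |
			  ((uint32_t)v << shift);
	} while (!__atomic_compare_exchange_n(word, &old, updated, 0,
					      __ATOMIC_RELAXED,
					      __ATOMIC_RELAXED));
}
```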

> > The fact is that the kernel itself is much more parallel than it was
> > 15 years ago, and that trend is going to continue. Paired with the fact
> > that the Alpha is the least-parallel-friendly arch, makes improving
> > parallelism and correctness even harder within kernel subsystems; harder
> > than it has to be and harder than it should be.
> > 
> > Linus has repeatedly stated that non-arch code should be as
> > arch-independent as possible
> 
> That's why many many years ago we added set_bit() and the other bit
> functions. On sane processors they are very fast. On insane ones they
> work. They understand byte tearing, they understand store ordering (which
> a simple variable does not so you've got to audit all your memory
> barriers too). In many cases they are faster than using memory barriers
> to guide the compiler because they invalidate less and allow the compiler
> more freedom.
> 
> All this started because I suggested you use set_bit and friends and for
> some reason you've decided to try and find another way to do it. We have
> the bit operations for a reason. On x86 they are very very fast, on
> uniprocessor anything they are very fast, on multiprocessor in general
> they are very fast, or you are dealing with boxes that have sanity
> problems of other kinds.

Indeed, these email threads do tend to examine alternatives from time
to time.  ;-)

							Thanx, Paul



* Re: bit fields && data tearing
@ 2014-09-22 19:51                                 ` Paul E. McKenney
  0 siblings, 0 replies; 311+ messages in thread
From: Paul E. McKenney @ 2014-09-22 19:51 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Peter Hurley, H. Peter Anvin, David Laight, Jakub Jelinek,
	linux-arch, Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, linuxppc-dev, Miroslav Franc, Richard Henderson,
	linux-alpha, ink, mattst88

On Mon, Sep 15, 2014 at 12:24:27AM +0100, One Thousand Gnomes wrote:
> > So a problem that no one has ever complained about on _any_ arch is suddenly
> > a problem on a subset of Alpha cpus, but a problem I know exists on Alpha
> > isn't important because no one's filed a bug about it?
> 
> Yes - because if you think about it that tells you that nobody is hitting
> it with the old code and it probably doesn't matter.
> 
> > The only Alpha person in this discussion has come out clearly in favor
> > of dropping EV4/5 support.
> 
> That's not a statistically valid sample size btw

OK, adding the other two Alpha Port maintainers on CC.

Attempted summary for their benefit:

o	There was a bug involving older Alpha CPUs using 32-bit
	memory-reference operations to do smaller memory accesses.
	The suggested resolution was to use set_bit().

o	Peter Hurley called out a number of theoretical issues with
	CPUs lacking 8-bit and 16-bit memory-reference instructions,
	for example, adjacent 8-bit variables protected by different
	locks not being safe on such CPUs.

o	Michael Cree pointed out that some of these issues actually
	happen in the X server ever since the libpciaccess change.
	Michael would like to compile for Alpha with BWX (thus allowing
	8-bit and 16-bit memory references, but disallowing pre-EV56
	CPUs) in order to make the X server (and thus Debian) work
	better on newer Alpha CPUs.

	Given that Michael Cree maintains the list of Alpha systems
	running Linux, I took this as my cue to provide a couple of
	patches to that effect.

o	Michael Cree also noted that pre-EV56 Alpha CPUs really can
	do 8-bit and 16-bit accesses in an SMP-safe manner via LL/SC,
	but that this requires some hacking on the compilers.

o	Alan Cox argued that we should support pre-EV56 Alpha CPUs
	without any special defense against issues that might arise
	from their lack of 8-bit and 16-bit memory-reference
	instructions, as you can see above.

Richard, Ivan, Matt, thoughts from your perspectives as Alpha Port
maintainers?

> Plus as I pointed out (and you ignored) you are shutting out any future
> processors with this kind of oddity, and you have not audited all the
> embedded devices we support or may support in future.

True enough, but then again, the Alpha architects did feel the need to
add 8-bit and 16-bit memory reference instructions in EV56.  In addition,
if there are future processors that don't provide 8-bit and 16-bit memory
reference instructions, atomic instructions can be used as a fallback.
This fallback is in fact similar to the set_bit approach.

> > The fact is that the kernel itself is much more parallel than it was
> > 15 years ago, and that trend is going to continue. Paired with the fact
> > that the Alpha is the least-parallel-friendly arch, makes improving
> > parallelism and correctness even harder within kernel subsystems; harder
> > than it has to be and harder than it should be.
> > 
> > Linus has repeatedly stated that non-arch code should be as
> > arch-independent as possible
> 
> That's why many many years ago we added set_bit() and the other bit
> functions. On sane processors they are very fast. On insane ones they
> work. They understand byte tearing, they understand store ordering (which
> a simple variable does not so you've got to audit all your memory
> barriers too). In many cases they are faster than using memory barriers
> to guide the compiler because they invalidate less and allow the compiler
> more freedom.
> 
> All this started because I suggested you use set_bit and friends and for
> some reason you've decided to try and find another way to do it. We have
> the bit operations for a reason. On x86 they are very very fast, on
> uniprocessor anything they are very fast, on multiprocessor in general
> they are very fast, or you are dealing with boxes that have sanity
> problems of other kinds.

Indeed, these email threads do tend to examine alternatives from time
to time.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-14 23:24                               ` One Thousand Gnomes
  (?)
@ 2014-09-23 18:19                                 ` Peter Hurley
  -1 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-23 18:19 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: H. Peter Anvin, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson, linux-alpha

On 09/14/2014 07:24 PM, One Thousand Gnomes wrote:
>> So a problem that no one has ever complained about on _any_ arch is suddenly
>> a problem on a subset of Alpha cpus, but a problem I know exists on Alpha
>> isn't important because no one's filed a bug about it?
> 
> Yes - because if you think about it that tells you that nobody is hitting
> it with the old code and it probably doesn't matter.

I don't understand this reply.

No one ever reported that the tty bitfield was being corrupted on any arch,
even though the serialization was broken because the bitfield was being
accessed with different locks. This was likely never discovered because
the corrupted tty state was unlikely to be reproduced easily.

Nevertheless, it needs fixing because it's wrong; so I did.

You pointed out my original fix was incorrect because Alpha CPUs could
still overwrite adjacent state if only bytes were used as separate memory
locations for each state. Ok.

Likewise, I'm pointing out that byte-sized circular buffers will also be
corrupted under certain circumstances (when the head comes near the tail).

Your claim that no one has complained about it (other than me) is no
different than if I had dismissed your objection to my original fix by
saying no one has complained about it.

There could be many motives for why someone with a 20-year-old processor
that sees data corruption occasionally has not filed a bug about it, even
if they did bother to track down the original cause (which is pretty unlikely).

>> The only Alpha person in this discussion has come out clearly in favor
>> of dropping EV4/5 support.
> 
> That's not a statistically valid sample size btw
> 
> Plus as I pointed out (and you ignored) you are shutting out any future
> processors with this kind of oddity, and you have not audited all the
> embedded devices we support or may support in future.

I did not ignore this point.

Firstly, I thought Peter's reply was on-point:

On 09/10/2014 04:18 PM, H. Peter Anvin wrote:
> On 09/08/2014 10:52 AM, One Thousand Gnomes wrote:
>>
>> I think the whole "removing Alpha EV5" support is basically bonkers. Just
>> use set_bit in the tty layer. Alpha will continue to work as well as it
>> always has done and you won't design out support for any future processor
>> that turns out not to do byte aligned stores.
>>
> 
> I think it's pretty safe to say such a processor is unlikely to ever be
> created, for the same reason there weren't any 48-bit computers when
> 32-bit computers ran out of steam: it caused more problems than it
> solved, and Alpha pretty much proved that.  The engineering advantages
> would have to be so overwhelmingly in favor for someone to want to walk
> down that road again.

Secondly, I replied:

On 09/09/2014 07:14 AM, Peter Hurley wrote:
> I thought the overriding design principle of this kernel was to support
> what exists now, not design-in support for the future which may have
> different requirements than expected anyway.

>> The fact is that the kernel itself is much more parallel than it was
>> 15 years ago, and that trend is going to continue. Paired with the fact
>> that the Alpha is the least-parallel-friendly arch, makes improving
>> parallelism and correctness even harder within kernel subsystems; harder
>> than it has to be and harder than it should be.
>>
>> Linus has repeatedly stated that non-arch code should be as
>> arch-independent as possible
> 
> That's why many many years ago we added set_bit() and the other bit
> functions. On sane processors they are very fast. On insane ones they
> work. They understand byte tearing, they understand store ordering (which
> a simple variable does not so you've got to audit all your memory
> barriers too). In many cases they are faster than using memory barriers
> to guide the compiler because they invalidate less and allow the compiler
> more freedom.
> 
> All this started because I suggested you use set_bit and friends and for
> some reason you've decided to try and find another way to do it. We have
> the bit operations for a reason. On x86 they are very very fast, on
> uniprocessor anything they are very fast, on multiprocessor in general
> they are very fast, or you are dealing with boxes that have sanity
> problems of other kinds.

1. The question of why I chose to solve this problem without
set_bit() et al. I already answered:

On 09/09/2014 07:14 AM, Peter Hurley wrote:
> And much simpler how?
> 
> By turning a 4-line patch into a 400-line patch?
> 
> And what about multi-value assignments for which set_bit() doesn't work, like
> the byte-sized ->ctrl_status field? Is the expectation to submit a patch
> which fixes that for a system from 1995, but already works for everything else?
> 
> The extra complexity comes with real cost: reduced reliability for every other
> arch.

2. I'm not sure where you arrived at the conclusion that set_bit() et al. is
fast on x86. Besides the heavy-duty instructions (lgdt, mov cr0, etc), the
bus-locked bit instructions are among the most expensive instructions in the
kernel.

I've attached a smoke test I use for lightweight validation of pty slave reads.
If you run this smoke test under perf, almost every high-value hit is a
bus-locked bit instruction. Check out canon_copy_from_read_buf() which performs
the actual copy of data from the line discipline input buffer to the __user
buffer; the clear_bit() is the performance hit. Check out cpu_idle_loop()...

Regards,
Peter Hurley

How I run perf:
$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
$ echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
$ perf record -a -g ./tty_test1
$ perf report [-k wherever your vmlinux is]

--- >% ---
/*
 * tty_test1.cpp
 *
 * Description: Validate error-free tty read and write
 *
 * This test repeatedly writes a test pattern from master
 * to slave and verifies the result.
 *
 * Q: Does the slave receive all the data error-free?
 *
 * Expected: PASS
 * Validate: build w/ -DVALIDATE to fail the test
 * Verify: build w/ -DVERIFY to crc-compare the received data
 *         (requires boost)
 *
 * Build: g++ -Wall -o tty_test1 tty_test1.cpp
 *
 */

#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <termios.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <stdarg.h>
#include <signal.h>
#include <limits.h>
#include <unistd.h>

#ifdef VERIFY
#include <boost/crc.hpp>

static uint32_t reference_crc32;
#endif

const unsigned loops_per_crc = 839;  /* prime */
static struct termios termios;


#define XFER_LEN 173	/* prime smaller than PATTERN_SIZE */
static char pattern[] = \
	/*          1         2         3         4         5
	 * 12345678901234567890123456789012345678901234567890 */
	  "+_)(*&^%$#@!~=-0987654321`|}{POIUYTREWQ][poiuytrew"	       /*  50 */
	  "q:LKJHGFDSA;lkjhgfdsa?><MNBVCXZ/.,mnbvcxz"		       /* +41 */
	  "\"\'\\"						       /* + 3 */
	  "+_)(*&^%$#@!~=-0987654321`|}{POIUYTREWQ][poiuytrew"	       /*  50 */
	  "q:LKJHGFDSA;lkjhgfdsa?><MNBV\n";			       /* +29 */
								       /* 173 */


#define error_exit(f, args...)				       		       \
	printf_exit("%s: " f ": %s (code: %d)\n", __func__, ##args,	       \
						  strerror(errno), errno)

#define msg_exit(f, args...) \
	printf_exit("%s: " f "\n", __func__, ##args)


static void printf_exit(const char *f, ...) {
	va_list va;

	va_start(va, f);
	vfprintf(stderr, f, va);
	va_end(va);

	exit(EXIT_FAILURE);
}

static void child(int fd) {
	unsigned char buffer[MAX_INPUT];
#ifdef VERIFY
	boost::crc_32_type crc32;
#endif
	unsigned n = 0;
	unsigned loops = 0;

	printf("Starting slave read loop...\n");

	while (1) {
		int c = read(fd, buffer, sizeof(buffer));
		if (!c)
			break;
		if (c != XFER_LEN) {
			if (c > 0)
				printf("%.*s", (int)c, buffer);
			error_exit("read slave: %d", c);
		}
#ifdef VERIFY
		crc32.process_bytes(buffer, c);
#endif
		if (++n >= loops_per_crc) {
#ifdef VERIFY
			if (crc32() != reference_crc32) {
				errno = EIO;
				error_exit("failed crc: reference: %x != %x",
					   reference_crc32, crc32());
			}
			crc32.reset();
#endif
#ifdef PROGRESS
			printf(".");
#endif
			n = 0;
		}
		loops++;
	}

	printf("Success: %u loops x %d-byte pattern\n", loops, XFER_LEN);
}

static void master(int fd) {
	int n;
	int c = strlen(pattern);
	unsigned loops = 1000000U;

	printf("Starting master write loop...\n");

	while (loops--) {
		n = write(fd, pattern, c);
		if (n < 0)
			error_exit("master write");
		if (n != c)
			msg_exit("short write of pattern");
	}

	n = write(fd, &termios.c_cc[VEOF], 1);
	if (n < 0)
		error_exit("writing EOF");
	if (n != 1)
		msg_exit("short write of EOF");

}

int main() {
	int fd;
	char pts_name[24];
	int ptn, unlock = 0;
	pid_t child_id, id;
	int status;

	setbuf(stdout, NULL);

#ifdef VERIFY
	boost::crc_32_type reference;

	for (unsigned i = 0; i < loops_per_crc; i++)
		reference.process_bytes(pattern, XFER_LEN);
	reference_crc32 = reference();
#endif

#ifdef VALIDATE
	pattern[50] |= 0x80;
#endif

	fd = open("/dev/ptmx", O_RDWR);
	if (fd < 0)
		error_exit("opening pty master");
	if (ioctl(fd, TIOCSPTLCK, &unlock))
		error_exit("unlocking pty pair");
	if (ioctl(fd, TIOCGPTN, &ptn))
		error_exit("getting pty #");
	snprintf(pts_name, sizeof(pts_name), "/dev/pts/%d", ptn);

	if (tcgetattr(fd, &termios))
		error_exit("getting termios");
	termios.c_lflag &= ~ECHO;
	if (tcsetattr(fd, TCSANOW, &termios))
		error_exit("setting termios");

	switch (child_id = fork()) {
	case -1:
		error_exit("forking child");

	case 0: /* child */
		close(fd);
		fd = open(pts_name, O_RDWR);
		if (fd < 0)
			error_exit("opening pty slave");
		child(fd);
		break;

	default: /* parent */
		master(fd);

		id = waitpid(child_id, &status, 0);
		if (id < 0 || id != child_id)
			error_exit("waiting for child");
		if (id != child_id)
			msg_exit("waitpid returned %d, expected %d", id, child_id);
		break;
	}

	/* child and parent exit here */
	return 0;
}


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-23 18:19                                 ` Peter Hurley
  0 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-23 18:19 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Jakub Jelinek, linux-arch, Tony Luck, linux-ia64, linux-alpha,
	Oleg Nesterov, linux-kernel, David Laight, Paul Mackerras,
	H. Peter Anvin, Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson

On 09/14/2014 07:24 PM, One Thousand Gnomes wrote:
>> So a problem that no one has ever complained about on _any_ arch is suddenly
>> a problem on a subset of Alpha cpus, but a problem I know exists on Alpha
>> isn't important because no one's filed a bug about it?
> 
> Yes - because if you think about it that tells you that nobody is hitting
> it with the old code and it probably doesn't matter.

I don't understand this reply.

No one ever reported that the tty bitfield was being corrupted on any arch,
even though the serialization was broken because the bitfield was being
accessed with different locks. This was likely never discovered because
the corrupted tty state was unlikely to be reproduced easily.

Nevertheless, it needs fixing because it's wrong; so I did.

You pointed out my original fix was incorrect because Alpha CPUs could
still overwrite adjacent state if only bytes were used as separate memory
locations for each state. Ok.

Likewise, I'm pointing that byte-sized circular buffers will also be 
corrupted under certain circumstances (when the head comes near the tail).

Your claim that no one has complained about it (other than me) is no
different than if I had dismissed your objection to my original fix by
saying no one has complained about it.

There could be many motives for why someone with a 20-year old processor
that sees data corruption occasionally has not filed a bug about it, even
if they did bother to track down the original cause (which is pretty unlikely).

>> The only Alpha person in this discussion has come out clearly in favor
>> of dropping EV4/5 support.
> 
> That's not a statistically valid sample size btw
> 
> Plus as I pointed out (and you ignored) you are shutting out any future
> processors with this kind of oddity, and you have not audited all the
> embedded devices we support or may support in future.

I did not ignore this point.

Firstly, I thought Peter's reply was on-point:

On 09/10/2014 04:18 PM, H. Peter Anvin wrote:
> On 09/08/2014 10:52 AM, One Thousand Gnomes wrote:
>>
>> I think the whole "removing Alpha EV5" support is basically bonkers. Just
>> use set_bit in the tty layer. Alpha will continue to work as well as it
>> always has done and you won't design out support for any future processor
>> that turns out not to do byte aligned stores.
>>
> 
> I think it's pretty safe to say such a processor is unlikely to ever be
> created, for the same reason there weren't any 48-bit computers when
> 32-bit computers ran out of steam: it caused more problems than it
> solved, and Alpha pretty much proved that.  The engineering advantages
> would have to be so overwhelmingly in favor for someone to want to walk
> down that road again.

Secondly, I replied:

On 09/09/2014 07:14 AM, Peter Hurley wrote:
> I thought the overriding design principle of this kernel was to support
> what exists now, not design-in support for the future which may have
> different requirements than expected anyway.

>> The fact is that the kernel itself is much more parallel than it was
>> 15 years ago, and that trend is going to continue. Paired with the fact
>> that the Alpha is the least-parallel-friendly arch, makes improving
>> parallelism and correctness even harder within kernel subsystems; harder
>> than it has to be and harder than it should be.
>>
>> Linus has repeatedly stated that non-arch code should be as
>> arch-independent as possible
> 
> That's why many many years ago we added set_bit() and the other bit
> functions. On sane processors they are very fast. On insane ones they
> work. They understand byte tearing, they understand store ordering (which
> a simple variable does not so you've got to audit all your memory
> barriers too). In many cases they are faster than using memory barriers
> to guide the compiler because they invalidate less and allow the compiler
> more freedom.
> 
> All this started because I suggested you use set_bit and friends and for
> some reason you've decided to try and find another way to do it. We have
> the bit operations for a reason. On x86 they are very very fast, on
> uniprocessor anything they are very fast, on multiprocessor in general
> they are very fast, or you are dealing with boxes that have sanity
> problems of other kinds.

1. The question of why I chose to solve this problem without
set_bit(), et.al. I already answered:

On 09/09/2014 07:14 AM, Peter Hurley wrote:
> And much simpler how?
> 
> By turning a 4- line patch into a 400- line patch?
> 
> And what about multi-value assignments for which set_bit() doesn't work, like
> the byte-sized ->ctrl_status field? Is the expectation to submit a patch
> which fixes that for a system from 1995, but already works for everything else?
> 
> The extra complexity comes with real cost; reduced reliability for every other
> arch .

2. I'm not sure where you arrived at the conclusion that set_bit() et.al. is
fast on x86. Besides the heavy-duty instructions (lgdt, mov cr0, etc), the
bus-locked bit instructions are among the most expensive instructions in the
kernel.

I've attached a smoke test I use for lightweight validation of pty slave reads.
If you run this smoke test under perf, almost every high-value hit is a
bus-locked bit instruction. Check out canon_copy_from_read_buf() which performs
the actual copy of data from the line discipline input buffer to the __user
buffer; the clear_bit() is the performance hit. Check out cpu_idle_loop()...

Regards,
Peter Hurley

How I run perf:
$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
$ echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
$ perf record -a -g ./tty_test1
$ perf report [-k whereever your vmlinux is ]

--- >% ---
/*
 * tty_test1.cpp
 *
 * Description: Validate error-free tty read and write
 *
 * This test repeatedly writes a test pattern from master
 * to slave and verifies the result.
 *
 * Q: Does the slave receive all the data error-free?
 *
 * Expected: PASS
 * Validate: build w/ -DVALIDATE to fail the test
 * Verify: build w/ -DVERIFY to crc-compare the received data
 *         (requires boost)
 *
 * Build: g++ -Wall -o tty_test1 tty_test1.cpp
 *
 */

#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <termios.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <stdarg.h>
#include <signal.h>
#include <limits.h>
#include <unistd.h>

#ifdef VERIFY
#include <boost/crc.hpp>

static uint32_t reference_crc32;
#endif

const unsigned loops_per_crc = 839;  /* prime */
static struct termios termios;


#define XFER_LEN 173	/* prime smaller than PATTERN_SIZE */
static char pattern[] = \
	/*          1         2         3         4         5
	 * 12345678901234567890123456789012345678901234567890 */
	  "+_)(*&^%$#@!~=-0987654321`|}{POIUYTREWQ][poiuytrew"	       /*  50 */
	  "q:LKJHGFDSA;lkjhgfdsa?><MNBVCXZ/.,mnbvcxz"		       /* +41 */
	  "\"\'\\"						       /* + 3 */
	  "+_)(*&^%$#@!~=-0987654321`|}{POIUYTREWQ][poiuytrew"	       /*  50 */
	  "q:LKJHGFDSA;lkjhgfdsa?><MNBV\n";			       /* +29 */
								       /* 173 */


#define error_exit(f, args...)				       		       \
	printf_exit("%s: " f ": %s (code: %d)\n", __func__, ##args,	       \
						  strerror(errno), errno)

#define msg_exit(f, args...) \
	printf_exit("%s: " f "\n", __func__, ##args)


static void printf_exit(const char *f, ...) {
	va_list va;

	va_start(va, f);
	vfprintf(stderr, f, va);
	va_end(va);

	exit(EXIT_FAILURE);
}

static void child(int fd) {
	unsigned char buffer[MAX_INPUT];
#ifdef VERIFY
	boost::crc_32_type crc32;
#endif
	unsigned n = 0;
	unsigned loops = 0;

	printf("Starting slave read loop...\n");

	while (1) {
		int c = read(fd, buffer, sizeof(buffer));
		if (!c)
			break;
		if (c != XFER_LEN) {
			if (c > 0)
				printf("%.*s", (int)c, buffer);
			error_exit("read slave: %zd", c);
		}
#ifdef VERIFY
		crc32.process_bytes(buffer, c);
#endif
		if (++n >= loops_per_crc) {
#ifdef VERIFY
			if (crc32() != reference_crc32) {
				errno = EIO;
				error_exit("failed crc: reference: %x != %x",
					   reference_crc32, crc32());
			}
			crc32.reset();
#endif
#ifdef PROGRESS
			printf(".");
#endif
			n = 0;
		}
		loops++;
	}

	printf("Success: %u loops x %d-byte pattern\n", loops, XFER_LEN);
}

static void master(int fd) {
	int n;
	int c = strlen(pattern);
	unsigned loops = 1000000U;

	printf("Starting master write loop...\n");

	while (loops--) {
		n = write(fd, pattern, c);
		if (n < 0)
			error_exit("master write");
		if (n != c)
			msg_exit("short write of pattern");
	}

	n = write(fd, &termios.c_cc[VEOF], 1);
	if (n < 0)
		error_exit("writing EOF");
	if (n != 1)
		msg_exit("short write of EOF");

}

int main() {
	int fd;
	char pts_name[24];
	int ptn, unlock = 0;
	pid_t child_id, id;
	int status;

	setbuf(stdout, NULL);

#ifdef VERIFY
	boost::crc_32_type reference;

	for (unsigned i = 0; i < loops_per_crc; i++)
		reference.process_bytes( pattern, XFER_LEN);
	reference_crc32 = reference();
#endif

#ifdef VALIDATE
	pattern[50] |= 0x80;
#endif

	fd = open("/dev/ptmx", O_RDWR);
	if (fd < 0)
		error_exit("opening pty master");
	if (ioctl(fd, TIOCSPTLCK, &unlock))
		error_exit("unlocking pty pair");
	if (ioctl(fd, TIOCGPTN, &ptn))
		error_exit("getting pty #");
	snprintf(pts_name, sizeof(pts_name), "/dev/pts/%d", ptn);

	if (tcgetattr(fd, &termios))
		error_exit("getting termios");
	termios.c_lflag &= ~ECHO;
	if (tcsetattr(fd, TCSANOW, &termios))
		error_exit("setting termios");

	switch (child_id = fork()) {
	case -1:
		error_exit("forking child");

	case 0: /* child */
		close(fd);
		fd = open(pts_name, O_RDWR);
		if (fd < 0)
			error_exit("opening pty slave");
		child(fd);
		break;

	default: /* parent */
		master(fd);

		id = waitpid(child_id, &status, 0);
		if (id < 0 || id != child_id)
			error_exit("waiting for child");
		if (id != child_id)
			msg_exit("waitpid returned %d, expected %d", id, child_id);
		break;
	}

	/* child and parent exit here */
	return 0;
}

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
@ 2014-09-23 18:19                                 ` Peter Hurley
  0 siblings, 0 replies; 311+ messages in thread
From: Peter Hurley @ 2014-09-23 18:19 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: H. Peter Anvin, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson, linux-alpha

On 09/14/2014 07:24 PM, One Thousand Gnomes wrote:
>> So a problem that no one has ever complained about on _any_ arch is suddenly
>> a problem on a subset of Alpha cpus, but a problem I know exists on Alpha
>> isn't important because no one's filed a bug about it?
> 
> Yes - because if you think about it that tells you that nobody is hitting
> it with the old code and it probably doesn't matter.

I don't understand this reply.

No one ever reported that the tty bitfield was being corrupted on any arch,
even though the serialization was broken because the bitfield was being
accessed with different locks. This was likely never discovered because
the corrupted tty state was unlikely to be reproduced easily.

Nevertheless, it needs fixing because it's wrong; so I did.

You pointed out my original fix was incorrect because Alpha CPUs could
still overwrite adjacent state if only bytes were used as separate memory
locations for each state. Ok.

Likewise, I'm pointing that byte-sized circular buffers will also be 
corrupted under certain circumstances (when the head comes near the tail).

Your claim that no one has complained about it (other than me) is no
different than if I had dismissed your objection to my original fix by
saying no one has complained about it.

There could be many motives for why someone with a 20-year old processor
that sees data corruption occasionally has not filed a bug about it, even
if they did bother to track down the original cause (which is pretty unlikely).

>> The only Alpha person in this discussion has come out clearly in favor
>> of dropping EV4/5 support.
> 
> That's not a statistically valid sample size btw
> 
> Plus as I pointed out (and you ignored) you are shutting out any future
> processors with this kind of oddity, and you have not audited all the
> embedded devices we support or may support in future.

I did not ignore this point.

Firstly, I thought Peter's reply was on-point:

On 09/10/2014 04:18 PM, H. Peter Anvin wrote:
> On 09/08/2014 10:52 AM, One Thousand Gnomes wrote:
>>
>> I think the whole "removing Alpha EV5" support is basically bonkers. Just
>> use set_bit in the tty layer. Alpha will continue to work as well as it
>> always has done and you won't design out support for any future processor
>> that turns out not to do byte aligned stores.
>>
> 
> I think it's pretty safe to say such a processor is unlikely to ever be
> created, for the same reason there weren't any 48-bit computers when
> 32-bit computers ran out of steam: it caused more problems than it
> solved, and Alpha pretty much proved that.  The engineering advantages
> would have to be so overwhelmingly in favor for someone to want to walk
> down that road again.

Secondly, I replied:

On 09/09/2014 07:14 AM, Peter Hurley wrote:
> I thought the overriding design principle of this kernel was to support
> what exists now, not design-in support for the future which may have
> different requirements than expected anyway.

>> The fact is that the kernel itself is much more parallel than it was
>> 15 years ago, and that trend is going to continue. Paired with the fact
>> that the Alpha is the least-parallel-friendly arch, makes improving
>> parallelism and correctness even harder within kernel subsystems; harder
>> than it has to be and harder than it should be.
>>
>> Linus has repeatedly stated that non-arch code should be as
>> arch-independent as possible
> 
> That's why many many years ago we added set_bit() and the other bit
> functions. On sane processors they are very fast. On insane ones they
> work. They understand byte tearing, they understand store ordering (which
> a simple variable does not so you've got to audit all your memory
> barriers too). In many cases they are faster than using memory barriers
> to guide the compiler because they invalidate less and allow the compiler
> more freedom.
> 
> All this started because I suggested you use set_bit and friends and for
> some reason you've decided to try and find another way to do it. We have
> the bit operations for a reason. On x86 they are very very fast, on
> uniprocessor anything they are very fast, on multiprocessor in general
> they are very fast, or you are dealing with boxes that have sanity
> problems of other kinds.

1. The question of why I chose to solve this problem without
set_bit(), et.al. I already answered:

On 09/09/2014 07:14 AM, Peter Hurley wrote:
> And much simpler how?
> 
> By turning a 4- line patch into a 400- line patch?
> 
> And what about multi-value assignments for which set_bit() doesn't work, like
> the byte-sized ->ctrl_status field? Is the expectation to submit a patch
> which fixes that for a system from 1995, but already works for everything else?
> 
> The extra complexity comes with real cost; reduced reliability for every other
> arch .

2. I'm not sure where you arrived at the conclusion that set_bit() et.al. is
fast on x86. Besides the heavy-duty instructions (lgdt, mov cr0, etc), the
bus-locked bit instructions are among the most expensive instructions in the
kernel.

I've attached a smoke test I use for lightweight validation of pty slave reads.
If you run this smoke test under perf, almost every high-value hit is a
bus-locked bit instruction. Check out canon_copy_from_read_buf() which performs
the actual copy of data from the line discipline input buffer to the __user
buffer; the clear_bit() is the performance hit. Check out cpu_idle_loop()...

Regards,
Peter Hurley

How I run perf:
$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
$ echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
$ perf record -a -g ./tty_test1
$ perf report [-k whereever your vmlinux is ]

--- >% ---
/*
 * tty_test1.cpp
 *
 * Description: Validate error-free tty read and write
 *
 * This test repeatedly writes a test pattern from master
 * to slave and verifies the result.
 *
 * Q: Does the slave receive all the data error-free?
 *
 * Expected: PASS
 * Validate: build w/ -DVALIDATE to fail the test
 * Verify: build w/ -DVERIFY to crc-compare the received data
 *         (requires boost)
 *
 * Build: g++ -Wall -o tty_test1 tty_test1.cpp
 *
 */

#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <termios.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <stdarg.h>
#include <signal.h>
#include <limits.h>
#include <unistd.h>

#ifdef VERIFY
#include <boost/crc.hpp>

static uint32_t reference_crc32;
#endif

const unsigned loops_per_crc = 839;  /* prime */
static struct termios termios;


#define XFER_LEN 173	/* prime smaller than PATTERN_SIZE */
static char pattern[] = \
	/*          1         2         3         4         5
	 * 12345678901234567890123456789012345678901234567890 */
	  "+_)(*&^%$#@!~=-0987654321`|}{POIUYTREWQ][poiuytrew"	       /*  50 */
	  "q:LKJHGFDSA;lkjhgfdsa?><MNBVCXZ/.,mnbvcxz"		       /* +41 */
	  "\"\'\\"						       /* + 3 */
	  "+_)(*&^%$#@!~=-0987654321`|}{POIUYTREWQ][poiuytrew"	       /*  50 */
	  "q:LKJHGFDSA;lkjhgfdsa?><MNBV\n";			       /* +29 */
								       /* 173 */


#define error_exit(f, args...)				       		       \
	printf_exit("%s: " f ": %s (code: %d)\n", __func__, ##args,	       \
						  strerror(errno), errno)

#define msg_exit(f, args...) \
	printf_exit("%s: " f "\n", __func__, ##args)


static void printf_exit(const char *f, ...) {
	va_list va;

	va_start(va, f);
	vfprintf(stderr, f, va);
	va_end(va);

	exit(EXIT_FAILURE);
}

static void child(int fd) {
	unsigned char buffer[MAX_INPUT];
#ifdef VERIFY
	boost::crc_32_type crc32;
#endif
	unsigned n = 0;
	unsigned loops = 0;

	printf("Starting slave read loop...\n");

	while (1) {
		int c = read(fd, buffer, sizeof(buffer));
		if (!c)
			break;
		if (c != XFER_LEN) {
			if (c > 0)
				printf("%.*s", (int)c, buffer);
			error_exit("read slave: %zd", c);
		}
#ifdef VERIFY
		crc32.process_bytes(buffer, c);
#endif
		if (++n >= loops_per_crc) {
#ifdef VERIFY
			if (crc32() != reference_crc32) {
				errno = EIO;
				error_exit("failed crc: reference: %x != %x",
					   reference_crc32, crc32());
			}
			crc32.reset();
#endif
#ifdef PROGRESS
			printf(".");
#endif
			n = 0;
		}
		loops++;
	}

	printf("Success: %u loops x %d-byte pattern\n", loops, XFER_LEN);
}

static void master(int fd) {
	int n;
	int c = strlen(pattern);
	unsigned loops = 1000000U;

	printf("Starting master write loop...\n");

	while (loops--) {
		n = write(fd, pattern, c);
		if (n < 0)
			error_exit("master write");
		if (n != c)
			msg_exit("short write of pattern");
	}

	n = write(fd, &termios.c_cc[VEOF], 1);
	if (n < 0)
		error_exit("writing EOF");
	if (n != 1)
		msg_exit("short write of EOF");

}

int main() {
	int fd;
	char pts_name[24];
	int ptn, unlock = 0;
	pid_t child_id, id;
	int status;

	setbuf(stdout, NULL);

#ifdef VERIFY
	boost::crc_32_type reference;

	for (unsigned i = 0; i < loops_per_crc; i++)
		reference.process_bytes( pattern, XFER_LEN);
	reference_crc32 = reference();
#endif

#ifdef VALIDATE
	pattern[50] |= 0x80;
#endif

	fd = open("/dev/ptmx", O_RDWR);
	if (fd < 0)
		error_exit("opening pty master");
	if (ioctl(fd, TIOCSPTLCK, &unlock))
		error_exit("unlocking pty pair");
	if (ioctl(fd, TIOCGPTN, &ptn))
		error_exit("getting pty #");
	snprintf(pts_name, sizeof(pts_name), "/dev/pts/%d", ptn);

	if (tcgetattr(fd, &termios))
		error_exit("getting termios");
	termios.c_lflag &= ~ECHO;
	if (tcsetattr(fd, TCSANOW, &termios))
		error_exit("setting termios");

	switch (child_id = fork()) {
	case -1:
		error_exit("forking child");

	case 0: /* child */
		close(fd);
		fd = open(pts_name, O_RDWR);
		if (fd < 0)
			error_exit("opening pty slave");
		child(fd);
		break;

	default: /* parent */
		master(fd);

		id = waitpid(child_id, &status, 0);
		if (id < 0)
			error_exit("waiting for child");
		if (id != child_id)
			msg_exit("waitpid returned %d, expected %d", id, child_id);
		break;
	}

	/* child and parent exit here */
	return 0;
}


^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-23 18:19                                 ` Peter Hurley
@ 2014-09-23 18:39                                   ` One Thousand Gnomes
  -1 siblings, 0 replies; 311+ messages in thread
From: One Thousand Gnomes @ 2014-09-23 18:39 UTC (permalink / raw)
  To: Peter Hurley
  Cc: H. Peter Anvin, David Laight, Jakub Jelinek, linux-arch,
	Tony Luck, linux-ia64, Oleg Nesterov, linux-kernel,
	Paul Mackerras, Paul E. McKenney, linuxppc-dev, Miroslav Franc,
	Richard Henderson, linux-alpha

> > Yes - because if you think about it that tells you that nobody is hitting
> > it with the old code and it probably doesn't matter.
> 
> I don't understand this reply.

It's a matter of priorities. There are hundreds of potential security
holes turned up by scanners, and 2,500+ filed bugs in kernel bugzilla alone.

Which matters more - fixing the bugs people actually hit, or the ones they
don't, whose fixes have a very high cost impact on other users ?

> Likewise, I'm pointing out that byte-sized circular buffers will also be
> corrupted under certain circumstances (when the head comes near the tail).

Yes. I believe the classic wording is "this problem has not been observed
in the field"
 
> 2. I'm not sure where you arrived at the conclusion that set_bit() et al. are
> fast on x86. Besides the heavy-duty instructions (lgdt, mov cr0, etc), the
> bus-locked bit instructions are among the most expensive instructions in the
> kernel.

> I've attached a smoke test I use for lightweight validation of pty slave reads.
> If you run this smoke test under perf, almost every high-value hit is a
> bus-locked bit instruction. Check out canon_copy_from_read_buf() which performs
> the actual copy of data from the line discipline input buffer to the __user
> buffer; the clear_bit() is the performance hit. Check out cpu_idle_loop()...

And then go and instrument whether it's the bit instructions or the cache
misses?


If there is a significant performance improvement from spacing the fields
carefully, and you don't introduce other ordering problems as a result
(remembering that the compiler has a lot of re-ordering options provided
the re-ordering cannot be observed), then fine - submit a patch with the
fields spaced and quote the performance data.

Alan

^ permalink raw reply	[flat|nested] 311+ messages in thread

* Re: bit fields && data tearing
  2014-09-05 16:17                                 ` Peter Hurley
@ 2014-09-25 16:12                                   ` Pavel Machek
  -1 siblings, 0 replies; 311+ messages in thread
From: Pavel Machek @ 2014-09-25 16:12 UTC (permalink / raw)
  To: Peter Hurley
  Cc: David Laight, 'paulmck@linux.vnet.ibm.com',
	Jakub Jelinek, One Thousand Gnomes, linux-arch, linux-ia64,
	Mikael Pettersson, Oleg Nesterov, linux-kernel, James Bottomley,
	Tony Luck, Paul Mackerras, H. Peter Anvin, linuxppc-dev,
	Miroslav Franc, Richard Henderson, linux-arm

On Fri 2014-09-05 12:17:16, Peter Hurley wrote:
> On 09/05/2014 08:37 AM, David Laight wrote:
> > From: Peter Hurley
> >> On 09/05/2014 04:30 AM, David Laight wrote:
> >>> I've seen gcc generate 32bit accesses for 16bit structure members on arm.
> >>> It does this because of the more limited range of the offsets for the 16bit access.
> >>> OTOH I don't know if it ever did this for writes - so it may be moot.
> >>
> >> Can you recall the particulars, like what ARM config or what code?
> >>
> >> I tried an overly-simple test to see if gcc would bump up to the word load for
> >> the 12-bit offset mode, but it stuck with register offset rather than immediate
> >> offset. [I used the compiler options for allmodconfig and a 4.8 cross-compiler.]
> >>
> >> Maybe the test doesn't generate enough register pressure on the compiler?
> > 
> > Dunno, I would have been using a much older version of the compiler.
> > It is possible that it doesn't do it any more.
> > It might only have done it for loads.
> > 
> > The compiler used to use misaligned 32bit loads for structure
> > members on large 4n+2 byte boundaries as well.
> > I'm pretty sure it doesn't do that either.
> > 
> > There have been a lot of compiler versions since I was compiling
> > anything for arm.
> 
> Yeah, it seems gcc for ARM no longer uses the larger operand size as a
> substitute for 12-bit immediate offset addressing mode, even for reads.
> 
> While this test:
> 
> struct x {
> 	short b[12];
> };
> 
> short load_b(struct x *p) {
> 	return p->b[8];
> }
> 
> generates the 8-bit immediate offset form,
> 
> short load_b(struct x *p) {
>    0:	e1d001f0 	ldrsh	r0, [r0, #16]
>    4:	e12fff1e 	bx	lr
> 
> 
> pushing the offset out past 256:
> 
> struct x {
> 	long unused[64];
> 	short b[12];
> };
> 
> short load_b(struct x *p) {
> 	return p->b[8];
> }
> 
> generates the register offset addressing mode instead of 12-bit immediate:
> 
> short load_b(struct x *p) {
>    0:	e3a03e11 	mov	r3, #272	; 0x110
>    4:	e19000f3 	ldrsh	r0, [r0, r3]
>    8:	e12fff1e 	bx	lr

If we rely on new gcc features for correctness, does the minimum supported compiler version need to be updated?
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 311+ messages in thread

end of thread, other threads:[~2014-09-25 16:21 UTC | newest]

Thread overview: 311+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-12 18:13 bit fields && data tearing Oleg Nesterov
2014-07-12 20:51 ` Oleg Nesterov
2014-07-12 23:34   ` Benjamin Herrenschmidt
2014-07-13 12:29     ` Oleg Nesterov
2014-07-13 13:15     ` Peter Hurley
2014-07-13 22:25       ` Benjamin Herrenschmidt
2014-07-15 13:54         ` Peter Hurley
2014-07-15 15:02           ` Richard Henderson
2014-09-03 22:51 ` Peter Hurley
2014-09-03 23:11   ` Benjamin Herrenschmidt
2014-09-04  8:43     ` David Laight
2014-09-04  9:52       ` Benjamin Herrenschmidt
2014-09-04 22:14         ` H. Peter Anvin
2014-09-05  0:59           ` Peter Hurley
2014-09-05  2:08             ` H. Peter Anvin
2014-09-05 15:31               ` Peter Hurley
2014-09-05 15:41                 ` H. Peter Anvin
2014-09-08 17:52                   ` One Thousand Gnomes
2014-09-08 17:59                     ` H. Peter Anvin
2014-09-08 19:17                       ` One Thousand Gnomes
2014-09-09 11:18                         ` Peter Hurley
2014-09-08 22:47                       ` Peter Hurley
2014-09-09  1:59                         ` Paul E. McKenney
2014-09-09 11:14                         ` Peter Hurley
2014-09-11 10:04                         ` One Thousand Gnomes
2014-09-11 16:16                           ` Paul E. McKenney
2014-09-11 20:01                           ` Peter Hurley
2014-09-14 23:24                             ` One Thousand Gnomes
2014-09-22 19:51                               ` Paul E. McKenney
2014-09-23 18:19                               ` Peter Hurley
2014-09-23 18:39                                 ` One Thousand Gnomes
2014-09-08 18:13                     ` James Bottomley
2014-09-10 20:18                     ` H. Peter Anvin
2014-09-10 21:10                       ` Rob Landley
2014-09-05  2:08             ` H. Peter Anvin
2014-09-05  8:16               ` Michael Cree
2014-09-05 18:09                 ` Paul E. McKenney
2014-09-05 18:31                   ` Paul E. McKenney
2014-09-05 19:52                     ` Peter Zijlstra
2014-09-05 20:01                       ` Peter Hurley
2014-09-05 20:12                         ` Peter Zijlstra
2014-09-05 20:15                           ` H. Peter Anvin
2014-09-05 20:19                         ` Paul E. McKenney
2014-09-05 18:50                   ` Peter Hurley
2014-09-05 19:05                     ` Paul E. McKenney
2014-09-05 19:24                       ` Peter Hurley
2014-09-05 20:09                         ` Paul E. McKenney
2014-09-05 19:38                       ` Marc Gauthier
2014-09-05 20:14                         ` Peter Hurley
2014-09-05 20:34                           ` H. Peter Anvin
2014-09-05 20:42                             ` Michael Cree
2014-09-05 20:43                             ` Paul E. McKenney
2014-09-05 20:48                               ` Thomas Gleixner
2014-09-05 21:05                                 ` Paul E. McKenney
2014-09-05 20:39                             ` Michael Cree
2014-09-05 21:12                             ` Peter Hurley
2014-09-05 21:27                               ` Michael Cree
2014-09-05 20:42                           ` Paul E. McKenney
2014-09-04  8:57     ` Mikael Pettersson
2014-09-04  9:09       ` Jakub Jelinek
2014-09-04 12:24         ` Peter Hurley
2014-09-04 12:29           ` Jakub Jelinek
2014-09-04 16:50           ` One Thousand Gnomes
2014-09-04 19:42             ` Peter Hurley
2014-09-04 22:16               ` H. Peter Anvin
2014-09-05  0:17                 ` Paul E. McKenney
2014-09-05  1:57                   ` Peter Hurley
2014-09-05  2:11                   ` James Bottomley
2014-09-05  2:47                     ` Peter Hurley
2014-09-05  4:06                       ` Paul E. McKenney
2014-09-05  8:30                         ` David Laight
2014-09-05 12:31                           ` Peter Hurley
2014-09-05 12:37                             ` David Laight
2014-09-05 16:17                               ` Peter Hurley
2014-09-25 16:12                                 ` Pavel Machek
2014-09-07  5:07                         ` James Bottomley
2014-09-07 16:21                           ` Paul E. McKenney
2014-09-07 19:04                             ` James Bottomley
2014-09-07 20:41                               ` Peter Hurley
2014-09-08  5:50                                 ` James Bottomley
2014-09-08 20:45                                   ` Chris Metcalf
2014-09-08 22:43                                     ` James Bottomley
2014-09-09  2:27                                       ` H. Peter Anvin
2014-09-09  8:11                                         ` Arnd Bergmann
2014-09-09  8:11                                           ` Arnd Bergmann
2014-09-09  8:11                                           ` Arnd Bergmann
2014-09-08 23:30                                   ` Peter Hurley
2014-09-08 23:30                                     ` Peter Hurley
2014-09-08 23:30                                     ` Peter Hurley
2014-09-09  2:56                                     ` James Bottomley
2014-09-09  2:56                                       ` James Bottomley
2014-09-09  2:56                                       ` James Bottomley
2014-09-09  3:20                                       ` H. Peter Anvin
2014-09-09  3:20                                         ` H. Peter Anvin
2014-09-09  3:20                                         ` H. Peter Anvin
2014-09-09  4:30                                       ` H. Peter Anvin
2014-09-09  4:30                                         ` H. Peter Anvin
2014-09-09  4:30                                         ` H. Peter Anvin
2014-09-09  4:30                                         ` H. Peter Anvin
2014-09-09 10:40                                       ` Peter Hurley
2014-09-09 10:40                                         ` Peter Hurley
2014-09-09 10:40                                         ` Peter Hurley
2014-09-10 21:48                                         ` James Bottomley
2014-09-10 21:48                                           ` James Bottomley
2014-09-10 21:48                                           ` James Bottomley
2014-09-10 23:50                                           ` Peter Hurley
2014-09-10 23:50                                             ` Peter Hurley
2014-09-10 23:50                                             ` Peter Hurley
2014-09-11 10:23                                           ` Will Deacon
2014-09-11 10:23                                             ` Will Deacon
2014-09-11 10:23                                             ` Will Deacon
2014-09-07 23:00                               ` Paul E. McKenney
2014-09-07 23:00                                 ` Paul E. McKenney
2014-09-07 23:00                                 ` Paul E. McKenney
2014-09-07 23:17                                 ` H. Peter Anvin
2014-09-07 23:17                                   ` H. Peter Anvin
2014-09-07 23:17                                   ` H. Peter Anvin
2014-09-07 23:36                                   ` Paul E. McKenney
2014-09-07 23:36                                     ` Paul E. McKenney
2014-09-07 23:36                                     ` Paul E. McKenney
2014-09-07 23:39                                     ` H. Peter Anvin
2014-09-07 23:39                                       ` H. Peter Anvin
2014-09-07 23:39                                       ` H. Peter Anvin
2014-09-08  5:56                                       ` James Bottomley
2014-09-08  5:56                                         ` James Bottomley
2014-09-08  5:56                                         ` James Bottomley
2014-09-08 18:12                                         ` H. Peter Anvin
2014-09-08 18:12                                           ` H. Peter Anvin
2014-09-08 18:12                                           ` H. Peter Anvin
2014-09-08 19:09                                           ` James Bottomley
2014-09-08 19:09                                             ` James Bottomley
2014-09-08 19:09                                             ` James Bottomley
2014-09-08 19:12                                             ` H. Peter Anvin
2014-09-08 19:12                                               ` H. Peter Anvin
2014-09-08 19:12                                               ` H. Peter Anvin
2014-09-08 19:12                                             ` H. Peter Anvin
2014-09-08 19:12                                               ` H. Peter Anvin
2014-09-08 19:12                                               ` H. Peter Anvin
2014-09-08 19:12                                               ` H. Peter Anvin
2014-09-08 22:39                                               ` James Bottomley
2014-09-08 22:39                                                 ` James Bottomley
2014-09-08 22:39                                                 ` James Bottomley
2014-09-09  2:30                                                 ` H. Peter Anvin
2014-09-09  2:30                                                   ` H. Peter Anvin
2014-09-09  2:30                                                   ` H. Peter Anvin