* local_add_return
@ 2008-12-15 13:47 Steven Rostedt
  2008-12-16  6:33 ` local_add_return Rusty Russell
  0 siblings, 1 reply; 33+ messages in thread
From: Steven Rostedt @ 2008-12-15 13:47 UTC
  To: Rusty Russell; +Cc: Andrew Morton, LKML

Hi Rusty,

I'm trying to make the ring buffers lockless and reentrant, and it is
slowly getting there. The local_add_return is used to reserve a part of
the ring buffer even when an interrupt can come in and reserve part of
that same ring buffer. The atomic add here only has to be atomic with
respect to interrupts.
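
Roughly, the reservation works like this (a minimal sketch only -- the
rb_page layout, names and overflow policy here are made up, not the
real code):

	/*
	 * Because local_add_return() is atomic with respect to local
	 * interrupts, an interrupt nesting inside this function gets
	 * its own disjoint [tail - len, tail) slice of the same page.
	 */
	struct rb_page {
		local_t	write;		/* next free offset in the page */
		char	data[4096];
	};

	static void *rb_reserve(struct rb_page *page, unsigned long len)
	{
		unsigned long tail = local_add_return(len, &page->write);

		if (tail > sizeof(page->data))	/* out of space */
			return NULL;
		return page->data + tail - len;
	}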

On Intel, there is no reason to use a LOCK increment, since the ring
buffers are per cpu; a simple asm inc would work. I was thinking that is
what local_add_return gives me. I could replace the local_add_returns with
atomic_add_return, but that just seems to add overhead for archs that do
not need it.

Shouldn't local_add_return be a way for archs that can increment a memory 
location atomically against interrupts to use that infrastructure? It can 
simply fall back to atomic_add_return for those archs that do not have 
a lesser equivalent of atomic_add_return.
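
For reference, the existing generic fallback already looks like that
(paraphrasing include/asm-generic/local.h; details may vary by tree):

	/* The generic local_t just wraps atomic_long_t, so every local
	 * op falls back to the corresponding full atomic op. */
	typedef struct {
		atomic_long_t a;
	} local_t;

	#define local_add_return(i, l)	atomic_long_add_return((i), (&(l)->a))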

-- Steve




* Re: local_add_return
  2008-12-15 13:47 local_add_return Steven Rostedt
@ 2008-12-16  6:33 ` Rusty Russell
  2008-12-16  6:57   ` local_add_return David Miller
                     ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Rusty Russell @ 2008-12-16  6:33 UTC
  To: Steven Rostedt
  Cc: Andrew Morton, LKML, David Miller, Mathieu Desnoyers,
	Paul Mackerras, Benjamin Herrenschmidt

On Tuesday 16 December 2008 00:17:35 Steven Rostedt wrote:
> Shouldn't local_add_return be a way for archs that can increment a memory 
> location atomically against interrupts to use that infrastructure? It can 
> simply fall back to atomic_add_return for those archs that do not have 
> a lesser equivalent of atomic_add_return.

local_t was originally introduced for (but actually never used by) the
SNMP counters.  They use two counters to avoid atomics, but as the ancient
comment says:

/* 
 * FIXME: On x86 and some other CPUs the split into user and softirq parts
 * is not needed because addl $1,memory is atomic against interrupts (but 
 * atomic_inc would be overkill because of the lock cycles). Wants new 
 * nonlocked_atomic_inc() primitives -AK
 */ 
#define DEFINE_SNMP_STAT(type, name)	\
	__typeof__(type) *name[2]
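
The split is consumed along these lines (paraphrasing the net/snmp.h
macros of that era from memory; details may differ):

	/* Softirq context bumps one copy, process context the other, so
	 * neither needs atomics; readers fold the two halves together. */
	#define SNMP_INC_STATS_BH(mib, field)	\
		(per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field]++)
	#define SNMP_INC_STATS_USER(mib, field)	\
		(per_cpu_ptr(mib[1], raw_smp_processor_id())->mibs[field]++)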

Then last year Mathieu sent (and Andrew accepted) a "rich set of atomic
operations", including excellent documentation "local_ops.txt".  Except
he thought they were atomics, so treated them accordingly.  Also, there
were no users (you're now the only one).

But if these new operations are to become the norm, it changes how archs
should implement local_t.  e.g. trivalue becomes less attractive, atomic_long
more.  x86 has its own implementation so doesn't have these issues.

Now, I posted a benchmark patch before for archs to test.  I'm interested
in Sparc64.  Does any arch win from using multiple counters?  PowerPC has
soft interrupt disable, so that solution wins over atomic_long_t for them.
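
For comparison, the irqsave/irqrestore variant measured by the benchmark
is essentially this (a sketch, assuming the generic local_t layout that
wraps an atomic_long_t):

	static inline long local_add_return(long i, local_t *l)
	{
		unsigned long flags;
		long ret;

		local_irq_save(flags);	/* a cheap soft-disable on PowerPC */
		ret = atomic_long_read(&l->a) + i;
		atomic_long_set(&l->a, ret);
		local_irq_restore(flags);
		return ret;
	}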

Cheers,
Rusty.


* Re: local_add_return
  2008-12-16  6:33 ` local_add_return Rusty Russell
@ 2008-12-16  6:57   ` David Miller
  2008-12-16  7:13   ` local_add_return David Miller
  2008-12-16 16:25   ` local_add_return Mathieu Desnoyers
  2 siblings, 0 replies; 33+ messages in thread
From: David Miller @ 2008-12-16  6:57 UTC
  To: rusty; +Cc: rostedt, akpm, linux-kernel, mathieu.desnoyers, paulus, benh

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Tue, 16 Dec 2008 17:03:00 +1030

> Now, I posted a benchmark patch before for archs to test.  I'm interested
> in Sparc64.  Does any arch win from using multiple counters?  PowerPC has
> soft interrupt disable, so that solution wins over atomic_long_t for them.

I'll get you some sparc64 numbers.


* Re: local_add_return
  2008-12-16  6:33 ` local_add_return Rusty Russell
  2008-12-16  6:57   ` local_add_return David Miller
@ 2008-12-16  7:13   ` David Miller
  2008-12-16 22:50       ` local_add_return Rusty Russell
  2008-12-16 16:25   ` local_add_return Mathieu Desnoyers
  2 siblings, 1 reply; 33+ messages in thread
From: David Miller @ 2008-12-16  7:13 UTC
  To: rusty; +Cc: rostedt, akpm, linux-kernel, mathieu.desnoyers, paulus, benh

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Tue, 16 Dec 2008 17:03:00 +1030

> Now, I posted a benchmark patch before for archs to test.  I'm interested
> in Sparc64.

Here ya go:

UltraSPARC-IIIi:

	atomic_long:	local_inc=8180000021/243
			local_add=7469999999/222
			cpu_local_inc=1260000001/37
			local_read=5690000000/169
			(total was 1326071152640)

	irqsave/restore:
			local_inc=6909999997/205
			local_add=6899999996/205
			cpu_local_inc=860000000/25
			local_read=5690000005/169
			(total was 1326071152640)

	trivalue:
			local_inc=6490000000/193
			local_add=6500000000/193
			cpu_local_inc=370000002/11
			local_read=7419999998/221
			(total was 1326071152640)

	local_t:
			local_inc=7440000000/221
			local_add=7440000000/221
			cpu_local_inc=1260000004/37
			local_read=5689999997/169
			(total was 1326071152640, warm_total 34443624448)

Niagara-2:

	atomic_long:
		local_inc=6960000013/207
		local_add=6940000002/206
		cpu_local_inc=2440000002/72
		local_read=5390000000/160
		(total was 1326071152640)

	irqsave/restore:
		local_inc=7660000002/228
		local_add=7669999999/228
		cpu_local_inc=2650000004/78
		local_read=5379999999/160
		(total was 1326071152640)

	trivalue:
		local_inc=5789999998/172
		local_add=5789999995/172
		cpu_local_inc=689999994/20
		local_read=7470000000/222
		(total was 1326071152640)

	local_t:
		local_inc=6940000000/206
		local_add=6950000000/207
		cpu_local_inc=2460000000/73
		local_read=5390000004/160
		(total was 1326071152640, warm_total 34443624448)



* Re: local_add_return
  2008-12-16  6:33 ` local_add_return Rusty Russell
  2008-12-16  6:57   ` local_add_return David Miller
  2008-12-16  7:13   ` local_add_return David Miller
@ 2008-12-16 16:25   ` Mathieu Desnoyers
  2008-12-17 11:23     ` local_add_return Rusty Russell
  2 siblings, 1 reply; 33+ messages in thread
From: Mathieu Desnoyers @ 2008-12-16 16:25 UTC
  To: Rusty Russell
  Cc: Steven Rostedt, Andrew Morton, LKML, David Miller,
	Paul Mackerras, Benjamin Herrenschmidt, Christoph Lameter,
	Paul E. McKenney, Martin Bligh

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Tuesday 16 December 2008 00:17:35 Steven Rostedt wrote:
> > Shouldn't local_add_return be a way for archs that can increment a memory 
> > location atomically against interrupts to use that infrastructure? It can 
> > simply fall back to atomic_add_return for those archs that do not have 
> > a lesser equivalent of atomic_add_return.
> 
> local_t was originally introduced for (but actually never used by) the
> SNMP counters.  They use two counters to avoid atomics, but as the ancient
> comment says:
> 
> /* 
>  * FIXME: On x86 and some other CPUs the split into user and softirq parts
>  * is not needed because addl $1,memory is atomic against interrupts (but 
>  * atomic_inc would be overkill because of the lock cycles). Wants new 
>  * nonlocked_atomic_inc() primitives -AK
>  */ 
> #define DEFINE_SNMP_STAT(type, name)	\
> 	__typeof__(type) *name[2]
> 
> Then last year Mathieu sent (and Andrew accepted) a "rich set of atomic
> operations", including excellent documentation "local_ops.txt".  Except
> he thought they were atomics, so treated them accordingly.  Also, there
> were no users (you're now the only one).
> 
> But if these new operations are to become the norm, it changes how archs
> should implement local_t.  e.g. trivalue becomes less attractive, atomic_long
> more.  x86 has its own implementation so doesn't have these issues.
> 
> Now, I posted a benchmark patch before for archs to test.  I'm interested
> in Sparc64.  Does any arch win from using multiple counters?  PowerPC has
> soft interrupt disable, so that solution wins over atomic_long_t for them.
>

Hi Rusty,

I'd like to comment on your test case found at
http://groups.google.com/group/linux.kernel/msg/98c512fceda26351

Specifically on this comment:

+/* There are three obvious ways to implement local_t on an arch which
+ * can't do single-instruction inc/dec etc.
+ * 1) atomic_long
+ * 2) irq_save/irq_restore
+ * 3) multiple counters.
+ *
+ * This does a very rough benchmark on each one.
+ */ 

Option 3) is not workable for tracers, because it is not safe against
some exceptions (e.g. some hardware errors) or NMIs. Also, local_t
operations must have preemption disabled before playing with per-cpu data,
which I don't see in your test. This has to be taken into account in the
runtime cost. The "multiple counters" option should also disable
preemption, because a thread being moved to another CPU could corrupt
some other thread's data when being rescheduled.

Only two alternatives do not have this preempt_disable() requirement:
atomic_long_t, and the CPU_OPS work done by Christoph Lameter, which uses
segments to address the per-cpu data; that effectively removes the need
for disabling preemption around local_t operations because the CPU ID
is encoded in a cpu register.

Otherwise, you can be moved to a different CPU between the moment you
read the CPU ID and the moment you access the local data, which can lead
to corruption with the local_t and multiple-counters options.
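
In other words, correct usage looks like this (an illustrative sketch;
the variable name is made up):

	DEFINE_PER_CPU(local_t, my_counter);

	static void safe_local_inc(void)
	{
		preempt_disable();	/* pin the task to this CPU... */
		local_inc(&__get_cpu_var(my_counter));
		preempt_enable();	/* ...until the update is done */
	}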

Cheers,

Mathieu

> Cheers,
> Rusty.

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: local_add_return
  2008-12-16  7:13   ` local_add_return David Miller
@ 2008-12-16 22:50       ` Rusty Russell
  0 siblings, 0 replies; 33+ messages in thread
From: Rusty Russell @ 2008-12-16 22:38 UTC
  To: David Miller
  Cc: rostedt, akpm, linux-kernel, mathieu.desnoyers, paulus, benh,
	linux-ia64, linux-s390

On Tuesday 16 December 2008 17:43:14 David Miller wrote:
> Here ya go:

Very interesting.  There's a little noise there (that first local_inc of 243
is wrong), but the picture is clear: trivalue is the best implementation for
sparc64.

Note: trivalue uses 3 values, so instead of hitting random values across 8MB
it's across 24MB, and despite the resulting cache damage it's 15% faster.  The
cpu_local_inc test is a single value, so no cache effects: it shows trivalue
to be 3 to 3.5 times faster in the cache-hot case.

This sucks, because it really does mean that there's no one-size-fits-all
implementation of local_t.  There's also no platform yet where atomic_long_t
is the right choice; and that's the default!

Any chance of an IA64 or s390 run?  You can normalize if you like, since
it's only to compare the different approaches.

Cheers,
Rusty.

Benchmarks for local_t variants

(This patch also fixes the x86 cpu_local_* macros, which are obviously
unused).

I chose a large array (1M longs) for the inc/add/add_return tests so
the trivalue case would show some cache pressure.

The cpu_local_inc case is always cache-hot, so it's not comparable to
the others.

Time in ns per iteration (bracketed values are with CONFIG_PREEMPT=y):

		inc	add	add_return	cpu_local_inc	read
x86-32: 2.13 GHz Core Duo 2
atomic_long	118	118	115		17		17
irqsave/rest	77	78	77		23		16
trivalue	45	45	127		3(6)		21
local_t		36	36	36		1(5)		17

x86-64: 2.6 GHz Dual-Core AMD Opteron 2218
atomic_long	55	60	-		6		19
irqsave/rest	54	54	-		11		19
trivalue	47	47	-		5		28
local_t		47	46	-		1		19

PPC64: 2.7 GHz PPC970MP [normalized]
atomic_long	18	18	20		3(4)		8
irqsave/rest	10(4)	5(4)	4		8(9)		10(9)
trivalue	9	9	2		1(3)		10
local_t		18	18	18		3(4)		8

Sparc64: UltraSPARC-IIIi
atomic_long	243	222	-		37		169
irqsave/rest	205	205	-		25		169
trivalue	193	193	-		11		221
local_t		221	221	-		37		169

Sparc64: Niagara-2
atomic_long	207	206	-		72		160
irqsave/rest	228	228	-		78		160
trivalue:	172	172	-		20		222
local_t		206	207	-		73		160

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 arch/x86/include/asm/local.h |   20 +--
 init/main.c                  |  223 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 233 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/local.h b/arch/x86/include/asm/local.h
--- a/arch/x86/include/asm/local.h
+++ b/arch/x86/include/asm/local.h
@@ -220,16 +220,16 @@ static inline long local_sub_return(long
 	preempt_enable();		\
 })					\
 
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var((l))))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var((l)), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var((l))))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var((l))))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var((l))))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var((l))))
+#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
+#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
+#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
+#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
+#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
+#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
 
-#define __cpu_local_inc(l)	cpu_local_inc((l))
-#define __cpu_local_dec(l)	cpu_local_dec((l))
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
+#define __cpu_local_inc(l)	cpu_local_inc(l)
+#define __cpu_local_dec(l)	cpu_local_dec(l)
+#define __cpu_local_add(i, l)	cpu_local_add((i), l)
+#define __cpu_local_sub(i, l)	cpu_local_sub((i), l)
 
 #endif /* _ASM_X86_LOCAL_H */
diff --git a/init/main.c b/init/main.c
--- a/init/main.c
+++ b/init/main.c
@@ -534,6 +534,225 @@ void __init __weak thread_info_cache_ini
 {
 }
 
+/* There are three obvious ways to implement local_t on an arch which
+ * can't do single-instruction inc/dec etc.
+ * 1) atomic_long
+ * 2) irq_save/irq_restore
+ * 3) multiple counters.
+ *
+ * This does a very rough benchmark on each one.
+ */
+struct local1 {
+	atomic_long_t v;
+};
+struct local2 {
+	unsigned long v;
+};
+struct local3 {
+	unsigned long v[3];
+};
+
+/* Enough to put some pressure on the caches. */
+#define NUM_LOCAL_TEST (1024*1024)
+#define NUM_LOCAL_RUNS (NUM_LOCAL_TEST*32)
+/* This will make it jump around looking random */
+#define STRIDE 514001
+
+static void *test_local_variants_mem;
+
+static void init_test_local_variants(void)
+{
+	unsigned long size;
+	size = max(sizeof(struct local1),
+		   max(sizeof(struct local2),
+		       max(sizeof(struct local3), sizeof(local_t))))
+		* NUM_LOCAL_TEST;
+	/* Assume this works in early boot. */
+	test_local_variants_mem = alloc_bootmem_nopanic(size);
+
+	if (!test_local_variants_mem) {
+		printk("test_local_variants: failed to allocate %lu bytes\n",
+		       size);
+		return;
+	}
+}
+
+static void print_result(const char *str,
+			 struct timespec start, struct timespec end)
+{
+	s64 diff;
+
+	diff = ktime_to_ns(ktime_sub(timespec_to_ktime(end), timespec_to_ktime(start)));
+	printk("%s=%lli/%lli ",
+	       str, diff, diff/NUM_LOCAL_RUNS);
+}
+
+static unsigned int warm_local_test_cache(const void *mem, size_t len)
+{
+	unsigned int i, total = 0;
+	for (i = 0; i < len; i++)
+		total += ((char *)mem)[i];
+	return total;
+}
+
+#define TEST_LOOP(expr)				\
+	n = 0;					\
+	getnstimeofday(&start);			\
+	for (i = 0; i < NUM_LOCAL_RUNS; i++) {	\
+		expr;				\
+		n += STRIDE;			\
+		n %= NUM_LOCAL_TEST;		\
+	}					\
+	getnstimeofday(&end);
+
+/* This doesn't test cache effects at all */
+#define NUM_PERCPU_VARS		16
+DEFINE_PER_CPU(struct local1[NUM_PERCPU_VARS], local1_test);
+DEFINE_PER_CPU(struct local2[NUM_PERCPU_VARS], local2_test);
+DEFINE_PER_CPU(struct local3[NUM_PERCPU_VARS], local3_test);
+DEFINE_PER_CPU(local_t[NUM_PERCPU_VARS], local4_test);
+
+static void test_local_variants(void)
+{
+	struct timespec start, end;
+	unsigned int i, n;
+	unsigned long total, warm_total = 0;
+	struct local1 *l1;
+	struct local2 *l2;
+	struct local3 *l3;
+	local_t *l4;
+
+	if (!test_local_variants_mem)
+		return;
+
+	printk("Running local_t variant benchmarks\n");
+	l1 = test_local_variants_mem;
+	l2 = test_local_variants_mem;
+	l3 = test_local_variants_mem;
+	l4 = test_local_variants_mem;
+
+	printk("atomic_long: ");
+	memset(l1, 0, sizeof(*l1)*NUM_LOCAL_TEST);
+	TEST_LOOP(atomic_long_inc(&l1[n].v));
+	print_result("local_inc", start, end);
+
+	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+	TEST_LOOP(atomic_long_add(1234, &l1[n].v));
+	print_result("local_add", start, end);
+
+	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+	TEST_LOOP(atomic_long_inc(&__get_cpu_var(local1_test)[n%NUM_PERCPU_VARS].v));
+	print_result("cpu_local_inc", start, end);
+
+	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+	total = 0;
+	TEST_LOOP(total += atomic_long_read(&l1[n].v));
+	print_result("local_read", start, end);
+
+	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+	TEST_LOOP(total += atomic_long_add_return(7, &l1[n].v));
+	print_result("local_add_return", start, end);
+
+	printk("(total was %lu)\n", total);
+
+	printk("irqsave/restore: ");
+	memset(l2, 0, sizeof(*l2)*NUM_LOCAL_TEST);
+	TEST_LOOP(unsigned long flags;
+		  local_irq_save(flags);
+		  l2[n].v++;
+		  local_irq_restore(flags));
+	print_result("local_inc", start, end);
+
+	warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
+	TEST_LOOP(unsigned long flags;
+		  local_irq_save(flags);
+		  l2[n].v += 1234;
+		  local_irq_restore(flags));
+	print_result("local_add", start, end);
+
+	warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
+	TEST_LOOP(unsigned long flags;
+		  local_irq_save(flags);
+		  __get_cpu_var(local2_test)[n%NUM_PERCPU_VARS].v++;
+		  local_irq_restore(flags));
+	print_result("cpu_local_inc", start, end);
+
+	warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
+	total = 0;
+	TEST_LOOP(total += l2[n].v);
+	print_result("local_read", start, end);
+
+	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+	TEST_LOOP(unsigned long flags;
+		  local_irq_save(flags);
+		  l2[n].v += 7;
+		  total += l2[n].v;
+		  local_irq_restore(flags));
+	print_result("local_add_return", start, end);
+	printk("(total was %lu)\n", total);
+
+	printk("trivalue: ");
+	memset(l3, 0, sizeof(*l3)*NUM_LOCAL_TEST);
+	TEST_LOOP(unsigned int idx
+			= !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
+			!(preempt_count() & HARDIRQ_MASK);
+		  l3[n].v[idx]++);
+	print_result("local_inc", start, end);
+
+	warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
+	TEST_LOOP(unsigned int idx
+			= !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
+			!(preempt_count() & HARDIRQ_MASK);
+		  l3[n].v[idx] += 1234);
+	print_result("local_add", start, end);
+
+	warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
+	TEST_LOOP(unsigned int idx
+			= !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
+			!(preempt_count() & HARDIRQ_MASK);
+		  get_cpu_var(local3_test)[n%NUM_PERCPU_VARS].v[idx]++;
+		  put_cpu_var());
+	print_result("cpu_local_inc", start, end);
+
+	warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
+	total = 0;
+	TEST_LOOP(total += l3[n].v[0] + l3[n].v[1] + l3[n].v[2]);
+	print_result("local_read", start, end);
+
+	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+	TEST_LOOP(unsigned long flags;
+		  local_irq_save(flags);
+		  l3[n].v[0] += 7;
+		  total += l3[n].v[0] + l3[n].v[1] + l3[n].v[2];
+		  local_irq_restore(flags));
+	print_result("local_add_return", start, end);
+
+	printk("(total was %lu)\n", total);
+
+	printk("local_t: ");
+	memset(l4, 0, sizeof(*l4)*NUM_LOCAL_TEST);
+	TEST_LOOP(local_inc(&l4[n]));
+	print_result("local_inc", start, end);
+
+	warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
+	TEST_LOOP(local_add(1234, &l4[n]));
+	print_result("local_add", start, end);
+
+	warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
+	TEST_LOOP(cpu_local_inc(local4_test[n%NUM_PERCPU_VARS]));
+	print_result("cpu_local_inc", start, end);
+
+	warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
+	total = 0;
+	TEST_LOOP(total += local_read(&l4[n]));
+	print_result("local_read", start, end);
+
+	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+	TEST_LOOP(total += local_add_return(7, &l1[n].v));
+	print_result("local_add_return", start, end);
+	printk("(total was %lu, warm_total %lu)\n", total, warm_total);
+}
+
 asmlinkage void __init start_kernel(void)
 {
 	char * command_line;
@@ -630,6 +849,8 @@ asmlinkage void __init start_kernel(void
 	 */
 	locking_selftest();
 
+	init_test_local_variants();
+
 #ifdef CONFIG_BLK_DEV_INITRD
 	if (initrd_start && !initrd_below_start_ok &&
 	    page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
@@ -687,6 +908,8 @@ asmlinkage void __init start_kernel(void
 	acpi_early_init(); /* before LAPIC and SMP init */
 
 	ftrace_init();
+
+	test_local_variants();
 
 	/* Do the rest non-__init'ed, we're now alive */
 	rest_init();


* RE: local_add_return
  2008-12-16 22:50       ` local_add_return Rusty Russell
@ 2008-12-16 23:25         ` Luck, Tony
  0 siblings, 0 replies; 33+ messages in thread
From: Luck, Tony @ 2008-12-16 23:25 UTC
  To: Rusty Russell, David Miller
  Cc: rostedt, akpm, linux-kernel, mathieu.desnoyers, paulus, benh,
	linux-ia64, linux-s390

> Any chance of an IA64 or s390 run?  You can normalize if you like, since
> it's only to compare the different approaches.

-EDOESNOTCOMPILE

  CC      init/main.o
init/main.c: In function 'test_local_variants':
init/main.c:756: error: 'atomic_long_t' has no member named 'a'
make[1]: *** [init/main.o] Error 1

-Tony


* Re: local_add_return
  2008-12-16 22:50       ` local_add_return Rusty Russell
@ 2008-12-16 23:43         ` Heiko Carstens
  0 siblings, 0 replies; 33+ messages in thread
From: Heiko Carstens @ 2008-12-16 23:43 UTC
  To: Rusty Russell
  Cc: David Miller, rostedt, akpm, linux-kernel, mathieu.desnoyers,
	paulus, benh, linux-ia64, linux-s390

On Wed, Dec 17, 2008 at 09:08:04AM +1030, Rusty Russell wrote:
> On Tuesday 16 December 2008 17:43:14 David Miller wrote:
> > Here ya go:
> 
> Very interesting.  There's a little noise there (that first local_inc of 243
> is wrong), but the picture is clear: trivalue is the best implementation for
> sparc64.
> 
> Note: trivalue uses 3 values, so instead of hitting random values across 8MB
> it's across 24MB, and despite the resulting cache damage it's 15% faster.  The
> cpu_local_inc test is a single value, so no cache effects: it shows trivalue
> to be 3 to 3.5 times faster in the cache-hot case.
> 
> This sucks, because it really does mean that there's no one-size-fits-all
> implementation of local_t.  There's also no platform yet where atomic_long_t
> is the right choice; and that's the default!
> 
> Any chance of an IA64 or s390 run?  You can normalize if you like, since
> it's only to compare the different approaches.

atomic_long_t seems to be the right choice on s390. IRQ disable/enable is
expensive, but the compare and swap instruction is cheap. I just gave it
a quick shot, but please note that there were two hypervisors running below
my system (add_return is missing since I used your first patch):

		inc	add	add_return	cpu_local_inc	read
atomic_long	19	18	-		3		17
irqsave/rest	57	58	-		39		22
trivalue	43	43	-		4		45
local_t		18	20	-		2		16
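
Which makes sense: with a cheap compare-and-swap, atomic_long_add_return()
can be a short retry loop with no interrupt disabling at all, along these
lines (illustrative only, not the actual s390 code):

	static inline long cas_add_return(long i, atomic_long_t *v)
	{
		long old, new;

		do {
			old = atomic_long_read(v);
			new = old + i;
		} while (atomic_long_cmpxchg(v, old, new) != old);

		return new;
	}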


* Re: local_add_return
  2008-12-16 22:50       ` local_add_return Rusty Russell
@ 2008-12-16 23:59         ` Eric Dumazet
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Dumazet @ 2008-12-16 23:59 UTC
  To: Rusty Russell
  Cc: David Miller, rostedt, akpm, linux-kernel, mathieu.desnoyers,
	paulus, benh, linux-ia64, linux-s390

Rusty Russell wrote:
> On Tuesday 16 December 2008 17:43:14 David Miller wrote:
>> Here ya go:
> 
> Very interesting.  There's a little noise there (that first local_inc of 243
> is wrong), but the picture is clear: trivalue is the best implementation for
> sparc64.
> 
> Note: trivalue uses 3 values, so instead of hitting random values across 8MB
> it's across 24MB, and despite the resulting cache damage it's 15% faster.  The
> cpu_local_inc test is a single value, so no cache effects: it shows trivalue
> to be 3 to 3.5 times faster in the cache-hot case.
> 
> This sucks, because it really does mean that there's no one-size-fits-all
> implementation of local_t.  There's also no platform yet where atomic_long_t
> is the right choice; and that's the default!
> 
> Any chance of an IA64 or s390 run?  You can normalize if you like, since
> it's only to compare the different approaches.
> 
> Cheers,
> Rusty.
> 
> Benchmarks for local_t variants
> 
> (This patch also fixes the x86 cpu_local_* macros, which are obviously
> unused).
> 
> I chose a large array (1M longs) for the inc/add/add_return tests so
> the trivalue case would show some cache pressure.
> 
> The cpu_local_inc case is always cache-hot, so it's not comparable to
> the others.

It would be good to differentiate the results by whether the data is already in cache or not...

> 
> Time in ns per iteration (bracketed values are with CONFIG_PREEMPT=y):
> 
> 		inc	add	add_return	cpu_local_inc	read
> x86-32: 2.13 GHz Core Duo 2
> atomic_long	118	118	115		17		17

It's really strange that atomic_long performs so badly here.
LOCK + data not in cache -> really really bad...

> irqsave/rest	77	78	77		23		16
> trivalue	45	45	127		3(6)		21
> local_t		36	36	36		1(5)		17
> 
> x86-64: 2.6 GHz Dual-Core AMD Opteron 2218
> atomic_long	55	60	-		6		19
> irqsave/rest	54	54	-		11		19
> trivalue	47	47	-		5		28
> local_t		47	46	-		1		19
> 

Running local_t variant benchmarks
atomic_long: local_inc=395001846/11 local_add=395000325/11 cpu_local_inc=362000295/10 local_read=49000040/1 local_add_return=396000322/11 (total was 1728053248)
irqsave/restore: local_inc=498000400/14 local_add=496000395/14 cpu_local_inc=486000384/14 local_read=68000054/2 local_add_return=502000394/14 (total was 1728053248)
trivalue: local_inc=1325001024/39 local_add=1324001226/39 cpu_local_inc=81000080/2 local_read=786000766/23 local_add_return=4193003781/124 (total was 1728053248)
local_t: local_inc=69000059/2 local_add=69000058/2 cpu_local_inc=42000035/1 local_read=50000043/1 local_add_return=90000076/2 (total was 1728053248, warm_total 62914562)


Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz

two quadcore cpus, x86-32 kernel

It seems Core2 is really better than Core Duo 2,
or its cache is big enough to hold the array of your test...

(at least for l1 & l2, whose 4MB working set fits in cache)

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
stepping        : 6
cpu MHz         : 3000.099
cache size      : 6144 KB    <<<< yes, that's big :) >>>>

If I double the size of the working set

#define NUM_LOCAL_TEST (2*1024*1024)

then I get quite different numbers:

Running local_t variant benchmarks
atomic_long: local_inc=6729007264/100 local_add=6727005943/100 cpu_local_inc=724000569/10 local_read=1030000784/15 local_add_return=6623004616/98 (total was 3456106496)
irqsave/restore: local_inc=4458002796/66 local_add=4459001998/66 cpu_local_inc=971000381/14 local_read=1060000389/15 local_add_return=4528001388/67 (total was 3456106496)
trivalue: local_inc=2871000855/42 local_add=2867000976/42 cpu_local_inc=162000052/2 local_read=1747000551/26 local_add_return=8829002352/131 (total was 3456106496)
local_t: local_inc=2210000492/32 local_add=2206000460/32 cpu_local_inc=84000017/1 local_read=1029000203/15 local_add_return=2216000415/33 (total was 3456106496, warm_total 125829124)

If I now reduce NUM_LOCAL_TEST to 256*1024 so that even the trivalue l3 fits in cache:

Running local_t variant benchmarks
atomic_long: local_inc=98984929/11 local_add=98984889/11 cpu_local_inc=89986248/10 local_read=11998165/1 local_add_return=99003292/11 (total was 2579496960)
irqsave/restore: local_inc=124000102/14 local_add=124000102/14 cpu_local_inc=121000100/14 local_read=17000013/2 local_add_return=126000103/15 (total was 2579496960)
trivalue: local_inc=21000017/2 local_add=20000016/2 cpu_local_inc=20000017/2 local_read=25000021/2 local_add_return=136000110/16 (total was 2579496960)
local_t: local_inc=17000014/2 local_add=17000015/2 cpu_local_inc=11000009/1 local_read=12000010/1 local_add_return=23000019/2 (total was 2579496960, warm_total 15728642)



About trivalues: their use in percpu_counter local storage (one trivalue
for each cpu) would make the accuracy a little bit more lazy...




* Re: local_add_return
  2008-12-16 22:50       ` local_add_return Rusty Russell
@ 2008-12-17  0:01         ` Mathieu Desnoyers
  0 siblings, 0 replies; 33+ messages in thread
From: Mathieu Desnoyers @ 2008-12-17  0:01 UTC
  To: Rusty Russell
  Cc: David Miller, rostedt, akpm, linux-kernel, paulus, benh,
	linux-ia64, linux-s390, Christoph Lameter, Paul E. McKenney,
	Martin Bligh

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Tuesday 16 December 2008 17:43:14 David Miller wrote:
> > Here ya go:
> 
> Very interesting.  There's a little noise there (that first local_inc of 243
> is wrong), but the picture is clear: trivalue is the best implementation for
> sparc64.
> 
> Note: trivalue uses 3 values, so instead of hitting random values across 8MB
> it's across 24MB, and despite the resulting cache damage it's 15% faster.  The
> cpu_local_inc test is a single value, so no cache effects: it shows trivalue
> to be 3 to 3.5 times faster in the cache-hot case.
> 
> This sucks, because it really does mean that there's no one-size-fits-all
> implementation of local_t.  There's also no platform yet where atomic_long_t
> is the right choice; and that's the default!
> 

This problem could be fixed by introducing a local_count_t, which maps
to either local_t or to a trivalue, along with read accessors which sum
the trivalues.

I think we have two different use-cases here:

- local_t is useful as-is for things such as a tracer, which need to
  modify an element of data atomically wrt local interrupts. The
  atomic_long_t, in this case, is the correct fallback.
- local_count_t could be used for fast counters. It could be a
  requirement to only use it from thread/softirq/irq context (never NMI
  or exception) so we are sure the trivalue approach will not lead to
  corruption. local_count_t could use either local_t or a trivalue,
  depending on which is the fastest on a given architecture (a rough
  sketch follows below).
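
A rough sketch of what I mean (all names hypothetical):

	#ifdef ARCH_HAS_FAST_LOCAL_T	/* hypothetical config switch */
	typedef local_t local_count_t;
	#define local_count_inc(l)	local_inc(l)
	#define local_count_read(l)	local_read(l)
	#else
	typedef struct { unsigned long v[3]; } local_count_t;

	static inline void local_count_inc(local_count_t *l)
	{
		/* v[0] = hardirq, v[1] = softirq, v[2] = process context */
		unsigned int idx =
			!(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
			!(preempt_count() & HARDIRQ_MASK);
		l->v[idx]++;
	}

	static inline long local_count_read(local_count_t *l)
	{
		return l->v[0] + l->v[1] + l->v[2];
	}
	#endif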

Mathieu


> Any chance of an IA64 or s390 run?  You can normalize if you like, since
> it's only to compare the different approaches.
> 
> Cheers,
> Rusty.
> 
> Benchmarks for local_t variants
> 
> (This patch also fixes the x86 cpu_local_* macros, which are obviously
> unused).
> 
> I chose a large array (1M longs) for the inc/add/add_return tests so
> the trivalue case would show some cache pressure.
> 
> The cpu_local_inc case is always cache-hot, so it's not comparable to
> the others.
> 
> Time in ns per iteration (bracketed values are with CONFIG_PREEMPT=y):
> 
> 		inc	add	add_return	cpu_local_inc	read
> x86-32: 2.13 GHz Core Duo 2
> atomic_long	118	118	115		17		17
> irqsave/rest	77	78	77		23		16
> trivalue	45	45	127		3(6)		21
> local_t		36	36	36		1(5)		17
> 
> x86-64: 2.6 GHz Dual-Core AMD Opteron 2218
> atomic_long	55	60	-		6		19
> irqsave/rest	54	54	-		11		19
> trivalue	47	47	-		5		28
> local_t		47	46	-		1		19
> 
> PPC64: 2.7 GHz PPC970MP [normalized]
> atomic_long	18	18	20		3(4)		8
> irqsave/rest	10(4)	5(4)	4		8(9)		10(9)
> trivalue	9	9	2		1(3)		10
> local_t		18	18	18		3(4)		8
> 
> Sparc64: UltraSPARC-IIIi
> atomic_long	243	222	-		37		169
> irqsave/rest	205	205	-		25		169
> trivalue	193	193	-		11		221
> local_t		221	221	-		37		169
> 
> Sparc64: Niagara-2
> atomic_long	207	206	-		72		160
> irqsave/rest	228	228	-		78		160
> trivalue:	172	172	-		20		222
> local_t		206	207	-		73		160
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> ---
>  arch/x86/include/asm/local.h |   20 +--
>  init/main.c                  |  223 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 233 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/local.h b/arch/x86/include/asm/local.h
> --- a/arch/x86/include/asm/local.h
> +++ b/arch/x86/include/asm/local.h
> @@ -220,16 +220,16 @@ static inline long local_sub_return(long
>  	preempt_enable();		\
>  })					\
>  
> -#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var((l))))
> -#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var((l)), (i)))
> -#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var((l))))
> -#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var((l))))
> -#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var((l))))
> -#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var((l))))
> +#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
> +#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
> +#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
> +#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
> +#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
> +#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
>  
> -#define __cpu_local_inc(l)	cpu_local_inc((l))
> -#define __cpu_local_dec(l)	cpu_local_dec((l))
> -#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
> -#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
> +#define __cpu_local_inc(l)	cpu_local_inc(l)
> +#define __cpu_local_dec(l)	cpu_local_dec(l)
> +#define __cpu_local_add(i, l)	cpu_local_add((i), l)
> +#define __cpu_local_sub(i, l)	cpu_local_sub((i), l)
>  
>  #endif /* _ASM_X86_LOCAL_H */
> diff --git a/init/main.c b/init/main.c
> --- a/init/main.c
> +++ b/init/main.c
> @@ -534,6 +534,225 @@ void __init __weak thread_info_cache_ini
>  {
>  }
>  
> +/* There are three obvious ways to implement local_t on an arch which
> + * can't do single-instruction inc/dec etc.
> + * 1) atomic_long
> + * 2) irq_save/irq_restore
> + * 3) multiple counters.
> + *
> + * This does a very rough benchmark on each one.
> + */
> +struct local1 {
> +	atomic_long_t v;
> +};
> +struct local2 {
> +	unsigned long v;
> +};
> +struct local3 {
> +	unsigned long v[3];
> +};
> +
> +/* Enough to put some pressure on the caches. */
> +#define NUM_LOCAL_TEST (1024*1024)
> +#define NUM_LOCAL_RUNS (NUM_LOCAL_TEST*32)
> +/* This will make it jump around looking random */
> +#define STRIDE 514001
> +
> +static void *test_local_variants_mem;
> +
> +static void init_test_local_variants(void)
> +{
> +	unsigned long size;
> +	size = max(sizeof(struct local1),
> +		   max(sizeof(struct local2),
> +		       max(sizeof(struct local3), sizeof(local_t))))
> +		* NUM_LOCAL_TEST;
> +	/* Assume this works in early boot. */
> +	test_local_variants_mem = alloc_bootmem_nopanic(size);
> +
> +	if (!test_local_variants_mem) {
> +		printk("test_local_variants: failed to allocate %lu bytes\n",
> +		       size);
> +		return;
> +	}
> +}
> +
> +static void print_result(const char *str,
> +			 struct timespec start, struct timespec end)
> +{
> +	s64 diff;
> +
> +	diff = ktime_to_ns(ktime_sub(timespec_to_ktime(end), timespec_to_ktime(start)));
> +	printk("%s=%lli/%lli ",
> +	       str, diff, diff/NUM_LOCAL_RUNS);
> +}
> +
> +static unsigned int warm_local_test_cache(const void *mem, size_t len)
> +{
> +	unsigned int i, total = 0;
> +	for (i = 0; i < len; i++)
> +		total += ((char *)mem)[i];
> +	return total;
> +}
> +
> +#define TEST_LOOP(expr)				\
> +	n = 0;					\
> +	getnstimeofday(&start);			\
> +	for (i = 0; i < NUM_LOCAL_RUNS; i++) {	\
> +		expr;				\
> +		n += STRIDE;			\
> +		n %= NUM_LOCAL_TEST;		\
> +	}					\
> +	getnstimeofday(&end);
> +
> +/* This doesn't test cache effects at all */
> +#define NUM_PERCPU_VARS		16
> +DEFINE_PER_CPU(struct local1[NUM_PERCPU_VARS], local1_test);
> +DEFINE_PER_CPU(struct local2[NUM_PERCPU_VARS], local2_test);
> +DEFINE_PER_CPU(struct local3[NUM_PERCPU_VARS], local3_test);
> +DEFINE_PER_CPU(local_t[NUM_PERCPU_VARS], local4_test);
> +
> +static void test_local_variants(void)
> +{
> +	struct timespec start, end;
> +	unsigned int i, n;
> +	unsigned long total, warm_total = 0;
> +	struct local1 *l1;
> +	struct local2 *l2;
> +	struct local3 *l3;
> +	local_t *l4;
> +
> +	if (!test_local_variants_mem)
> +		return;
> +
> +	printk("Running local_t variant benchmarks\n");
> +	l1 = test_local_variants_mem;
> +	l2 = test_local_variants_mem;
> +	l3 = test_local_variants_mem;
> +	l4 = test_local_variants_mem;
> +
> +	printk("atomic_long: ");
> +	memset(l1, 0, sizeof(*l1)*NUM_LOCAL_TEST);
> +	TEST_LOOP(atomic_long_inc(&l1[n].v));
> +	print_result("local_inc", start, end);
> +
> +	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
> +	TEST_LOOP(atomic_long_add(1234, &l1[n].v));
> +	print_result("local_add", start, end);
> +
> +	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
> +	TEST_LOOP(atomic_long_inc(&__get_cpu_var(local1_test)[n%NUM_PERCPU_VARS].v));
> +	print_result("cpu_local_inc", start, end);
> +
> +	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
> +	total = 0;
> +	TEST_LOOP(total += atomic_long_read(&l1[n].v));
> +	print_result("local_read", start, end);
> +
> +	warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
> +	TEST_LOOP(total += atomic_long_add_return(7, &l1[n].v));
> +	print_result("local_add_return", start, end);
> +
> +	printk("(total was %lu)\n", total);
> +
> +	printk("irqsave/restore: ");
> +	memset(l2, 0, sizeof(*l2)*NUM_LOCAL_TEST);
> +	TEST_LOOP(unsigned long flags;
> +		  local_irq_save(flags);
> +		  l2[n].v++;
> +		  local_irq_restore(flags));
> +	print_result("local_inc", start, end);
> +
> +	warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
> +	TEST_LOOP(unsigned long flags;
> +		  local_irq_save(flags);
> +		  l2[n].v += 1234;
> +		  local_irq_restore(flags));
> +	print_result("local_add", start, end);
> +
> +	warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
> +	TEST_LOOP(unsigned long flags;
> +		  local_irq_save(flags);
> +		  __get_cpu_var(local2_test)[n%NUM_PERCPU_VARS].v++;
> +		  local_irq_restore(flags));
> +	print_result("cpu_local_inc", start, end);
> +
> +	warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
> +	total = 0;
> +	TEST_LOOP(total += l2[n].v);
> +	print_result("local_read", start, end);
> +
> +	warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
> +	TEST_LOOP(unsigned long flags;
> +		  local_irq_save(flags);
> +		  l2[n].v += 7;
> +		  total += l2[n].v;
> +		  local_irq_restore(flags));
> +	print_result("local_add_return", start, end);
> +	printk("(total was %lu)\n", total);
> +
> +	printk("trivalue: ");
> +	memset(l3, 0, sizeof(*l3)*NUM_LOCAL_TEST);
> +	TEST_LOOP(unsigned int idx
> +			= !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
> +			!(preempt_count() & HARDIRQ_MASK);
> +		  l3[n].v[idx]++);
> +	print_result("local_inc", start, end);
> +
> +	warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
> +	TEST_LOOP(unsigned int idx
> +			= !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
> +			!(preempt_count() & HARDIRQ_MASK);
> +		  l3[n].v[idx] += 1234);
> +	print_result("local_add", start, end);
> +
> +	warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
> +	TEST_LOOP(unsigned int idx
> +			= !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
> +			!(preempt_count() & HARDIRQ_MASK);
> +		  get_cpu_var(local3_test)[n%NUM_PERCPU_VARS].v[idx]++;
> +		  put_cpu_var());
> +	print_result("cpu_local_inc", start, end);
> +
> +	warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
> +	total = 0;
> +	TEST_LOOP(total += l3[n].v[0] + l3[n].v[1] + l3[n].v[2]);
> +	print_result("local_read", start, end);
> +
> +	warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
> +	TEST_LOOP(unsigned long flags;
> +		  local_irq_save(flags);
> +		  l3[n].v[0] += 7;
> +		  total += l3[n].v[0] + l3[n].v[1] + l3[n].v[2];
> +		  local_irq_restore(flags));
> +	print_result("local_add_return", start, end);
> +
> +	printk("(total was %lu)\n", total);
> +
> +	printk("local_t: ");
> +	memset(l4, 0, sizeof(*l4)*NUM_LOCAL_TEST);
> +	TEST_LOOP(local_inc(&l4[n]));
> +	print_result("local_inc", start, end);
> +
> +	warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
> +	TEST_LOOP(local_add(1234, &l4[n]));
> +	print_result("local_add", start, end);
> +
> +	warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
> +	TEST_LOOP(cpu_local_inc(local4_test[n%NUM_PERCPU_VARS]));
> +	print_result("cpu_local_inc", start, end);
> +
> +	warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
> +	total = 0;
> +	TEST_LOOP(total += local_read(&l4[n]));
> +	print_result("local_read", start, end);
> +
> +	warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
> +	TEST_LOOP(total += local_add_return(7, &l4[n]));
> +	print_result("local_add_return", start, end);
> +	printk("(total was %lu, warm_total %lu)\n", total, warm_total);
> +}
> +
>  asmlinkage void __init start_kernel(void)
>  {
>  	char * command_line;
> @@ -630,6 +849,8 @@ asmlinkage void __init start_kernel(void
>  	 */
>  	locking_selftest();
>  
> +	init_test_local_variants();
> +
>  #ifdef CONFIG_BLK_DEV_INITRD
>  	if (initrd_start && !initrd_below_start_ok &&
>  	    page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
> @@ -687,6 +908,8 @@ asmlinkage void __init start_kernel(void
>  	acpi_early_init(); /* before LAPIC and SMP init */
>  
>  	ftrace_init();
> +
> +	test_local_variants();
>  
>  	/* Do the rest non-__init'ed, we're now alive */
>  	rest_init();

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: local_add_return
  2008-12-16 16:25   ` local_add_return Mathieu Desnoyers
@ 2008-12-17 11:23     ` Rusty Russell
  0 siblings, 0 replies; 33+ messages in thread
From: Rusty Russell @ 2008-12-17 11:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Andrew Morton, LKML, David Miller,
	Paul Mackerras, Benjamin Herrenschmidt, Christoph Lameter,
	Paul E. McKenney, Martin Bligh

On Wednesday 17 December 2008 02:55:32 Mathieu Desnoyers wrote:
> Specifically on this comment :
> 
> +/* There are three obvious ways to implement local_t on an arch which
> + * can't do single-instruction inc/dec etc.
> + * 1) atomic_long
> + * 2) irq_save/irq_restore
> + * 3) multiple counters.
> 
> Option 3) is not workable for tracers, because it's not safe against
> some exceptions (e.g. some hardware errors) nor NMIs.

Hmm, nor is option 2.  Now I understand where you were coming from and
I sympathize with your dilemma, but I don't think that non-x86 archs should
pay for it where local_t is used as intended, so I don't think local_t should
be (have been) hijacked for this.  nmi_safe_t?

> Also, local_t
> operations must have preemption disabled before playing on per-cpu data,
> which I don't see in your test. This has to be taken into account in the
> runtime cost.

atomic_long_t implementations don't have to.  local_irq_save does it
as a side effect.  You're right about multiple counters tho.  We can
either do it conditionally or unconditionally, but I think unconditional
makes sense (CONFIG_PREEMPT=y seems to be less popular than it was).
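
For concreteness, the unconditional variant of a trivalue inc would be
something like this sketch (reusing struct local3 and the idx computation
from the benchmark patch; the macro name is invented):

	#define trivalue_inc(l) do {					\
		unsigned int __idx;					\
		preempt_disable();	/* pin the cpu unconditionally */\
		__idx = !(preempt_count() & (HARDIRQ_MASK|SOFTIRQ_MASK))\
			+ !(preempt_count() & HARDIRQ_MASK);		\
		__get_cpu_var(l).v[__idx]++;				\
		preempt_enable();					\
	} while (0)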

> the CPU_OPS work done by Christoph Lameter which use
> segments to address the per-cpu data, which effectively removes the need
> for disabling preemption around local_t operations because the CPU ID
> becomes encoded in a cpu register.

Well, we did this for 32-bit x86 some time ago, so that works today.
64-bit was delayed because of the stack protection code, which needs
a fixed offset for the canary so needs zero-based percpu, but IIRC
that's orthogonal to the CPU_OPS work itself.
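
For illustration: with segment-based percpu addressing the increment is a
single instruction, so it can't be split by an interrupt or migration and
needs no explicit preempt_disable.  Rough sketch only (not the real macro,
and per_cpu__my_counter is a made-up percpu variable):

	/* one insn: safe against irqs and migration on this cpu */
	asm("incl %%fs:%0" : "+m" (per_cpu__my_counter));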

Here's the timing diff when trivalue is fixed here (preempt on)
Before:
	local_inc=45 local_add=45 cpu_local_inc=6 local_read=21 local_add_return=127
After:
	local_inc=47 local_add=47 cpu_local_inc=6 local_read=41 local_add_return=127

Since sparc64 has CONFIG_PREEMPT=n in its defconfig, I think it is still
ahead with trivalue.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: local_add_return
  2008-12-17  0:01         ` local_add_return Mathieu Desnoyers
@ 2008-12-18 22:53           ` Rusty Russell
  -1 siblings, 0 replies; 33+ messages in thread
From: Rusty Russell @ 2008-12-18 22:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: David Miller, rostedt, akpm, linux-kernel, paulus, benh,
	linux-ia64, linux-s390, Christoph Lameter, Paul E. McKenney,
	Martin Bligh

On Wednesday 17 December 2008 10:31:55 Mathieu Desnoyers wrote:
> I think we have two different use-cases here :
> 
> - local_t is useful as-is for things such as a tracer, which need to
>   modify an element of data atomically wrt local interrupts. The
>   atomic_long_t, in this case, is the correct fallback.
> - local_count_t could be used for fast counters.

Hi Mathieu,

   Complete agreement.

   I guess I'm biased towards local_t == counter version, something else
== nmi-safe version because that's what it was originally.  Looking through
the tree, there are only 5 users: module, dmaengine and percpu_counter want
a counter, and tracing and x86 nmi.c want nmi-safe.  There are several other
places I know of which want local_t-the-counter.

   I'll prepare a patch which adds nmi_safe_t, and see how it looks.  There's
no amazing hurry on this, so I won't race to hit the merge window.

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: local_add_return
  2008-12-18 22:53           ` local_add_return Rusty Russell
@ 2008-12-19  3:35             ` Mathieu Desnoyers
  -1 siblings, 0 replies; 33+ messages in thread
From: Mathieu Desnoyers @ 2008-12-19  3:35 UTC (permalink / raw)
  To: Rusty Russell
  Cc: David Miller, rostedt, akpm, linux-kernel, paulus, benh,
	linux-ia64, linux-s390, Christoph Lameter, Paul E. McKenney,
	Martin Bligh

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Wednesday 17 December 2008 10:31:55 Mathieu Desnoyers wrote:
> > I think we have two different use-cases here :
> > 
> > - local_t is useful as-is for things such as a tracer, which need to
> >   modify an element of data atomically wrt local interrupts. The
> >   atomic_long_t, in this case, is the correct fallback.
> > - local_count_t could be used for fast counters.
> 
> Hi Mathieu,
> 
>    Complete agreement.
> 
> >    I guess I'm biased towards local_t == counter version, something else
> == nmi-safe version because that's what it was originally.  Looking through
> the tree, there are only 5 users: module, dmaengine and percpu_counter want
> a counter, and tracing and x86 nmi.c want nmi-safe.  There are several other
> places I know of which want local_t-the-counter.
> 
>    I'll prepare a patch which adds nmi_safe_t, and see how it looks.  There's
> no amazing hurry on this, so I won't race to hit the merge window.
> 

OK,

But can we turn what you call "nmi_safe_t" into "local_atomic_t" then?
Because we have to specify that this type must only be used as part of
per-cpu data with preemption disabled, and we also specify that it is
atomic.

Plus, nmi_safe_t does not make much sense on architectures without NMIs,
where we sometimes disable interrupts to make the modification "atomic"
wrt all other interrupts that can happen.
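
On such architectures the implementation would simply be the irq-disable
version.  A minimal sketch, using the name proposed above (the struct
layout is just for illustration):

	typedef struct { long v; } local_atomic_t;

	static inline void local_atomic_add(long i, local_atomic_t *l)
	{
		unsigned long flags;

		local_irq_save(flags);
		l->v += i;	/* plain op; irqs off makes it atomic wrt
				 * everything on this cpu except NMIs */
		local_irq_restore(flags);
	}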

Mathieu

> Thanks!
> Rusty.

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: local_add_return
  2008-12-19  3:35             ` local_add_return Mathieu Desnoyers
@ 2008-12-19  5:54               ` Rusty Russell
  -1 siblings, 0 replies; 33+ messages in thread
From: Rusty Russell @ 2008-12-19  5:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: David Miller, rostedt, akpm, linux-kernel, paulus, benh,
	linux-ia64, linux-s390, Christoph Lameter, Paul E. McKenney,
	Martin Bligh

On Friday 19 December 2008 14:05:14 Mathieu Desnoyers wrote:
> * Rusty Russell (rusty@rustcorp.com.au) wrote:
> But can we turn what you call "nmi_safe_t" into "local_atomic_t" then?
> Because we have to specify that this type must only be used as part of
> per-cpu data with preemption disabled, and we also specify that it is
> atomic.
> 
> Plus, nmi_safe_t does not make much sense on architectures without NMIs,
> where we sometimes disable interrupts to make the modification "atomic"
> wrt all other interrupts that can happen.

But those archs can use local_t.  I don't like either name local_t nor
atomic_local_t, but renaming sucks too.

OK, how about a different approach?  Since there's really only one case
where we need this local_t property outside arch-specific code, how about
we define ARCH_LOCAL_T_TRACE_SAFE for x86?

Then some trace-specific typedef like "trace_counter_t" which goes to local_t
or atomic_(long?)_t?

Should be a simple patch and pretty clear.
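
Something like this (sketch only; the add_return wrapper is invented for
illustration):

	#ifdef ARCH_LOCAL_T_TRACE_SAFE
	typedef local_t trace_counter_t;
	#define trace_counter_add_return(i, c)	local_add_return((i), (c))
	#else
	typedef atomic_long_t trace_counter_t;
	#define trace_counter_add_return(i, c)	atomic_long_add_return((i), (c))
	#endif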

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: local_add_return
  2008-12-19  5:54               ` local_add_return Rusty Russell
@ 2008-12-19 17:06                 ` Mathieu Desnoyers
  -1 siblings, 0 replies; 33+ messages in thread
From: Mathieu Desnoyers @ 2008-12-19 17:06 UTC (permalink / raw)
  To: Rusty Russell
  Cc: David Miller, rostedt, akpm, linux-kernel, paulus, benh,
	linux-ia64, linux-s390, Christoph Lameter, Paul E. McKenney,
	Martin Bligh

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Friday 19 December 2008 14:05:14 Mathieu Desnoyers wrote:
> > * Rusty Russell (rusty@rustcorp.com.au) wrote:
> > But can we turn what you call "nmi_safe_t" into "local_atomic_t" then?
> > Because we have to specify that this type must only be used as part of
> > per-cpu data with preemption disabled, and we also specify that it is
> > atomic.
> > 
> > Plus, nmi_safe_t does not make much sense on architectures without NMIs,
> > where we sometimes disable interrupts to make the modification "atomic"
> > wrt all other interrupts that can happen.
> 
> But those archs can use local_t.  I don't like either name local_t nor
> atomic_local_t, but renaming sucks too.
> 
> OK, how about a different approach?  Since there's really only one case
> where we need this local_t property outside arch-specific code, how about
> we define ARCH_LOCAL_T_TRACE_SAFE for x86?
> 
> Then some trace-specific typedef like "trace_counter_t" which goes to local_t
> or atomic_(long?)_t?
> 
> Should be a simple patch and pretty clear.
> 

Hrm, is it me or does linking a basic type definition to a single user seem
like the wrong approach?

The idea behind declaring new types is, to me, that they should describe
as generally as possible what they provide and what they are. If we
think of the future, where we might want to use such local atomic types
for other purposes than tracing, I think we will end up regretting such
specific naming scheme. I don't think the argument "because the type has
only one arch-independent user" holds, because the idea behind new types
is that they _will_ be used by others eventually. For instance, we've
done some work on moving the slub allocator to such local atomic
operations last year, and it gave very good results on architectures
where disabling interrupt is costly (threefold acceleration of the
fastpath).

In your trace_counter_t proposal, you don't take into account that (what
I call) local_atomic_long_t is a _new_ primitive, which cannot be
implemented by a trivalue and differs from atomic_long_t, on more
architectures than x86. On mips and powerpc, at least, it can be
implemented as an atomic operation without the memory barriers, which
improves performances a lot.

I think the following scheme would be pretty simple and yet not tied to
any specific user :

local_long_t
  - Fast per-cpu counter, not necessarily atomic.
    Implements long trivalues, or uses local_atomic_long_t.
local_atomic_long_t
  - Fast per-cpu atomic counter.
    Implements per-cpu atomic counters or uses atomic_long_t.
atomic_long_t
  - Global atomic counter.
    Implements globally synchronized atomic operations.

We could do the same with "int" type for :
local_t
local_atomic_t
atomic_t

If we need smaller counters.
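
As a sketch, the fallback chain could be wired up like this (the
__HAVE_ARCH_* guards are invented names, just following the usual
convention):

	#ifndef __HAVE_ARCH_LOCAL_ATOMIC_LONG
	typedef atomic_long_t local_atomic_long_t;	/* always-safe fallback */
	#endif

	#ifndef __HAVE_ARCH_LOCAL_LONG
	typedef local_atomic_long_t local_long_t;	/* or a trivalue */
	#endif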

Mathieu


> Thanks,
> Rusty.

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: local_add_return
  2008-12-19 17:06                 ` local_add_return Mathieu Desnoyers
  (?)
@ 2008-12-20  1:33                   ` Rusty Russell
  -1 siblings, 0 replies; 33+ messages in thread
From: Rusty Russell @ 2008-12-20  1:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: David Miller, rostedt, akpm, linux-kernel, paulus, benh,
	linux-ia64, linux-s390, Christoph Lameter, Paul E. McKenney,
	Martin Bligh

On Saturday 20 December 2008 03:36:27 Mathieu Desnoyers wrote:
> * Rusty Russell (rusty@rustcorp.com.au) wrote:
> > Then some trace-specific typedef like "trace_counter_t" which goes to local_t
> > or atomic_(long?)_t?
> > 
> > Should be a simple patch and pretty clear.
> 
> Hrm, is it me or does linking a basic type definition to a single user seem
> like the wrong approach?

Well, it's an ongoing debate.  Old school kernel coders believe that
infrastructure should be resisted: you implement what you need to, then
if it turns out to be generically useful you put it somewhere that the
second user can access it.

Otherwise we end up with unused infrastructure, or overspecialized
infrastructure which doesn't actually meet the general need.  local_t
displays both these properties.

> The idea behind declaring new types is, to me, that they should describe
> as generally as possible what they provide and what they are. If we
> think of the future, where we might want to use such local atomic types
> for other purposes than tracing, I think we will end up regretting such
> specific naming scheme.

I can be convinced, but I'll need more than speculation.  Assuming
local_atomic_long_t, can you produce a patch which uses it somewhere else?

> local_atomic_long_t is a _new_ primitive, which cannot be
> implemented by a trivalue and differs from atomic_long_t, on more
> architectures than x86. On mips and powerpc, at least, it can be
> implemented as an atomic operation without the memory barriers, which
> improves performances a lot.

OK, you lost me here.  I don't see a memory barrier in the powerpc atomic
ops.  Is there an implied one I missed?

> I think the following scheme would be pretty simple and yet not tied to
> any specific user :
> 
> local_long_t
>   - Fast per-cpu counter, not necessarily atomic.
>     Implements long trivalues, or uses local_atomic_long_t.
> local_atomic_long_t
>   - Fast per-cpu atomic counter.
>     Implements per-cpu atomic counters or uses atomic_long_t.

From the point of view of someone trying to decide what to use, the real
difference is: use local_long_t unless you need the atomic-style operators
which are only available on local_atomic_long_t, or NMI-safe behaviour.
Is this correct?

> We could do the same with "int" type for :
> local_t
> local_atomic_t
> atomic_t
> 
> If we need smaller counters.

Erk... no, renaming one to two is bad enough.  Changing the semantics of
one and introducing three more is horrible.

If we're going to rename, I'd choose local_counter_t and local_atomic_t
(both long: I don't think there's a real penalty is there?).

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: local_add_return
  2008-12-20  1:33                   ` local_add_return Rusty Russell
@ 2008-12-22 18:43                     ` Mathieu Desnoyers
  -1 siblings, 0 replies; 33+ messages in thread
From: Mathieu Desnoyers @ 2008-12-22 18:43 UTC (permalink / raw)
  To: Rusty Russell
  Cc: David Miller, rostedt, akpm, linux-kernel, paulus, benh,
	linux-ia64, linux-s390, Christoph Lameter, Paul E. McKenney,
	Martin Bligh

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Saturday 20 December 2008 03:36:27 Mathieu Desnoyers wrote:
> > * Rusty Russell (rusty@rustcorp.com.au) wrote:
> > > Then some trace-specific typedef like "trace_counter_t" which goes to local_t
> > > or atomic_(long?)_t?
> > > 
> > > Should be a simple patch and pretty clear.
> > 
> > Hrm, is it me or does linking a basic type definition to a single user seem
> > like the wrong approach?
> 
> Well, it's an ongoing debate.  Old school kernel coders believe that
> infrastructure should be resisted: you implement what you need to, then
> if it turns out to be generically useful you put it somewhere that the
> second user can access it.
> 
> Otherwise we end up with unused infrastructure, or overspecialized
> infrastructure which doesn't actually meet the general need.  local_t
> displays both these properties.
> 

Yes... well, given that every iteration on that kind of primitive touches _all_
architectures supported by Linux, I think it's good to think a bit about
the design in advance to minimize the amount of wasted effort. Especially
because it requires some coordination between many architecture
maintainers.

> > The idea behind declaring new types is, to me, that they should describe
> > as generally as possible what they provide and what they are. If we
> > think of the future, where we might want to use such local atomic types
> > for other purposes than tracing, I think we will end up regretting such
> > specific naming scheme.
> 
> I can be convinced, but I'll need more than speculation.  Assuming
> local_atomic_long_t, can you produce a patch which uses it somewhere else?
> 

I had this patch applying on top of Christoph Lameter's vm tree last
February. It did accelerate the slub fastpath allocator by using
cmpxchg_local rather than disabling interrupts. cmpxchg_local is not
using the local_t type, but behaves similarly to local_cmpxchg.

http://lkml.org/lkml/2008/2/28/568
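
The heart of it was an allocation fastpath of roughly this shape
(simplified and quoted from memory, so the field names are approximate):

	/* lockless slub fastpath: retry if an interrupt touched this
	 * cpu's freelist between the read and the cmpxchg_local */
	do {
		object = c->freelist;
		if (unlikely(!object))
			return __slab_alloc(s, gfpflags, node, addr, c);
	} while (cmpxchg_local(&c->freelist, object,
			       object[c->offset]) != object);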


> > local_atomic_long_t is a _new_ primitive, which cannot be
> > implemented by a trivalue and differs from atomic_long_t, on more
> > architectures than x86. On mips and powerpc, at least, it can be
> > implemented as an atomic operation without the memory barriers, which
> > improves performances a lot.
> 
> OK, you lost me here.  I don't see a memory barrier in the powerpc atomic
> ops.  Is there an implied one I missed?
> 

Look for LWSYNC_ON_SMP and ISYNC_ON_SMP in 
arch/powerpc/include/asm/atomic.h

They map to the lwsync and isync instructions, which are memory and
instruction-execution ordering barriers. Both become unnecessary when
modifying per-cpu data from a single CPU.
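
For example, atomic_add_return() there looks roughly like this (quoting
from memory, slightly simplified); the local variant would be the same
loop with the two barrier macros dropped:

	static __inline__ int atomic_add_return(int a, atomic_t *v)
	{
		int t;

		__asm__ __volatile__(
		LWSYNC_ON_SMP		/* droppable for a local op */
	"1:	lwarx	%0,0,%2\n\
		add	%0,%1,%0\n\
		stwcx.	%0,0,%2\n\
		bne-	1b"
		ISYNC_ON_SMP		/* droppable for a local op */
		: "=&r" (t)
		: "r" (a), "r" (&v->counter)
		: "cc", "memory");

		return t;
	}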

> > I think the following scheme would be pretty simple and yet not tied to
> > any specific user :
> > 
> > local_long_t
> >   - Fast per-cpu counter, not necessarily atomic.
> >     Implements long trivalues, or uses local_atomic_long_t.
> > local_atomic_long_t
> >   - Fast per-cpu atomic counter.
> >     Implements per-cpu atomic counters or uses atomic_long_t.
> 
> From the point of view of someone trying to decide what to use, the real
> difference is: use local_long_t unless you need the atomic-style operators
> which are only available on local_atomic_long_t, or NMI-safe behaviour.
> Is this correct?
> 

Yes.

> > We could do the same with "int" type for :
> > local_t
> > local_atomic_t
> > atomic_t
> > 
> > If we need smaller counters.
> 
> Erk... no, renaming one to two is bad enough.  Changing the semantics of
> one and introducing three more is horrible.
> 
> If we're going to rename, I'd choose local_counter_t and local_atomic_t
> (both long: I don't think there's a real penalty is there?).
> 

The penalty is only space and wasted cache-lines whenever the data fits
in smaller data types, but I think we can start with a single data type
(long) and add more if needed. I agree with you on renaming, it's bad:
people trying to forward-port their code will have a hard time managing
the type behavior change, especially if the compiler does not complain.
local_counter_t and local_atomic_t seem good to me, except for the fact
that atomic_t maps to "int" and local_atomic_t would map to "long",
which might be unexpected. If possible, I'd try to follow the current
semantics of "atomic_t" for int and "atomic_long_t" for long, because I
think those types should offer a similar interface. I know that
local_counter_long_t and local_atomic_long_t are painful to write, but
that would follow the current atomic_t vs atomic_long_t semantics. Hm?

Mathieu

> Thanks,
> Rusty.
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: local_add_return
  2008-12-22 18:43                     ` local_add_return Mathieu Desnoyers
@ 2008-12-24 11:54                       ` Rusty Russell
  -1 siblings, 0 replies; 33+ messages in thread
From: Rusty Russell @ 2008-12-24 11:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: David Miller, rostedt, akpm, linux-kernel, paulus, benh,
	linux-ia64, linux-s390, Christoph Lameter, Paul E. McKenney,
	Martin Bligh

On Tuesday 23 December 2008 05:13:28 Mathieu Desnoyers wrote:
> > I can be convinced, but I'll need more than speculation.  Assuming
> > local_long_atomic_t, can you produce a patch which uses it somewhere else?
> 
> I had a patch that applied on top of Christoph Lameter's vm tree last
> February. It accelerated the slub allocator fastpath by using
> cmpxchg_local rather than disabling interrupts. cmpxchg_local does not
> use the local_t type, but it behaves similarly to local_cmpxchg.
> 
> http://lkml.org/lkml/2008/2/28/568

OK, I'll buy that.  So we split local_t into a counter and an atomic type.

> I know that
> local_counter_long_t and local_atomic_long_t are painful to write, but
> that would follow the current atomic_t vs atomic_long_t semantics. Hm?

OK, I've looked at how they're used, to try to figure out whether long
is the right thing.  Counters generally want to be long, but I was in doubt
about atomics; yet grep shows that atomic_long_t is quite popular.  Then
I hit struct nfs_iostats which would want a u64 and a long.  I don't think
we want local_counter_u64 etc.

Just thinking out loud, perhaps a new *type* is the wrong direction?  How
about a set of macros which take a fundamental type, such as:

	DECLARE_LOCAL_COUNTER(type, name);
	local_counter_inc(type, addr);
	...
	DECLARE_LOCAL_ATOMIC(type, name);
	local_atomic_add_return(type, addr);

This allows pointers, u32, u64, long, etc.  If a 32-bit arch can't do 64-bit
local_counter_inc easily, at least the hairy 64-bit code can be eliminated at
compile time.
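
A generic fallback could be as dumb as this sketch (macro names
illustrative; archs with interrupt-safe single-instruction RMW ops
would supply their own versions):

	#define DECLARE_LOCAL_COUNTER(type, name)	type name

	#define local_counter_add(type, addr, v)		\
	do {							\
		unsigned long __flags;				\
		/* make the RMW interrupt-safe the slow way */	\
		local_irq_save(__flags);			\
		*(addr) += (v);					\
		local_irq_restore(__flags);			\
	} while (0)

	#define local_counter_inc(type, addr)			\
		local_counter_add(type, addr, 1)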

Or maybe that's overdesign?
Rusty.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: local_add_return
  2008-12-24 11:54                       ` local_add_return Rusty Russell
@ 2008-12-24 18:53                         ` Mathieu Desnoyers
  -1 siblings, 0 replies; 33+ messages in thread
From: Mathieu Desnoyers @ 2008-12-24 18:53 UTC (permalink / raw)
  To: Rusty Russell
  Cc: David Miller, rostedt, akpm, linux-kernel, paulus, benh,
	linux-ia64, linux-s390, Christoph Lameter, Paul E. McKenney,
	Martin Bligh

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Tuesday 23 December 2008 05:13:28 Mathieu Desnoyers wrote:
> > > I can be convinced, but I'll need more than speculation.  Assuming
> > > local_long_atomic_t, can you produce a patch which uses it somewhere else?
> > 
> > I had a patch that applied on top of Christoph Lameter's vm tree last
> > February. It accelerated the slub allocator fastpath by using
> > cmpxchg_local rather than disabling interrupts. cmpxchg_local does not
> > use the local_t type, but it behaves similarly to local_cmpxchg.
> > 
> > http://lkml.org/lkml/2008/2/28/568
> 
> OK, I'll buy that.  So we split local_t into a counter and an atomic type.
> 
> > I know that
> > local_counter_long_t and local_atomic_long_t are painful to write, but
> > that would follow the current atomic_t vs atomic_long_t semantics. Hm?
> 
> OK, I've looked at how they're used, to try to figure out whether long
> is the right thing.  Counters generally want to be long, but I was in doubt
> about atomics; yet grep shows that atomic_long_t is quite popular.  Then
> I hit struct nfs_iostats which would want a u64 and a long.  I don't think
> we want local_counter_u64 etc.
> 
> Just thinking out loud, perhaps a new *type* is the wrong direction?  How
> about a set of macros which take a fundamental type, such as:
> 
> 	DECLARE_LOCAL_COUNTER(type, name);
> 	local_counter_inc(type, addr);
> 	...
> 	DECLARE_LOCAL_ATOMIC(type, name);
> 	local_atomic_add_return(type, addr);
> 
> This allows pointers, u32, u64, long, etc.  If a 32-bit arch can't do 64-bit
> local_counter_inc easily, at least the hairy 64-bit code can be eliminated at
> compile time.
> 
> Or maybe that's overdesign?
> Rusty.

Yeah, I also thought of this, but I am not sure every architecture
provides primitives to modify u16 or u8 data atomically the way x86 does.
But yes, I remember hearing that Christoph Lameter was interested in using
unsigned char or short atomic counters for the vm allocator in the past.
The rationale was mostly that he wanted to keep a counter in a very
small data type, expecting to "poll" the counter periodically (e.g.
every X increments) and sum the totals somewhere else.

So I think it would be the right design in the end, if we want to allow
wider use of such atomic primitives for counters without disabling
interrupts. And I would propose we use a BUILD_BUG_ON() when the
architecture does not support an atomic operation on a specific type.
We should also document which type sizes are supported portably and
which are architecture-specific.
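
Something like the following sketch, say, for an architecture that only
has 32- and 64-bit interrupt-safe updates (arch_local_add() is a
made-up name standing in for whatever the arch really provides):

	#define local_atomic_add(type, addr, v)				\
	do {								\
		BUILD_BUG_ON(sizeof(type) != 4 && sizeof(type) != 8);	\
		arch_local_add((addr), (v));				\
	} while (0)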

Or, as you say, maybe it's overdesign? If we have to pick something
simple, just supporting "long" would be a good start.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2008-12-24 18:53 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-15 13:47 local_add_return Steven Rostedt
2008-12-16  6:33 ` local_add_return Rusty Russell
2008-12-16  6:57   ` local_add_return David Miller
2008-12-16  7:13   ` local_add_return David Miller
2008-12-16 22:38     ` local_add_return Rusty Russell
2008-12-16 22:50       ` local_add_return Rusty Russell
2008-12-16 23:25       ` local_add_return Luck, Tony
2008-12-16 23:25         ` local_add_return Luck, Tony
2008-12-16 23:43       ` local_add_return Heiko Carstens
2008-12-16 23:43         ` local_add_return Heiko Carstens
2008-12-16 23:59       ` local_add_return Eric Dumazet
2008-12-16 23:59         ` local_add_return Eric Dumazet
2008-12-17  0:01       ` local_add_return Mathieu Desnoyers
2008-12-17  0:01         ` local_add_return Mathieu Desnoyers
2008-12-18 22:52         ` local_add_return Rusty Russell
2008-12-18 22:53           ` local_add_return Rusty Russell
2008-12-19  3:35           ` local_add_return Mathieu Desnoyers
2008-12-19  3:35             ` local_add_return Mathieu Desnoyers
2008-12-19  5:54             ` local_add_return Rusty Russell
2008-12-19  5:54               ` local_add_return Rusty Russell
2008-12-19 17:06               ` local_add_return Mathieu Desnoyers
2008-12-19 17:06                 ` local_add_return Mathieu Desnoyers
2008-12-20  1:33                 ` local_add_return Rusty Russell
2008-12-20  1:45                   ` local_add_return Rusty Russell
2008-12-20  1:33                   ` local_add_return Rusty Russell
2008-12-22 18:43                   ` local_add_return Mathieu Desnoyers
2008-12-22 18:43                     ` local_add_return Mathieu Desnoyers
2008-12-24 11:42                     ` local_add_return Rusty Russell
2008-12-24 11:54                       ` local_add_return Rusty Russell
2008-12-24 18:53                       ` local_add_return Mathieu Desnoyers
2008-12-24 18:53                         ` local_add_return Mathieu Desnoyers
2008-12-16 16:25   ` local_add_return Mathieu Desnoyers
2008-12-17 11:23     ` local_add_return Rusty Russell
