linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
@ 2014-06-12 13:56 Tejun Heo
  2014-06-12 15:34 ` Paul E. McKenney
  2014-06-17 19:27 ` Christoph Lameter
  0 siblings, 2 replies; 43+ messages in thread
From: Tejun Heo @ 2014-06-12 13:56 UTC (permalink / raw)
  To: Christoph Lameter, David Howells, Paul E. McKenney
  Cc: Linus Torvalds, Andrew Morton, Oleg Nesterov, linux-kernel

From 2ff90ab638d50d6191ba3a3564b53fccb78bd4cd Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 11 Jun 2014 20:47:02 -0400

percpu areas are zeroed on allocation and, by their nature, accessed
from multiple cpus.  Consider the following scenario.

 p = NULL;

	CPU-1				CPU-2
 p = alloc_percpu()		if (p)
					WARN_ON(this_cpu_read(*p));

As all percpu areas are zeroed on allocation, CPU-2 should never
trigger the WARN; however, there's no barrier between zeroing of the
percpu regions and the assignment of @p or between testing of @p and
dereferencing it, and CPU-2 may see a garbage value from before the
clearing and trigger the warning.

Note that this may happen even on archs which don't require data
dependency barriers.  While CPU-2 wouldn't reorder percpu area access
above the testing of @p, nothing prevents CPU-1 from scheduling
zeroing after the assignment of @p.

This patch fixes the issue by adding a smp_wmb() before a newly
allocated percpu pointer is returned and a smp_read_barrier_depends()
in __verify_pcpu_ptr() which is guaranteed to be invoked at least once
before every percpu access.  It currently may be invoked multiple
times per operation which isn't optimal.  Future patches will update
the definitions so that the macro is invoked only once per access.

It can be argued that smp_read_barrier_depends() is the responsibility
of the caller; however, discerning local and remote accesses gets very
confusing with percpu areas (initialized remotely but local to this
cpu and vice versa) and the data dependency barrier is free on all archs
except for alpha, so I think it's better to include it as part of
percpu accessors and operations.

I wonder whether we also need a smp_wmb() for other __GFP_ZERO
allocations.  There isn't one and requiring the users to perform
smp_wmb() to ensure the propagation of zeroing seems too subtle.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/percpu-defs.h | 15 ++++++++++++---
 mm/percpu.c                 |  9 +++++++++
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index a5fc7d0..b91bb37 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_PERCPU_DEFS_H
 #define _LINUX_PERCPU_DEFS_H
 
+#include <asm/barrier.h>
+
 /*
  * Base implementations of per-CPU variable declarations and definitions, where
  * the section in which the variable is to be placed is provided by the
@@ -19,9 +21,15 @@
 	__attribute__((section(".discard"), unused))
 
 /*
- * Macro which verifies @ptr is a percpu pointer without evaluating
- * @ptr.  This is to be used in percpu accessors to verify that the
- * input parameter is a percpu pointer.
+ * This macro serves two purposes.  It verifies @ptr is a percpu pointer
+ * without evaluating @ptr and provides the data dependency barrier paired
+ * with smp_wmb() at the end of the allocation path so that the memory
+ * clearing in the allocation path is visible to all percpu accesses.
+ *
+ * The existence of the data dependency barrier is guaranteed and percpu
+ * users can take advantage of it - e.g. percpu area updates followed by
+ * smp_wmb() and then a percpu pointer assignment are guaranteed to be
+ * visible to accessors which access through the assigned percpu pointer.
  *
  * + 0 is required in order to convert the pointer type from a
  * potential array type to a pointer to a single item of the array.
@@ -29,6 +37,7 @@
 #define __verify_pcpu_ptr(ptr)	do {					\
 	const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;	\
 	(void)__vpp_verify;						\
+	smp_read_barrier_depends();					\
 } while (0)
 
 /*
diff --git a/mm/percpu.c b/mm/percpu.c
index 2ddf9a9..bd3cab8 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -816,6 +816,15 @@ area_found:
 	/* return address relative to base address */
 	ptr = __addr_to_pcpu_ptr(chunk->base_addr + off);
 	kmemleak_alloc_percpu(ptr, size);
+
+	/*
+	 * The following wmb is paired with the data dependency barrier in
+	 * the percpu accessors and guarantees that the zeroing of the
+	 * percpu areas in pcpu_populate_chunk() is visible to all percpu
+	 * accesses through the returned percpu pointer.  The accessors get
+	 * their data dependency barrier from __verify_pcpu_ptr().
+	 */
+	smp_wmb();
 	return ptr;
 
 fail_unlock:
-- 
1.9.3
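The ordering requirement the changelog describes can be modeled in plain C11 atomics.  The following is a userspace sketch under an assumed mapping (the smp_wmb() before publication is modeled as a release store, the accessor's data dependency barrier as an acquire load); the function names are illustrative, not kernel APIs:

```c
/*
 * Single-process sketch of the changelog scenario, with C11 atomics
 * standing in for the kernel primitives (assumed mapping: smp_wmb()
 * before returning the pointer ~ a release store, the data dependency
 * barrier in the accessor ~ an acquire load).  In the kernel,
 * cpu1_publish() and cpu2_read() would run on different CPUs.
 */
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic(int *) p;	/* the shared pointer from the changelog */

/* CPU-1: p = alloc_percpu() -- zero the area, then publish */
void cpu1_publish(void)
{
	int *area = malloc(sizeof(*area));

	*area = 0;	/* models the zeroing in pcpu_populate_chunk() */
	/* release: zeroing is ordered before the pointer becomes visible */
	atomic_store_explicit(&p, area, memory_order_release);
}

/* CPU-2: if (p) WARN_ON(this_cpu_read(*p)) -- returns -1 when !p */
int cpu2_read(void)
{
	/* acquire stands in for the data dependency barrier */
	int *q = atomic_load_explicit(&p, memory_order_acquire);

	return q ? *q : -1;
}
```

With the release/acquire pair in place, a reader that observes the pointer must also observe the zeroed area, which is exactly the guarantee the smp_wmb()/smp_read_barrier_depends() pairing in the patch provides.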

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-12 13:56 [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations Tejun Heo
@ 2014-06-12 15:34 ` Paul E. McKenney
  2014-06-12 15:52   ` Tejun Heo
  2014-06-17 19:27 ` Christoph Lameter
  1 sibling, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2014-06-12 15:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Thu, Jun 12, 2014 at 09:56:30AM -0400, Tejun Heo wrote:
> From 2ff90ab638d50d6191ba3a3564b53fccb78bd4cd Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 11 Jun 2014 20:47:02 -0400
> 
> percpu areas are zeroed on allocation and, by their nature, accessed
> from multiple cpus.  Consider the following scenario.
> 
>  p = NULL;
> 
> 	CPU-1				CPU-2
>  p = alloc_percpu()		if (p)
> 					WARN_ON(this_cpu_read(*p));
> 
> As all percpu areas are zeroed on allocation, CPU-2 should never
> trigger the WARN; however, there's no barrier between zeroing of the
> percpu regions and the assignment of @p or between testing of @p and
> dereferencing it, and CPU-2 may see a garbage value from before the
> clearing and trigger the warning.
> 
> Note that this may happen even on archs which don't require data
> dependency barriers.  While CPU-2 wouldn't reorder percpu area access
> above the testing of @p, nothing prevents CPU-1 from scheduling
> zeroing after the assignment of @p.
> 
> This patch fixes the issue by adding a smp_wmb() before a newly
> allocated percpu pointer is returned and a smp_read_barrier_depends()
> in __verify_pcpu_ptr() which is guaranteed to be invoked at least once
> before every percpu access.  It currently may be invoked multiple
> times per operation which isn't optimal.  Future patches will update
> the definitions so that the macro is invoked only once per access.
> 
> It can be argued that smp_read_barrier_depends() is the responsibility
> of the caller; however, discerning local and remote accesses gets very
> confusing with percpu areas (initialized remotely but local to this
> cpu and vice versa) and the data dependency barrier is free on all archs
> except for alpha, so I think it's better to include it as part of
> percpu accessors and operations.

OK, I will bite...  Did you find a bug of this form?  (I do see a
couple of possible bugs on a quick look, so maybe...)

I would argue that the code above should instead say something like:

	smp_store_release(p, alloc_percpu());

I was worried about use of per_cpu() by the reading CPU, but of course
in that case, things are initialized at compile time.
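The smp_store_release() pattern suggested above can be sketched in userspace C11 atomics (assumed mapping: the release store models smp_store_release(), the acquire load models the reader's dependency ordering; the struct and function names are made up for illustration).  It also shows why a release at the assignment covers non-zero initialization, which an allocator-side barrier cannot:

```c
/*
 * Sketch of publishing via a release store at the assignment.  The
 * release orders *all* prior initialization -- the zeroing and any
 * non-zero fields alike -- before the pointer becomes visible, so no
 * barrier inside the allocator is needed for this caller.
 */
#include <stdatomic.h>
#include <stdlib.h>

struct stats {
	long threshold;		/* non-zero initialization */
	long count;		/* stays zeroed */
};

static _Atomic(struct stats *) stats_p;

void publish_stats(void)
{
	struct stats *s = calloc(1, sizeof(*s));	/* zeroed alloc */

	s->threshold = 100;	/* additional non-zero init */
	/* kernel form would be: smp_store_release(&stats_p, s); */
	atomic_store_explicit(&stats_p, s, memory_order_release);
}

/* reader: safe to dereference anything reachable through the pointer */
long read_threshold(void)
{
	struct stats *s = atomic_load_explicit(&stats_p,
					       memory_order_acquire);

	return s ? s->threshold : -1;	/* -1: not yet published */
}
```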

> I wonder whether we also need a smp_wmb() for other __GFP_ZERO
> allocations.  There isn't one and requiring the users to perform
> smp_wmb() to ensure the propagation of zeroing seems too subtle.

I would say "no" even if we do decide that alloc_percpu() needs
an smp_wmb().  The reason is that you really do have to store the
pointer at some point, and you should use smp_store_release() for
this task, at least if you are storing to something accessible to
other CPUs.

						Thanx, Paul

> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/percpu-defs.h | 15 ++++++++++++---
>  mm/percpu.c                 |  9 +++++++++
>  2 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
> index a5fc7d0..b91bb37 100644
> --- a/include/linux/percpu-defs.h
> +++ b/include/linux/percpu-defs.h
> @@ -1,6 +1,8 @@
>  #ifndef _LINUX_PERCPU_DEFS_H
>  #define _LINUX_PERCPU_DEFS_H
> 
> +#include <asm/barrier.h>
> +
>  /*
>   * Base implementations of per-CPU variable declarations and definitions, where
>   * the section in which the variable is to be placed is provided by the
> @@ -19,9 +21,15 @@
>  	__attribute__((section(".discard"), unused))
> 
>  /*
> - * Macro which verifies @ptr is a percpu pointer without evaluating
> - * @ptr.  This is to be used in percpu accessors to verify that the
> - * input parameter is a percpu pointer.
> + * This macro serves two purposes.  It verifies @ptr is a percpu pointer
> + * without evaluating @ptr and provides the data dependency barrier paired
> + * with smp_wmb() at the end of the allocation path so that the memory
> + * clearing in the allocation path is visible to all percpu accesses.
> + *
> + * The existence of the data dependency barrier is guaranteed and percpu
> + * users can take advantage of it - e.g. percpu area updates followed by
> + * smp_wmb() and then a percpu pointer assignment are guaranteed to be
> + * visible to accessors which access through the assigned percpu pointer.
>   *
>   * + 0 is required in order to convert the pointer type from a
>   * potential array type to a pointer to a single item of the array.
> @@ -29,6 +37,7 @@
>  #define __verify_pcpu_ptr(ptr)	do {					\
>  	const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;	\
>  	(void)__vpp_verify;						\
> +	smp_read_barrier_depends();					\
>  } while (0)
> 
>  /*
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 2ddf9a9..bd3cab8 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -816,6 +816,15 @@ area_found:
>  	/* return address relative to base address */
>  	ptr = __addr_to_pcpu_ptr(chunk->base_addr + off);
>  	kmemleak_alloc_percpu(ptr, size);
> +
> +	/*
> +	 * The following wmb is paired with the data dependency barrier in
> +	 * the percpu accessors and guarantees that the zeroing of the
> +	 * percpu areas in pcpu_populate_chunk() is visible to all percpu
> +	 * accesses through the returned percpu pointer.  The accessors get
> +	 * their data dependency barrier from __verify_pcpu_ptr().
> +	 */
> +	smp_wmb();
>  	return ptr;
> 
>  fail_unlock:
> -- 
> 1.9.3
> 



* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-12 15:34 ` Paul E. McKenney
@ 2014-06-12 15:52   ` Tejun Heo
  2014-06-17 14:41     ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2014-06-12 15:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christoph Lameter, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

Hello, Paul.

On Thu, Jun 12, 2014 at 08:34:26AM -0700, Paul E. McKenney wrote:
> > It can be argued that smp_read_barrier_depends() is the responsibility
> > of the caller; however, discerning local and remote accesses gets very
> > confusing with percpu areas (initialized remotely but local to this
> > cpu and vice versa) and the data dependency barrier is free on all archs
> > except for alpha, so I think it's better to include it as part of
> > percpu accessors and operations.
> 
> OK, I will bite...  Did you find a bug of this form?  (I do see a
> couple of possible bugs on a quick look, so maybe...)

I don't have an actual bug report or repro case, which isn't too
surprising given that this can't happen on x86, but I know of places
where a percpu pointer is used opportunistically (if allocated), which
would be affected by this.

> I would argue that the code above should instead say something like:
> 
> 	smp_store_release(p, alloc_percpu());

Well, we can always say that it's the caller's responsibility to put
in enough barriers, but it becomes weird for percpu areas because the
ownership of the allocated areas isn't really clear.  Whether each
allocated cpu-specific area's queued writes belong to the allocating
cpu or the cpu that the area is associated with is an implementation
detail which may theoretically change and as such I think it'd be best
for the problem to be dealt with in percpu allocator and accessors
proper.  I think this is way too subtle to ask the users to handle
correctly.

> I was worried about use of per_cpu() by the reading CPU, but of course
> in that case, things are initialized at compile time.

Not really.  The template is initialized on kernel load.  The actual
percpu areas have to be allocated dynamically; however, this happens
while the machine is still running UP, so this should be okay.

> > I wonder whether we also need a smp_wmb() for other __GFP_ZERO
> > allocations.  There isn't one and requiring the users to perform
> > smp_wmb() to ensure the propagation of zeroing seems too subtle.
> 
> I would say "no" even if we do decide that alloc_percpu() needs
> an smp_wmb().  The reason is that you really do have to store the
> pointer at some point, and you should use smp_store_release() for
> this task, at least if you are storing to something accessible to
> other CPUs.

For normal allocations, I don't have a strong opinion.  I'd prefer to
think of memory coming out of the allocator as having memory ordering
already taken care of, but it is a lot less murky than with percpu
areas.

Thanks.

-- 
tejun


* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-12 15:52   ` Tejun Heo
@ 2014-06-17 14:41     ` Paul E. McKenney
  2014-06-17 15:27       ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2014-06-17 14:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Thu, Jun 12, 2014 at 11:52:27AM -0400, Tejun Heo wrote:
> Hello, Paul.
> 
> On Thu, Jun 12, 2014 at 08:34:26AM -0700, Paul E. McKenney wrote:
> > > It can be argued that smp_read_barrier_depends() is the responsibility
> > > of the caller; however, discerning local and remote accesses gets very
> > > confusing with percpu areas (initialized remotely but local to this
> > > cpu and vice versa) and the data dependency barrier is free on all archs
> > > except for alpha, so I think it's better to include it as part of
> > > percpu accessors and operations.
> > 
> > OK, I will bite...  Did you find a bug of this form?  (I do see a
> > couple of possible bugs on a quick look, so maybe...)
> 
> I don't have an actual bug report or repro case, which isn't too
> surprising given that this can't happen on x86, but I know of places
> where a percpu pointer is used opportunistically (if allocated), which
> would be affected by this.

Fair enough.

> > I would argue that the code above should instead say something like:
> > 
> > 	smp_store_release(p, alloc_percpu());
> 
> Well, we can always say that it's the caller's responsibility to put
> in enough barriers, but it becomes weird for percpu areas because the
> ownership of the allocated areas isn't really clear.  Whether each
> allocated cpu-specific area's queued writes belong to the allocating
> cpu or the cpu that the area is associated with is an implementation
> detail which may theoretically change and as such I think it'd be best
> for the problem to be dealt with in percpu allocator and accessors
> proper.  I think this is way too subtle to ask the users to handle
> correctly.

But in the case where the allocation is initially zeroed, then a few
fields are initialized to non-zero values, the memory barrier in the
allocator is insufficient.  I believe that it will be easier for people
to always use smp_store_release() or whatever than to not need it if
the initialization is strictly zero, but to need it if additional
initialization is required.

> > I was worried about use of per_cpu() by the reading CPU, but of course
> > in that case, things are initialized at compile time.
> 
> Not really.  The template is initialized on kernel load.  The actual
> percpu areas have to be allocated dynamically; however, this happens
> while the machine is still running UP, so this should be okay.

Good point.  How about per-CPU variables that are introduced by
loadable modules?  (I would guess that there are plenty of memory
barriers in the load process, given that text and data also need
to be visible to other CPUs.)

> > > I wonder whether we also need a smp_wmb() for other __GFP_ZERO
> > > allocations.  There isn't one and requiring the users to perform
> > > smp_wmb() to ensure the propagation of zeroing seems too subtle.
> > 
> > I would say "no" even if we do decide that alloc_percpu() needs
> > an smp_wmb().  The reason is that you really do have to store the
> > pointer at some point, and you should use smp_store_release() for
> > this task, at least if you are storing to something accessible to
> > other CPUs.
> 
> For normal allocations, I don't have a strong opinion.  I'd prefer to
> think of memory coming out of the allocator to have memory ordering
> already taken care of but it is a lot less murky than with percpu
> areas.

Again, it won't help for the allocator to strongly order the
initialization to zero if there are additional initializations of some
fields to non-zero values.  And again, it should be a lot easier to
require the smp_store_release() or whatever uniformly than only in cases
where additional initialization occurred.

							Thanx, Paul



* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 14:41     ` Paul E. McKenney
@ 2014-06-17 15:27       ` Tejun Heo
  2014-06-17 15:56         ` Christoph Lameter
                           ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Tejun Heo @ 2014-06-17 15:27 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christoph Lameter, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel, Rusty Russell

Hello, Paul.

On Tue, Jun 17, 2014 at 07:41:51AM -0700, Paul E. McKenney wrote:
> > Well, we can always say that it's the caller's responsibility to put
> > in enough barriers, but it becomes weird for percpu areas because the
> > ownership of the allocated areas isn't really clear.  Whether each
> > allocated cpu-specific area's queued writes belong to the allocating
> > cpu or the cpu that the area is associated with is an implementation
> > detail which may theoretically change and as such I think it'd be best
> > for the problem to be dealt with in percpu allocator and accessors
> > proper.  I think this is way too subtle to ask the users to handle
> > correctly.
> 
> But in the case where the allocation is initially zeroed, then a few
> fields are initialized to non-zero values, the memory barrier in the
> allocator is insufficient.  I believe that it will be easier for people

Oh, definitely, but in those cases it's clear whose responsibility it
is.  The task/cpu which modified the memory owns the in-flight
modifications and is responsible for propagation.  The problem with
zeroed allocation, especially for percpu areas, is that it isn't clear
who owns the zeroing.  Again, there's no reason the zeroing can't
belong to each local cpu, which could have performance benefits if
percpu allocations ever become frequent enough.  To me, this
really seems like an implementation detail which shouldn't leak to the
users of the allocator.

> to always use smp_store_release() or whatever than to not need it if
> the initialization is strictly zero, but to need it if additional
> initialization is required.

That goes further and would require dedicated assignment macros for
percpu pointers, a la rcu_assign_pointer().  Yeah, I think that would
be helpful.  It'd take quite a bit of work but percpu allocator usage
is still in low hundreds, so it should be doable.  The whole thing is
way too subtle to require each user to handle it manually.

So, how about the followings?

* Add data dependency barrier to percpu accessors and write barrier to
  the allocator with the comment that this will be replaced with
  proper assignment macros and mark this change w/ -stable.

* Later, introduce percpu pointer assignment macros and convert all
  users and remove the wmb added above.
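The assignment macros from the second step could look roughly like the following.  Both names, percpu_assign_pointer() and percpu_dereference(), are hypothetical (nothing in the patch defines them), and the kernel barriers are modeled with C11 release/acquire so the sketch compiles in userspace:

```c
/*
 * Hypothetical sketch of dedicated percpu pointer assignment macros,
 * modeled on rcu_assign_pointer()/rcu_dereference().  In the kernel
 * the store side would be smp_store_release() and the load side a
 * dependency-ordered dereference; here both are modeled with C11
 * atomics so the example is compilable outside the kernel.
 */
#include <stdatomic.h>
#include <stdlib.h>

#define percpu_assign_pointer(p, v) \
	atomic_store_explicit(&(p), (v), memory_order_release)

#define percpu_dereference(p) \
	atomic_load_explicit(&(p), memory_order_acquire)

static _Atomic(long *) counter_p;

void counter_init(void)
{
	long *c = calloc(1, sizeof(*c));	/* zeroed, like alloc_percpu() */

	percpu_assign_pointer(counter_p, c);	/* ordered publication */
}

long counter_read(void)
{
	long *c = percpu_dereference(counter_p);

	return c ? *c : -1;	/* -1: not yet assigned */
}
```

Converting users to such macros would make the publication ordering explicit at each assignment site and allow the blanket wmb in the allocator to be dropped, as the second step proposes.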

> > > I was worried about use of per_cpu() by the reading CPU, but of course
> > > in that case, things are initialized at compile time.
> > 
> > Not really.  The template is initialized on kernel load.  The actual
> > percpu areas have to be allocated dynamically; however, this happens
> > while the machine is still running UP, so this should be okay.
> 
> Good point.  How about per-CPU variables that are introduced by
> loadable modules?  (I would guess that there are plenty of memory
> barriers in the load process, given that text and data also need
> to be visible to other CPUs.)

(cc'ing Rusty, hi!)

Percpu initialization happens in post_relocation() before
module_finalize().  There seem to be enough operations which can act
as a write barrier afterwards, but nothing seems explicit.

I have no idea how we're guaranteeing that .data is visible to all
cpus without a barrier on the reader side.  Maybe we don't allow something
like the following?

  module init				built-in code

  static int mod_static_var = X;	if (builtin_ptr)
  builtin_ptr = &mod_static_var;		WARN_ON(*builtin_ptr != X);

Rusty, can you please enlighten me?

> > For normal allocations, I don't have a strong opinion.  I'd prefer to
> > think of memory coming out of the allocator to have memory ordering
> > already taken care of but it is a lot less murky than with percpu
> > areas.
> 
> Again, it won't help for the allocator to strongly order the
> initialization to zero if there are additional initializations of some
> fields to non-zero values.  And again, it should be a lot easier to
> require the smp_store_release() or whatever uniformly than only in cases
> where additional initialization occurred.

This one is less murky as we can say that the cpu which allocated owns
the zeroing; however, it still deviates from requiring the one which
makes changes to take care of barriering for those changes, which is
what makes me feel a bit uneasy.  IOW, it's the allocator which
cleared the memory, so why should its users worry about in-flight
operations from it?  That said, this poses a lot less issues compared
to percpu ones as passing normal pointers to other cpus w/o going
through proper set of barriers is a special thing to do anyway.

Thanks.

-- 
tejun


* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 15:27       ` Tejun Heo
@ 2014-06-17 15:56         ` Christoph Lameter
  2014-06-17 16:00           ` Tejun Heo
  2014-06-17 16:57         ` Paul E. McKenney
  2014-07-09  0:55         ` Rusty Russell
  2 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-06-17 15:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel, Rusty Russell

On Tue, 17 Jun 2014, Tejun Heo wrote:


> * Add data dependency barrier to percpu accessors and write barrier to
>   the allocator with the comment that this will be replaced with
>   proper assignment macros and mark this change w/ -stable.
>
> * Later, introduce percpu pointer assignment macros and convert all
>   users and remove the wmb added above.

Uhhh no. The percpu stuff and the associated per cpu atomics are to be
used for stuff that is per cpu specific and runs at the fastest speed
doable at that level. Introducing implicit barriers is not that good an
idea.

The concurrency guarantees for the per cpu operations are related to being
interrupted or rescheduled but not for accesses from other processors.

Cpus maintain at least the appearance of operations being visible in
sequence for code running on the same processor. Therefore no barriers are
needed.



* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 15:56         ` Christoph Lameter
@ 2014-06-17 16:00           ` Tejun Heo
  2014-06-17 16:05             ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2014-06-17 16:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel, Rusty Russell

Hello, Christoph.

On Tue, Jun 17, 2014 at 10:56:10AM -0500, Christoph Lameter wrote:
> Uhhh no. The percpu stuff and the associated per cpu atomics are to be
> used for stuff that is per cpu specific and runs at the fastest speed
> doable at that level. Introducing implicit barriers is not that good an
> idea.

Hmmm?  Read barriers are noops on all archs except for alpha and
percpu pointer assignments aren't exactly a high frequency operation
and I'm pretty sure we'll end up with a raw variant anyway.

> The concurrency guarantees for the per cpu operations are related to being
> interrupted or rescheduled but not for accesses from other processors.
> 
> Cpus maintain at least the appearance of operations being visible in
> sequence for code running on the same processor. Therefore no barriers are
> needed.

Huh?  We're talking about percpu *pointer* assignments not assignments
to percpu areas.

Thanks.

-- 
tejun


* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 16:00           ` Tejun Heo
@ 2014-06-17 16:05             ` Tejun Heo
  2014-06-17 16:28               ` Christoph Lameter
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2014-06-17 16:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel, Rusty Russell

On Tue, Jun 17, 2014 at 12:00:40PM -0400, Tejun Heo wrote:
> Hello, Christoph.
> 
> On Tue, Jun 17, 2014 at 10:56:10AM -0500, Christoph Lameter wrote:
> > Uhhh no. The percpu stuff and the associated per cpu atomics are to be
> > used for stuff that is per cpu specific and runs at the fastest speed
> > doable at that level. Introducing implicit barriers is not that good an
> > idea.
> 
> Hmmm?  Read barriers are noops on all archs except for alpha and

Oops, data dependency barriers, not read barrier.

> percpu pointer assignments aren't exactly a high frequency operation
> and I'm pretty sure we'll end up with a raw variant anyway.

-- 
tejun


* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 16:05             ` Tejun Heo
@ 2014-06-17 16:28               ` Christoph Lameter
       [not found]                 ` <CA+55aFxHr8JXwDR-4g4z1mkXvZRtY=OosYcUMPZRD2upfooS1w@mail.gmail.com>
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-06-17 16:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel, Rusty Russell

On Tue, 17 Jun 2014, Tejun Heo wrote:

> On Tue, Jun 17, 2014 at 12:00:40PM -0400, Tejun Heo wrote:
> > Hello, Christoph.
> >
> > On Tue, Jun 17, 2014 at 10:56:10AM -0500, Christoph Lameter wrote:
> > > Uhhh no. The percpu stuff and the associated per cpu atomics are to be
> > > used for stuff that is per cpu specific and runs at the fastest speed
> > > doable at that level. Introducing implicit barriers is not that good an
> > > idea.
> >
> > Hmmm?  Read barriers are noops on all archs except for alpha and
>
> Oops, data dependency barriers, not read barrier.

Even alpha maintains the illusion of changes becoming visible in the
proper order for the currently executing thread. No barriers are needed.




* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 15:27       ` Tejun Heo
  2014-06-17 15:56         ` Christoph Lameter
@ 2014-06-17 16:57         ` Paul E. McKenney
  2014-06-17 18:56           ` Tejun Heo
  2014-07-09  0:55         ` Rusty Russell
  2 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2014-06-17 16:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel, Rusty Russell

On Tue, Jun 17, 2014 at 11:27:52AM -0400, Tejun Heo wrote:
> Hello, Paul.
> 
> On Tue, Jun 17, 2014 at 07:41:51AM -0700, Paul E. McKenney wrote:
> > > Well, we can always say that it's the caller's responsibility to put
> > > in enough barriers, but it becomes weird for percpu areas because the
> > > ownership of the allocated areas isn't really clear.  Whether each
> > > allocated cpu-specific area's queued writes belong to the allocating
> > > cpu or the cpu that the area is associated with is an implementation
> > > detail which may theoretically change and as such I think it'd be best
> > > for the problem to be dealt with in percpu allocator and accessors
> > > proper.  I think this is way too subtle to ask the users to handle
> > > correctly.
> > 
> > But in the case where the allocation is initially zeroed, then a few
> > fields are initialized to non-zero values, the memory barrier in the
> > allocator is insufficient.  I believe that it will be easier for people
> 
> Oh, definitely, but in those cases it's clear whose responsibility it
> is.  The task/cpu which modified the memory owns the in-flight
> modifications and is responsible for propagation.  The problem with
> zeroed allocation, especially for percpu areas, is that it isn't clear
> who owns the zeroing.  Again, there's no reason the zeroing can't
> belong to the each local cpu, which could have performance benefits if
> we ever get frequent enough with percpu allocations.  To me, this
> really seems like an implementation detail which shouldn't leak to the
> users of the allocator.

Good point about the possibility that CPUs might initialize their own
dynamically allocated per-CPU areas as needed!  And yes, in that case,
we would need some way of communicating what had and had not yet been
zeroed from the initializing CPUs to the allocating CPU, and that would
require memory barriers to sort things out.

That said, my guess is that the initialization would happen in rather
large chunks, otherwise the allocating CPU might frequently need to
wait on the other CPUs to do the initialization, which doesn't sound
like the way to performance and scalability.  So my guess is that the
common case looks something like this:

	CPU 0				CPU 1

	allocate/initialize big chunk
	smp_mb()
	update metadata accordingly

					load metadata
					if change from last time {
						update local copy of metadata
						smp_mb()
					}
					if (need more)
						ask all for more memory
					initialize non-zero pieces if needed
					smp_store_release()

I am not sure that this is actually better than having the allocating
CPU do the initialization, though, even given a large number of CPUs.
Or maybe even especially given a large number of CPUs.  The bit about
asking all the other CPUs to do their allocations and initializations
and then waiting for them to respond might not be pretty.
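The release/acquire pairing in the sketch above can be modeled in userspace with C11 atomics. This is only an illustrative stand-in: the struct and function names are invented here, `memory_order_release` models smp_store_release() on the publishing side, and the acquire load models the read-side dependency ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct chunk {
	int data[64];		/* the zeroed "percpu" region */
};

static _Atomic(struct chunk *) published_chunk;

/* Allocating side: zero the chunk, then publish the pointer with
 * release ordering so the zeroing is ordered before the pointer. */
static void publish_chunk(void)
{
	struct chunk *c = calloc(1, sizeof(*c));

	atomic_store_explicit(&published_chunk, c, memory_order_release);
}

/* Using side: an acquire load pairs with the release store, so a
 * reader that sees a non-NULL pointer also sees the zeroed contents. */
static int read_chunk_slot(int slot)
{
	struct chunk *c = atomic_load_explicit(&published_chunk,
					       memory_order_acquire);

	return c ? c->data[slot] : -1;
}
```

Without the release/acquire pair, a reader could observe the pointer before the zeroing, which is exactly the WARN_ON scenario from the patch description.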

> > to always use smp_store_release() or whatever than to not need it if
> > the initialization is strictly zero, but to need it if additional
> > initialization is required.
> 
> That goes further and would require dedicated assignment macros for
> percpu pointers, a la rcu_assign_pointer().  Yeah, I think that would
> be helpful.  It'd take quite a bit of work but percpu allocator usage
> is still in low hundreds, so it should be doable.  The whole thing is
> way too subtle to require each user to handle it manually.
> 
> So, how about the followings?
> 
> * Add data dependency barrier to percpu accessors and write barrier to
>   the allocator with the comment that this will be replaced with
>   proper assignment macros and mark this change w/ -stable.
> 
> * Later, introduce percpu pointer assignment macros and convert all
>   users and remove the wmb added above.

This longer-term approach seems like a good one to me.
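The assignment macros proposed above would pair like rcu_assign_pointer()/rcu_dereference(). A hypothetical userspace model (the macro names are invented for illustration, not existing kernel APIs; C11 release/acquire stand in for the kernel barriers):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical assignment side: release ordering makes the allocator's
 * zeroing visible before the percpu pointer itself. */
#define percpu_assign_pointer(p, v) \
	atomic_store_explicit(&(p), (v), memory_order_release)

/* Hypothetical dereference side: acquire replaces the
 * smp_read_barrier_depends() in __verify_pcpu_ptr(). */
#define percpu_dereference(p) \
	atomic_load_explicit(&(p), memory_order_acquire)

static int slot;			/* stands in for a percpu area */
static _Atomic(int *) pcpu_ptr;		/* stands in for the percpu pointer */

static void setup(void)
{
	slot = 0;			/* "zeroed on allocation" */
	percpu_assign_pointer(pcpu_ptr, &slot);
}

static int check(void)
{
	int *p = percpu_dereference(pcpu_ptr);

	return p ? *p : -1;		/* never garbage once p is seen */
}
```

With such macros, the interim smp_wmb() in the allocator and the barrier in the accessors would collapse into the assignment and dereference sites, as the two-step plan describes.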

> > > > I was worried about use of per_cpu() by the reading CPU, but of course
> > > > in that case, things are initialized at compiler time.
> > > 
> > > Not really.  The template is initialized on kernel load.  The actual
> > > percpu areas have to be allocated dynamically; however, this happens
> > > while the machine is still running UP, so this should be okay.
> > 
> > Good point.  How about per-CPU variables that are introduced by
> > loadable modules?  (I would guess that there are plenty of memory
> > barriers in the load process, given that text and data also needs
> > to be visible to other CPUs.)
> 
> (cc'ing Rusty, hi!)
> 
> Percpu initialization happens in post_relocation() before
> module_finalize().  There seem to be enough operations which can act
> as write barriers afterwards, but nothing seems explicit.
> 
> I have no idea how we're guaranteeing that .data is visible to all
> cpus without a barrier on the reader side.  Maybe we don't allow something
> like the following?
> 
>   module init				built-in code
> 
>   static int mod_static_var = X;	if (builtin_ptr)
>   builtin_ptr = &mod_static_var;		WARN_ON(*builtin_ptr != X);
> 
> Rusty, can you please enlighten me?
> 
> > > For normal allocations, I don't have a strong opinion.  I'd prefer to
> > > think of memory coming out of the allocator to have memory ordering
> > > already taken care of but it is a lot less murky than with percpu
> > > areas.
> > 
> > Again, it won't help for the allocator to strongly order the
> > initialization to zero if there are additional initializations of some
> > fields to non-zero values.  And again, it should be a lot easier to
> > require the smp_store_release() or whatever uniformly than only in cases
> > where additional initialization occurred.
> 
> This one is less murky as we can say that the cpu which allocated owns
> the zeroing; however, it still deviates from requiring the one which
> makes changes to take care of barriering for those changes, which is
> what makes me feel a bit uneasy.  IOW, it's the allocator which
> cleared the memory, why should its users worry about in-flight
> operations from it?  That said, this poses a lot less issues compared
> to percpu ones as passing normal pointers to other cpus w/o going
> through proper set of barriers is a special thing to do anyway.

I much prefer the model where the thing that -published- the pointer is
responsible for memory ordering.  After all, if a task allocates some
zeroed memory, uses it locally, then frees it, there is no ordering
to worry about in the first place.  The memory allocator doing the
initialization cannot tell how the memory is to be used, after all.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
       [not found]                 ` <CA+55aFxHr8JXwDR-4g4z1mkXvZRtY=OosYcUMPZRD2upfooS1w@mail.gmail.com>
@ 2014-06-17 18:47                   ` Christoph Lameter
  2014-06-17 18:55                     ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-06-17 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Rusty Russell, Andrew Morton, Tejun Heo,
	David Howells, linux-kernel, Oleg Nesterov


On Tue, 17 Jun 2014, Linus Torvalds wrote:

> On Jun 17, 2014 6:28 AM, "Christoph Lameter" <cl@gentwo.org> wrote:
> >
> > Even alpha maintains the illusion of changes becoming visible in the
> > proper order for the currently executing thread. No barriers are needed.
>
> No Christoph, alpha really doesn't. Barriers needed.

Everything breaks if a single hardware execution thread can no longer
observe a consistent view of the data it modifies.

We are not talking about synchronization here between multiple hardware
threads. This is simply code running on a single hardware thread and all
architectures maintain the uniprocessor illusion there.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 18:47                   ` Christoph Lameter
@ 2014-06-17 18:55                     ` Paul E. McKenney
  2014-06-17 19:39                       ` Christoph Lameter
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2014-06-17 18:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Rusty Russell, Andrew Morton, Tejun Heo,
	David Howells, linux-kernel, Oleg Nesterov

On Tue, Jun 17, 2014 at 01:47:01PM -0500, Christoph Lameter wrote:
> 
> On Tue, 17 Jun 2014, Linus Torvalds wrote:
> 
> > On Jun 17, 2014 6:28 AM, "Christoph Lameter" <cl@gentwo.org> wrote:
> > >
> > > Even alpha maintains the illusion of changes becoming visible in the
> > > proper order for the currently executing thread. No barriers are needed.
> >
> > No Christoph, alpha really doesn't. Barriers needed.
> 
> Everything breaks if a single hardware execution thread can no longer
> observe a consistent view of the data it modifies.
> 
> We are not talking about synchronization here between multiple hardware
> threads. This is simply code running on a single hardware thread and all
> architectures maintain the uniprocessor illusion there.

We are talking about one CPU initializing all CPUs' portions of
dynamically allocated per-CPU memory, so there really is more than
one CPU involved.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 16:57         ` Paul E. McKenney
@ 2014-06-17 18:56           ` Tejun Heo
  2014-06-17 19:42             ` Christoph Lameter
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2014-06-17 18:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christoph Lameter, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel, Rusty Russell

Hello, Paul.

On Tue, Jun 17, 2014 at 09:57:02AM -0700, Paul E. McKenney wrote:
> I am not sure that this is actually better than having the allocating
> CPU do the initialization, though, even given a large number of CPUs.
> Or maybe even especially given a large number of CPUs.  The bit about
> asking all the other CPUs to do their allocations and initializations
> and then waiting for them to respond might not be pretty.

If it ever happens, it'd probably be the allocator trying to keep
certain amount ready as allocation buffer with buffer refill
synchronization happening lazily in large chunks.  We don't need
anything like that at this point.  I was trying to point out that the
semantics isn't immediately clear.

> > * Add data dependency barrier to percpu accessors and write barrier to
> >   the allocator with the comment that this will be replaced with
> >   proper assignment macros and mark this change w/ -stable.
> > 
> > * Later, introduce percpu pointer assignment macros and convert all
> >   users and remove the wmb added above.
> 
> This longer-term approach seems like a good one to me.

Alright, going that way then.  This will bring percpu usages very
close to RCU, which I like.

> I much prefer the model where the thing that -published- the pointer is
> responsible for memory ordering.  After all, if a task allocates some
> zeroed memory, uses it locally, then frees it, there is no ordering
> to worry about in the first place.  The memory allocator doing the
> initialization cannot tell how the memory is to be used, after all.

Yeah, "publish" is a nice verb to put on it.  No objection.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-12 13:56 [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations Tejun Heo
  2014-06-12 15:34 ` Paul E. McKenney
@ 2014-06-17 19:27 ` Christoph Lameter
  2014-06-17 19:40   ` Paul E. McKenney
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-06-17 19:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Howells, Paul E. McKenney, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Thu, 12 Jun 2014, Tejun Heo wrote:

> percpu areas are zeroed on allocation and, by its nature, accessed
> from multiple cpus.  Consider the following scenario.

I am not sure that the premise is actually right. Percpu areas are
designed to be accessed from a single cpu and we provide instances
of variables for each cpu.

There is no synchronization guarantee for accesses from other cpus. If
these accesses occur then we tolerate some fuzziness and usually only do
read accesses. F.e. for statistics if we loop over all cpus to get a sum
of percpu counters (which is a classic use case for percpu data).

But there are numerous uses where no accesses from other cpus are required
(mostly when percpu stuff is not used for statistics but for cpu local
lists and status).

Cross cpu write accesses typically occur only after the allocation and
before the code that actually does something is aware of the existence of
the percpu area allocated, or if the processor is being offlined/onlined.

>  p = NULL;
>
> 	CPU-1				CPU-2
>  p = alloc_percpu()		if (p)
> 					WARN_ON(this_cpu_read(*p));

p is an offset into the per cpu area of the processor. The value of p
first has to be made available to cpu2 somehow and this usually provides
the opportunity for synchronization that avoids the above scenario.

And so it is typical that these offsets are stored in larger structs that
also have other means of synchronization.

F.e. Allocators take a global lock and then instantiate a new
structure with the associated per cpu area allocation which is added to a
global list after it is ready. The address of the allocator structure
is then made available to other processors.

Another method is to perform this allocation on bootup which then also
does not require synchronization (page allocator).

Similar in swapon(). The percpu allocation is performed before access to
the containing structure (via enable_swap_info).


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 18:55                     ` Paul E. McKenney
@ 2014-06-17 19:39                       ` Christoph Lameter
  2014-06-17 19:47                         ` Tejun Heo
  2014-06-17 19:56                         ` Paul E. McKenney
  0 siblings, 2 replies; 43+ messages in thread
From: Christoph Lameter @ 2014-06-17 19:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Rusty Russell, Andrew Morton, Tejun Heo,
	David Howells, linux-kernel, Oleg Nesterov

On Tue, 17 Jun 2014, Paul E. McKenney wrote:

> We are talking about one CPU initializing all CPUs' portions of
> dynamically allocated per-CPU memory, so there really is more than
> one CPU involved.

Well that only occurs on initialization before the address of the
struct that contains the offset is available to other processors.

During operation the percpu area functions like a single processor. And
its designed that way to avoid synchronization issues and take full
advantage of *no* synchronization for full speed. We compromise on that
for statistics but that is only read access.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 19:27 ` Christoph Lameter
@ 2014-06-17 19:40   ` Paul E. McKenney
  2014-06-19 20:42     ` Christoph Lameter
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2014-06-17 19:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Tue, Jun 17, 2014 at 02:27:43PM -0500, Christoph Lameter wrote:
> On Thu, 12 Jun 2014, Tejun Heo wrote:
> 
> > percpu areas are zeroed on allocation and, by its nature, accessed
> > from multiple cpus.  Consider the following scenario.
> 
> I am not sure that the premise is actually right. Percpu areas are
> designed to be accessed from a single cpu and we provide instances
> of variables for each cpu.
> 
> There is no synchronization guarantee for accesses from other cpus. If
> these accesses occur then we tolerate some fuzziness and usually only do
> read accesses. F.e. for statistics if we loop over all cpus to get a sum
> of percpu counters (which is a classic use case for percpu data).
> 
> But there are numerous uses where no accesses from other cpus are required
> (mostly when percpu stuff is not used for statistics but for cpu local
> lists and status).
> 
> Cross cpu write accesses typically occur only after the allocation and
> before the code that actually does something is aware of the existence of
> the percpu area allocated, or if the processor is being offlined/onlined.
> 
> >  p = NULL;
> >
> > 	CPU-1				CPU-2
> >  p = alloc_percpu()		if (p)
> > 					WARN_ON(this_cpu_read(*p));
> 
> p is an offset into the per cpu area of the processor. The value of p
> first has to be made available to cpu2 somehow and this usually provides
> the opportunity for synchronization that avoids the above scenario.
> 
> And so it is typical that these offsets are stored in larger structs that
> also have other means of synchronization.
> 
> F.e. Allocators take a global lock and then instantiate a new
> structure with the associated per cpu area allocation which is added to a
> global list after it is ready. The address of the allocator structure
> is then made available to other processors.
> 
> Another method is to perform this allocation on bootup which then also
> does not require synchronization (page allocator).
> 
> Similar in swapon(). The percpu allocation is performed before access to
> the containing structure (via enable_swap_info).

Those are indeed common use cases.  However...

There is code where one CPU writes to another CPU's per-CPU variables.
One example is RCU callback offloading, where a kernel thread (which
might be running anywhere) dequeues a given CPU's RCU callbacks and
processes them.  The act of dequeuing requires write access to that
CPU's per-CPU rcu_data structure.  And yes, atomic operations and memory
barriers are of course required to make this work.
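The remote dequeue in the offloading example above can be modeled in userspace C11 atomics. This is a sketch, not the kernel's rcu_data layout: the owning CPU pushes callbacks with release ordering, and the offload thread detaches the whole list with an acquire exchange, which pairs with the pushes so every enqueued callback is fully visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct cb {
	struct cb *next;
};

static _Atomic(struct cb *) cblist;	/* one CPU's pending callbacks */

/* Enqueue runs on the owning CPU; release ordering makes the
 * callback's contents visible to whoever later detaches the list. */
static void enqueue_cb(struct cb *cb)
{
	cb->next = atomic_load_explicit(&cblist, memory_order_relaxed);
	while (!atomic_compare_exchange_weak_explicit(&cblist, &cb->next, cb,
						      memory_order_release,
						      memory_order_relaxed))
		;	/* retry with the updated head left in cb->next */
}

/* The offload thread, running on any CPU, detaches the whole list;
 * acquire pairs with the enqueuer's release. */
static struct cb *dequeue_all(void)
{
	return atomic_exchange_explicit(&cblist, NULL, memory_order_acquire);
}
```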

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 18:56           ` Tejun Heo
@ 2014-06-17 19:42             ` Christoph Lameter
  2014-06-17 20:44               ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-06-17 19:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel, Rusty Russell

> > I much prefer the model where the thing that -published- the pointer is
> > responsible for memory ordering.  After all, if a task allocates some
> > zeroed memory, uses it locally, then frees it, there is no ordering
> > to worry about in the first place.  The memory allocator doing the
> > initialization cannot tell how the memory is to be used, after all.
>
> Yeah, "publish" is a nice verb to put on it.  No objection.

Well that "publishing" of the structure that contains the per cpu offset
is actually what most of the current code does. So we do not need any
additional synchronization.  Clarifying that the responsibility for
synchronization lies with the code which does the alloc_percpu would be
good.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 19:39                       ` Christoph Lameter
@ 2014-06-17 19:47                         ` Tejun Heo
  2014-06-17 19:56                         ` Paul E. McKenney
  1 sibling, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-06-17 19:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Linus Torvalds, Rusty Russell, Andrew Morton,
	David Howells, linux-kernel, Oleg Nesterov

On Tue, Jun 17, 2014 at 02:39:46PM -0500, Christoph Lameter wrote:
> On Tue, 17 Jun 2014, Paul E. McKenney wrote:
> 
> > We are talking about one CPU initializing all CPUs' portions of
> > dynamically allocated per-CPU memory, so there really is more than
> > one CPU involved.
> 
> Well that only occurs on initialization before the address of the
> struct that contains the offset is available to other processors.

So does assignment of the pointer in most cases.

> During operation the percpu area functions like a single processor. And
> its designed that way to avoid synchronization issues and take full
> advantage of *no* synchronization for full speed. We compromise on that
> for statistics but that is only read access.

And during normal operation, it doesn't make any difference for
anything except alpha.  It's about making the initialization part
easier / safer.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 19:39                       ` Christoph Lameter
  2014-06-17 19:47                         ` Tejun Heo
@ 2014-06-17 19:56                         ` Paul E. McKenney
  2014-06-19 20:39                           ` Christoph Lameter
  1 sibling, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2014-06-17 19:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Rusty Russell, Andrew Morton, Tejun Heo,
	David Howells, linux-kernel, Oleg Nesterov

On Tue, Jun 17, 2014 at 02:39:46PM -0500, Christoph Lameter wrote:
> On Tue, 17 Jun 2014, Paul E. McKenney wrote:
> 
> > We are talking about one CPU initializing all CPUs' portions of
> > dynamically allocated per-CPU memory, so there really is more than
> > one CPU involved.
> 
> Well that only occurs on initialization before the address of the
> struct that contains the offset is available to other processors.

Given runtime dynamic allocation of per-CPU memory, you still need
proper synchronization.  And yes, on non-Alpha CPUs, the dependency
ordering through any pointer suffices on the use side.  The thing
doing allocation and initialization will still need memory barriers,
of course.

> During operation the percpu area functions like a single processor. And
> its designed that way to avoid synchronization issues and take full
> advantage of *no* synchronization for full speed. We compromise on that
> for statistics but that is only read access.

During operation that does not involve cross-CPU accesses, agreed.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 19:42             ` Christoph Lameter
@ 2014-06-17 20:44               ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-06-17 20:44 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel, Rusty Russell

Hello, Christoph.

On Tue, Jun 17, 2014 at 02:42:48PM -0500, Christoph Lameter wrote:
> > > I much prefer the model where the thing that -published- the pointer is
> > > responsible for memory ordering.  After all, if a task allocates some
> > > zeroed memory, uses it locally, then frees it, there is no ordering
> > > to worry about in the first place.  The memory allocator doing the
> > > initialization cannot tell how the memory is to be used, after all.
> >
> > Yeah, "publish" is a nice verb to put on it.  No objection.
> 
> Well that "publishing" of the structure that contains the per cpu offset
> is actually what most of the current code does. So we do not need any
> additional synchronization. Clarifying the responsibility for
> synchronization being with the code which does the alloc_percpu would be
> good.

Sure, we can declare that it follows the same rules as normal memory
allocations - that is, the zeroing belongs to the allocating cpu and
propagation is fully the responsibility of users, including data dep
barrier when necessary.

It seems a lot more confusing to me than normal memory allocations
mostly because the ownership isn't immediately clear; however, it
could be that we just need to clarify.

I'll think more about it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 19:56                         ` Paul E. McKenney
@ 2014-06-19 20:39                           ` Christoph Lameter
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Lameter @ 2014-06-19 20:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Rusty Russell, Andrew Morton, Tejun Heo,
	David Howells, linux-kernel, Oleg Nesterov

On Tue, 17 Jun 2014, Paul E. McKenney wrote:

> During operation that does not involve cross-CPU accesses, agreed.

Well that is the typical operation mode for per cpu data and that is what
it was designed for.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 19:40   ` Paul E. McKenney
@ 2014-06-19 20:42     ` Christoph Lameter
  2014-06-19 20:46       ` Tejun Heo
  2014-06-19 20:51       ` Paul E. McKenney
  0 siblings, 2 replies; 43+ messages in thread
From: Christoph Lameter @ 2014-06-19 20:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Tejun Heo, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Tue, 17 Jun 2014, Paul E. McKenney wrote:

> > Similar in swapon(). The percpu allocation is performed before access to
> > the containing structure (via enable_swap_info).
>
> Those are indeed common use cases.  However...
>
> There is code where one CPU writes to another CPU's per-CPU variables.
> One example is RCU callback offloading, where a kernel thread (which
> might be running anywhere) dequeues a given CPU's RCU callbacks and
> processes them.  The act of dequeuing requires write access to that
> CPU's per-CPU rcu_data structure.  And yes, atomic operations and memory
> barriers are of course required to make this work.

In that case special care needs to be taken to get this right. True.

I typically avoid these scenarios by sending an IPI with a pointer to the
data structure. The modification is done by the cpu for which the per cpu
data is local.
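The IPI approach described above can be modeled in plain C. In the kernel this would use something like smp_call_function_single(); everything below is an illustrative stand-in where the callback simply runs "on" the owning cpu, so only the owner ever writes its per-cpu slot.

```c
#include <assert.h>

#define NR_CPUS 4

/* Per-cpu data; by convention only the owning "cpu" writes it. */
static int pcpu_counter[NR_CPUS];

struct ipi_request {
	void (*func)(int cpu, int arg);	/* runs on the target cpu */
	int arg;
};

static void add_to_counter(int cpu, int arg)
{
	pcpu_counter[cpu] += arg;	/* local write on the owner */
}

/* Models delivering the IPI: the target cpu executes the callback
 * against its own per-cpu data, so no cross-CPU write ordering is
 * ever needed on the data itself. */
static void ipi_call(int target_cpu, struct ipi_request *req)
{
	req->func(target_cpu, req->arg);
}
```

The trade-off debated in the following messages is that a real IPI is far more expensive than a barrier, which is why this design only pays off when remote writes are exceedingly rare.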

Maybe rewriting the code to avoid writing to other processors' percpu data
would be the right approach?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-19 20:42     ` Christoph Lameter
@ 2014-06-19 20:46       ` Tejun Heo
  2014-06-19 21:11         ` Christoph Lameter
  2014-06-19 20:51       ` Paul E. McKenney
  1 sibling, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2014-06-19 20:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Thu, Jun 19, 2014 at 03:42:07PM -0500, Christoph Lameter wrote:
> In that case special care needs to be taken to get this right. True.
> 
> I typically avoid these scenarios by sending an IPI with a pointer to the
> data structure. The modification is done by the cpu for which the per cpu
> data is local.
> 
> Maybe rewriting the code to avoid writing to other processors' percpu data
> would be the right approach?

It depends on the specific use case but in general no.  IPIs would be
far more expensive than making use of proper barriers in the vast
majority of cases, especially when the "hot" side is a data dependency
barrier, IOW, nothing.  Also, we are talking about extremely low-frequency
events like init and recycling after reinit.  Regular per-cpu
operation isn't really the subject here.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-19 20:42     ` Christoph Lameter
  2014-06-19 20:46       ` Tejun Heo
@ 2014-06-19 20:51       ` Paul E. McKenney
  2014-06-20 15:29         ` Christoph Lameter
  1 sibling, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2014-06-19 20:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Thu, Jun 19, 2014 at 03:42:07PM -0500, Christoph Lameter wrote:
> On Tue, 17 Jun 2014, Paul E. McKenney wrote:
> 
> > > Similar in swapon(). The percpu allocation is performed before access to
> > > the containing structure (via enable_swap_info).
> >
> > Those are indeed common use cases.  However...
> >
> > There is code where one CPU writes to another CPU's per-CPU variables.
> > One example is RCU callback offloading, where a kernel thread (which
> > might be running anywhere) dequeues a given CPU's RCU callbacks and
> > processes them.  The act of dequeuing requires write access to that
> > CPU's per-CPU rcu_data structure.  And yes, atomic operations and memory
> > barriers are of course required to make this work.
> 
> In that case special care needs to be taken to get this right. True.
> 
> I typically avoid these scenarios by sending an IPI with a pointer to the
> data structure. The modification is done by the cpu for which the per cpu
> data is local.
> 
> Maybe rewriting the code to avoid writing to other processors' percpu data
> would be the right approach?

Or just keep doing what I am doing.  What exactly is the problem with it?
(Other than probably needing to clean up the cache alignment of some
of the per-CPU structures?)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-19 20:46       ` Tejun Heo
@ 2014-06-19 21:11         ` Christoph Lameter
  2014-06-19 21:15           ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-06-19 21:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Thu, 19 Jun 2014, Tejun Heo wrote:

> On Thu, Jun 19, 2014 at 03:42:07PM -0500, Christoph Lameter wrote:
> > In that case special care needs to be taken to get this right. True.
> >
> > I typically avoid these scenarios by sending an IPI with a pointer to the
> > data structure. The modification is done by the cpu for which the per cpu
> > data is local.
> >
> > Maybe rewriting the code to avoid writing to other processors' percpu data
> > would be the right approach?
>
> It depends on the specific use case but in general no.  IPIs would be
> far more expensive than making use of proper barriers in the vast
> majority of cases, especially when the "hot" side is a data dependency
> barrier, IOW, nothing.  Also, we are talking about extremely low-frequency
> events like init and recycling after reinit.  Regular per-cpu
> operation isn't really the subject here.

The aim of having percpu data is to have the ability for a processor to
access memory dedicated to that processor in the fastest way possible by
avoiding synchronization. You are beginning to add synchronization
elements into the accesses of a processor to memory dedicated to its sole
use.

Remote write events are contrary to that design and are exceedingly rare.
An IPI is justifiable for such a rare event. At least in my use cases I
have always found that to be sufficient. Well, I designed the data
structures in a way that made this possible because of the design criteria
that did not allow me remote write access to other processors per cpu
data.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-19 21:11         ` Christoph Lameter
@ 2014-06-19 21:15           ` Tejun Heo
  2014-06-20 15:23             ` Christoph Lameter
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2014-06-19 21:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Thu, Jun 19, 2014 at 04:11:47PM -0500, Christoph Lameter wrote:
> The aim of having percpu data is to have the ability for a processor to
> access memory dedicated to that processor in the fastest way possible by
> avoiding synchronization. You are beginning to add synchronization
> elements into the accesses of a processor to memory dedicated to its sole
> use.

Again, the data dependency barrier is a no-op on all in-use archs.

> Remote write events are contrary to that design and are exceedingly rare.
> An IPI is justifiable for such a rare event. At least in my use cases I
> have always found that to be sufficient. Well, I designed the data
> structures in a way that made this possible because of the design criteria
> that did not allow me remote write access to other processors per cpu
> data.

You're repeatedly getting sidetracked in the discussion.  What are you
suggesting?  Sending IPIs on each percpu allocation?

Again, I'm leaning towards just clarifying the init write ownership to
the allocating CPU as that seems the most straightforward way to deal
with it, but please stop bringing up the raw performance thing.  Nobody
is doing anything to that.  It's not relevant in the discussion.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-19 21:15           ` Tejun Heo
@ 2014-06-20 15:23             ` Christoph Lameter
  2014-06-20 15:52               ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-06-20 15:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Thu, 19 Jun 2014, Tejun Heo wrote:

> Again, the data dependency barrier is a no-op on all in-use archs.

A barrier limits what the compiler can do.

> > Remote write events are contrary to that design and are exceedingly rare.
> > An IPI is justifiable for such a rare event. At least in my use cases I
> > have always found that to be sufficient. Well, I designed the data
> > structures in a way that made this possible because of the design criteria
> > that did not allow me remote write access to other processors per cpu
> > data.
>
> > You're repeatedly getting sidetracked in the discussion.  What are you
> > suggesting?  Sending IPIs on each percpu allocation?

No, this is about sending an IPI if you want to modify the percpu data of
another processor. There was a mention of code that modifies the per cpu
data of another processor?

> Again, I'm leaning towards just clarifying the init write ownership to
> the allocating CPU as that seems the most straightforward way to deal
> with it, but please stop bringing up the raw performance thing.  Nobody
> is doing anything to that.  It's not relevant to the discussion.

Ok sounds good.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-19 20:51       ` Paul E. McKenney
@ 2014-06-20 15:29         ` Christoph Lameter
  2014-06-20 15:50           ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-06-20 15:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Tejun Heo, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Thu, 19 Jun 2014, Paul E. McKenney wrote:

> Or just keep doing what I am doing.  What exactly is the problem with it?
> (Other than probably needing to clean up the cache alignment of some
> of the per-CPU structures?)

Writing to a cacheline of another processor can impact performance of that
other processor since the cacheline (which may contain other performance
critical data) is evicted from that processor's cache.

The mechanisms for handling percpu data are not designed with the
consideration of writes into foreign percpu data areas in mind. Surprises
may result from such use.

In particular I see a danger in the understanding of what "atomic" percpu
operations are. These are not to be confused with regular atomic ops.
Percpu atomics are atomic only for accesses that occur on a single specific
hardware thread. Percpu "atomics" are atomic vs. interrupts or preemption
occurring on that specific processor. No serialization is provided for
accesses, be they reads or writes, from foreign processors.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-20 15:29         ` Christoph Lameter
@ 2014-06-20 15:50           ` Paul E. McKenney
  0 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2014-06-20 15:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Fri, Jun 20, 2014 at 10:29:04AM -0500, Christoph Lameter wrote:
> On Thu, 19 Jun 2014, Paul E. McKenney wrote:
> 
> > Or just keep doing what I am doing.  What exactly is the problem with it?
> > (Other than probably needing to clean up the cache alignment of some
> > of the per-CPU structures?)
> 
> Writing to a cacheline of another processor can impact performance of that
> other processor since the cacheline (which may contain other performance
> critical data) is evicted from that processor's cache.

I believe that most of the people on this thread already understand this,
and that most of them also understand the use of alignment directives
to avoid false-sharing issues.

> The mechanisms for handling percpu data are not designed with the
> consideration of writes into foreign percpu data areas in mind. Surprises
> may result from such use.
> 
> In particular I see a danger in the understanding of what "atomic" percpu
> operations are. These are not to be confused with regular atomic ops.
> Percpu atomics are atomic only for accesses that occur on a single specific
> hardware thread. Percpu "atomics" are atomic vs. interrupts or preemption
> occurring on that specific processor. No serialization is provided for
> accesses, be they reads or writes, from foreign processors.

It sounds like you are thinking strictly in terms of machine-word
sized and aligned per-CPU data.  Many of the cross-CPU accesses are
to structs placed into per-CPU data.  You are not thinking in terms
of having all of the per-CPU data mapped to the same virtual address,
so that CPUs simply cannot access each others' per-CPU data, are you?
That would result in a re-proliferation of NR_CPUS-element arrays.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-20 15:23             ` Christoph Lameter
@ 2014-06-20 15:52               ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2014-06-20 15:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

On Fri, Jun 20, 2014 at 10:23:57AM -0500, Christoph Lameter wrote:
> > You're repeatedly getting sidetracked in the discussion.  What are you
> > suggesting?  Sending IPIs on each percpu allocation?
> 
> No, this is about sending an IPI if you want to modify the percpu data of
> another processor. There was a mention of code that modifies the per cpu
> data of another processor?

Yes, the initialization of percpu areas on allocation.  That was the
*sole* subject of this thread.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-06-17 15:27       ` Tejun Heo
  2014-06-17 15:56         ` Christoph Lameter
  2014-06-17 16:57         ` Paul E. McKenney
@ 2014-07-09  0:55         ` Rusty Russell
  2014-07-14 11:39           ` Paul E. McKenney
  2 siblings, 1 reply; 43+ messages in thread
From: Rusty Russell @ 2014-07-09  0:55 UTC (permalink / raw)
  To: Tejun Heo, Paul E. McKenney
  Cc: Christoph Lameter, David Howells, Linus Torvalds, Andrew Morton,
	Oleg Nesterov, linux-kernel

Tejun Heo <tj@kernel.org> writes:
> Hello, Paul.

Rusty wakes up...

>> Good point.  How about per-CPU variables that are introduced by
>> loadable modules?  (I would guess that there are plenty of memory
>> barriers in the load process, given that text and data also needs
>> to be visible to other CPUs.)
>
> (cc'ing Rusty, hi!)
>
> Percpu initialization happens in post_relocation() before
> module_finalize().  There seem to be enough operations which can act
> as write barrier afterwards but nothing seems explicit.
>
> I have no idea how we're guaranteeing that .data is visible to all
> cpus without barrier from reader side.  Maybe we don't allow something
> like the following?
>
>   module init				built-in code
>
>   static int mod_static_var = X;	if (builtin_ptr)
>   builtin_ptr = &mod_static_var;		WARN_ON(*builtin_ptr != X);
>
> Rusty, can you please enlighten me?

Subtle, but I think in theory (though not in practice) this can happen.

Making this the assigner's responsibility is nasty, since we reasonably
assume that .data is consistent across CPUs once code is executing
(similarly on boot).

>> Again, it won't help for the allocator to strongly order the
>> initialization to zero if there are additional initializations of some
>> fields to non-zero values.  And again, it should be a lot easier to
>> require the smp_store_release() or whatever uniformly than only in cases
>> where additional initialization occurred.
>
> This one is less murky as we can say that the cpu which allocated owns
> the zeroing; however, it still deviates from requiring the one which
> makes changes to take care of barriering for those changes, which is
> what makes me feel a bit uneasy.  IOW, it's the allocator which
> cleared the memory, why should its users worry about in-flight
> operations from it?  That said, this poses a lot less issues compared
> to percpu ones as passing normal pointers to other cpus w/o going
> through proper set of barriers is a special thing to do anyway.

I think that the implicit per-cpu allocations done by modules need to
be consistent once the module is running.

I'm deeply reluctant to advocate it in the other per-cpu cases though.
Once we add a barrier, it's impossible to remove: callers may subtly
rely on the behavior.

"Magic barrier sprinkles" is a bad path to start down, IMHO.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-09  0:55         ` Rusty Russell
@ 2014-07-14 11:39           ` Paul E. McKenney
  2014-07-14 15:22             ` Christoph Lameter
  2014-07-15 11:50             ` Rusty Russell
  0 siblings, 2 replies; 43+ messages in thread
From: Paul E. McKenney @ 2014-07-14 11:39 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Tejun Heo, Christoph Lameter, David Howells, Linus Torvalds,
	Andrew Morton, Oleg Nesterov, linux-kernel

On Wed, Jul 09, 2014 at 10:25:44AM +0930, Rusty Russell wrote:
> Tejun Heo <tj@kernel.org> writes:
> > Hello, Paul.
> 
> Rusty wakes up...

;-)

> >> Good point.  How about per-CPU variables that are introduced by
> >> loadable modules?  (I would guess that there are plenty of memory
> >> barriers in the load process, given that text and data also needs
> >> to be visible to other CPUs.)
> >
> > (cc'ing Rusty, hi!)
> >
> > Percpu initialization happens in post_relocation() before
> > module_finalize().  There seem to be enough operations which can act
> > as write barrier afterwards but nothing seems explicit.
> >
> > I have no idea how we're guaranteeing that .data is visible to all
> > cpus without barrier from reader side.  Maybe we don't allow something
> > like the following?
> >
> >   module init				built-in code
> >
> >   static int mod_static_var = X;	if (builtin_ptr)
> >   builtin_ptr = &mod_static_var;		WARN_ON(*builtin_ptr != X);
> >
> > Rusty, can you please enlighten me?
> 
> Subtle, but I think in theory (though not in practice) this can happen.
> 
> Making this the assigner's responsibility is nasty, since we reasonably
> assume that .data is consistent across CPUs once code is executing
> (similarly on boot).
> 
> >> Again, it won't help for the allocator to strongly order the
> >> initialization to zero if there are additional initializations of some
> >> fields to non-zero values.  And again, it should be a lot easier to
> >> require the smp_store_release() or whatever uniformly than only in cases
> >> where additional initialization occurred.
> >
> > This one is less murky as we can say that the cpu which allocated owns
> > the zeroing; however, it still deviates from requiring the one which
> > makes changes to take care of barriering for those changes, which is
> > what makes me feel a bit uneasy.  IOW, it's the allocator which
> > cleared the memory, why should its users worry about in-flight
> > operations from it?  That said, this poses a lot less issues compared
> > to percpu ones as passing normal pointers to other cpus w/o going
> > through proper set of barriers is a special thing to do anyway.
> 
> I think that the implicit per-cpu allocations done by modules need to
> be consistent once the module is running.
> 
> I'm deeply reluctant to advocate it in the other per-cpu cases though.
> Once we add a barrier, it's impossible to remove: callers may subtly
> rely on the behavior.
> 
> "Magic barrier sprinkles" is a bad path to start down, IMHO.

Here is the sort of thing that I would be concerned about:

	p = alloc_percpu(struct foo);
	for_each_possible_cpu(cpu)
		initialize(per_cpu_ptr(p, cpu));
	gp = p;

We clearly need a memory barrier in there somewhere, and it cannot
be buried in alloc_percpu().  Some cases avoid trouble due to locking,
for example, initialize() might acquire a per-CPU lock and later uses
might acquire that same lock.  Clearly, use of a global lock would not
be helpful from a scalability viewpoint.
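A user-space sketch of the pattern above, with C11 release/acquire standing in for the kernel's smp_wmb()/smp_read_barrier_depends() pairing (the names, and the plain array standing in for the percpu chunk, are purely illustrative):

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 4

struct foo { int initialized; };

static struct foo areas[NR_CPUS];	/* stand-in for the percpu area */
static _Atomic(struct foo *) gp;	/* the published pointer */

/* Writer: initialize every per-CPU instance, then publish with release
 * semantics so the initializing stores cannot become visible after the
 * publication of the pointer. */
void publish_foo(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		areas[cpu].initialized = 1;	/* initialize(per_cpu_ptr(p, cpu)) */
	atomic_store_explicit(&gp, areas, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above, so a
 * non-NULL pointer guarantees the initialization is visible. */
struct foo *read_foo(int cpu)
{
	struct foo *p = atomic_load_explicit(&gp, memory_order_acquire);
	return p ? &p[cpu] : NULL;
}
```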

Thoughts?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-14 11:39           ` Paul E. McKenney
@ 2014-07-14 15:22             ` Christoph Lameter
  2014-07-15 10:11               ` Paul E. McKenney
  2014-07-15 11:50             ` Rusty Russell
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-07-14 15:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Rusty Russell, Tejun Heo, David Howells, Linus Torvalds,
	Andrew Morton, Oleg Nesterov, linux-kernel

On Mon, 14 Jul 2014, Paul E. McKenney wrote:

> Here is the sort of thing that I would be concerned about:
>
> 	p = alloc_percpu(struct foo);
> 	for_each_possible_cpu(cpu)
> 		initialize(per_cpu_ptr(p, cpu));
> 	gp = p;
>
> We clearly need a memory barrier in there somewhere, and it cannot
> be buried in alloc_percpu().  Some cases avoid trouble due to locking,
> for example, initialize() might acquire a per-CPU lock and later uses
> might acquire that same lock.  Clearly, use of a global lock would not
> be helpful from a scalability viewpoint.

Knowledge of the offset p is not available to other processors before gp
is assigned.

gp usually is part of a struct that contains some form of serialization.
F.e. in the slab allocators there is a kmem_cache structure that contains
gp.

After alloc_percpu() and other preparatory work the structure is inserted
into a linked list while holding the global semaphore (slab_mutex). After
release of the semaphore the kmem_cache address is passed to the
subsystem. Then other processors can potentially use that new kmem_cache
structure to access new percpu data related to the new cache.

There is no scalability issue for the initialization since there cannot
be a concurrent access since the offset of the percpu value is not known
by other processors at that point.
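The ordering being relied on here can be sketched in user space, with a pthread mutex standing in for slab_mutex (the struct and function names are illustrative, not the real slab code): the unlock on the registering CPU is a release operation that pairs with the lock on any CPU walking the list, so a cache found on the list is always observed fully set up.

```c
#include <pthread.h>
#include <stddef.h>

struct kmem_cache_sketch {		/* illustrative, not the real struct */
	int percpu_offset;		/* stands in for the percpu offset */
	struct kmem_cache_sketch *next;
};

static pthread_mutex_t slab_mutex_sketch = PTHREAD_MUTEX_INITIALIZER;
static struct kmem_cache_sketch *cache_list;

/* Registering CPU: fully set up the structure, then insert it under the
 * mutex.  The unlock orders the setup before the insertion becomes
 * visible to anyone who later takes the mutex. */
void register_cache(struct kmem_cache_sketch *c, int offset)
{
	c->percpu_offset = offset;
	pthread_mutex_lock(&slab_mutex_sketch);
	c->next = cache_list;
	cache_list = c;
	pthread_mutex_unlock(&slab_mutex_sketch);
}

/* Other CPUs: taking the same mutex is an acquire operation, so every
 * cache on the list is seen with its setup complete. */
int first_cache_offset(void)
{
	int off = -1;
	pthread_mutex_lock(&slab_mutex_sketch);
	if (cache_list)
		off = cache_list->percpu_offset;
	pthread_mutex_unlock(&slab_mutex_sketch);
	return off;
}
```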




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-14 15:22             ` Christoph Lameter
@ 2014-07-15 10:11               ` Paul E. McKenney
  2014-07-15 14:06                 ` Christoph Lameter
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2014-07-15 10:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rusty Russell, Tejun Heo, David Howells, Linus Torvalds,
	Andrew Morton, Oleg Nesterov, linux-kernel

On Mon, Jul 14, 2014 at 10:22:08AM -0500, Christoph Lameter wrote:
> On Mon, 14 Jul 2014, Paul E. McKenney wrote:
> 
> > Here is the sort of thing that I would be concerned about:
> >
> > 	p = alloc_percpu(struct foo);
> > 	for_each_possible_cpu(cpu)
> > 		initialize(per_cpu_ptr(p, cpu));
> > 	gp = p;
> >
> > We clearly need a memory barrier in there somewhere, and it cannot
> > be buried in alloc_percpu().  Some cases avoid trouble due to locking,
> > for example, initialize() might acquire a per-CPU lock and later uses
> > might acquire that same lock.  Clearly, use of a global lock would not
> > be helpful from a scalability viewpoint.
> 
> Knowledge of the offset p is not available to other processors before gp
> is assigned.
> 
> gp usually is part of a struct that contains some form of serialization.
> F.e. in the slab allocators there is a kmem_cache structure that contains
> gp.
> 
> After alloc_percpu() and other preparatory work the structure is inserted
> into a linked list while holding the global semaphore (slab_mutex). After
> release of the semaphore the kmem_cache address is passed to the
> subsystem. Then other processors can potentially use that new kmem_cache
> structure to access new percpu data related to the new cache.
> 
> There is no scalability issue for the initialization since there cannot
> be a concurrent access since the offset of the percpu value is not known
> by other processors at that point.

If I understand your initialization procedure correctly, you need at least
an smp_wmb() on the update side and at least an smp_read_barrier_depends()
on the read side.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-14 11:39           ` Paul E. McKenney
  2014-07-14 15:22             ` Christoph Lameter
@ 2014-07-15 11:50             ` Rusty Russell
  1 sibling, 0 replies; 43+ messages in thread
From: Rusty Russell @ 2014-07-15 11:50 UTC (permalink / raw)
  To: paulmck
  Cc: Tejun Heo, Christoph Lameter, David Howells, Linus Torvalds,
	Andrew Morton, Oleg Nesterov, linux-kernel

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> On Wed, Jul 09, 2014 at 10:25:44AM +0930, Rusty Russell wrote:
>> Tejun Heo <tj@kernel.org> writes:
>> > Hello, Paul.
>> 
>> Rusty wakes up...
>
> ;-)
>
>> >> Good point.  How about per-CPU variables that are introduced by
>> >> loadable modules?  (I would guess that there are plenty of memory
>> >> barriers in the load process, given that text and data also needs
>> >> to be visible to other CPUs.)
>> >
>> > (cc'ing Rusty, hi!)
>> >
>> > Percpu initialization happens in post_relocation() before
>> > module_finalize().  There seem to be enough operations which can act
>> > as write barrier afterwards but nothing seems explicit.
>> >
>> > I have no idea how we're guaranteeing that .data is visible to all
>> > cpus without barrier from reader side.  Maybe we don't allow something
>> > like the following?
>> >
>> >   module init				built-in code
>> >
>> >   static int mod_static_var = X;	if (builtin_ptr)
>> >   builtin_ptr = &mod_static_var;		WARN_ON(*builtin_ptr != X);
>> >
>> > Rusty, can you please enlighten me?
>> 
>> Subtle, but I think in theory (though not in practice) this can happen.
>> 
>> Making this the assigner's responsibility is nasty, since we reasonably
>> assume that .data is consistent across CPUs once code is executing
>> (similarly on boot).
>> 
>> >> Again, it won't help for the allocator to strongly order the
>> >> initialization to zero if there are additional initializations of some
>> >> fields to non-zero values.  And again, it should be a lot easier to
>> >> require the smp_store_release() or whatever uniformly than only in cases
>> >> where additional initialization occurred.
>> >
>> > This one is less murky as we can say that the cpu which allocated owns
>> > the zeroing; however, it still deviates from requiring the one which
>> > makes changes to take care of barriering for those changes, which is
>> > what makes me feel a bit uneasy.  IOW, it's the allocator which
>> > cleared the memory, why should its users worry about in-flight
>> > operations from it?  That said, this poses a lot less issues compared
>> > to percpu ones as passing normal pointers to other cpus w/o going
>> > through proper set of barriers is a special thing to do anyway.
>> 
>> I think that the implicit per-cpu allocations done by modules need to
>> be consistent once the module is running.
>> 
>> I'm deeply reluctant to advocate it in the other per-cpu cases though.
>> Once we add a barrier, it's impossible to remove: callers may subtly
>> rely on the behavior.
>> 
>> "Magic barrier sprinkles" is a bad path to start down, IMHO.
>
> Here is the sort of thing that I would be concerned about:
>
> 	p = alloc_percpu(struct foo);
> 	for_each_possible_cpu(cpu)
> 		initialize(per_cpu_ptr(p, cpu));
> 	gp = p;
>
> We clearly need a memory barrier in there somewhere, and it cannot
> be buried in alloc_percpu().  Some cases avoid trouble due to locking,
> for example, initialize() might acquire a per-CPU lock and later uses
> might acquire that same lock.  Clearly, use of a global lock would not
> be helpful from a scalability viewpoint.

I agree with Christoph: there's no per-cpu-unique peculiarity here.
Anyone who exposes a pointer needs a barrier first.

And the per-cpu allocation for modules is under a mutex, so that case is
already covered.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-15 10:11               ` Paul E. McKenney
@ 2014-07-15 14:06                 ` Christoph Lameter
  2014-07-15 14:32                   ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-07-15 14:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Rusty Russell, Tejun Heo, David Howells, Linus Torvalds,
	Andrew Morton, Oleg Nesterov, linux-kernel

On Tue, 15 Jul 2014, Paul E. McKenney wrote:

> If I understand your initialization procedure correctly, you need at least
> an smp_wmb() on the update side and at least an smp_read_barrier_depends()
> on the read side.

A barrier for data that is not in the cache of the read side? That has
not been accessed yet (well there could have been a free_percpu before but
if so then the cache line was evicted by the initialization code).



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-15 14:06                 ` Christoph Lameter
@ 2014-07-15 14:32                   ` Paul E. McKenney
  2014-07-15 15:06                     ` Christoph Lameter
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2014-07-15 14:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rusty Russell, Tejun Heo, David Howells, Linus Torvalds,
	Andrew Morton, Oleg Nesterov, linux-kernel

On Tue, Jul 15, 2014 at 09:06:00AM -0500, Christoph Lameter wrote:
> On Tue, 15 Jul 2014, Paul E. McKenney wrote:
> 
> > If I understand your initialization procedure correctly, you need at least
> > an smp_wmb() on the update side and at least an smp_read_barrier_depends()
> > on the read side.
> 
> A barrier for data that is not in the cache of the read side? That has
> not been accessed yet (well there could have been a free_percpu before but
> if so then the cache line was evicted by the initialization code).

http://www.openvms.compaq.com/wizard/wiz_2637.html

Besides which, if you don't have barriers on the initialization side,
then both the CPU and the compiler are free to update the pointer before
completing the initialization, which can leave old stuff still in other
CPUs' caches for long enough to break you.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-15 14:32                   ` Paul E. McKenney
@ 2014-07-15 15:06                     ` Christoph Lameter
  2014-07-15 15:41                       ` Linus Torvalds
  2014-07-15 17:41                       ` Paul E. McKenney
  0 siblings, 2 replies; 43+ messages in thread
From: Christoph Lameter @ 2014-07-15 15:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Rusty Russell, Tejun Heo, David Howells, Linus Torvalds,
	Andrew Morton, Oleg Nesterov, linux-kernel

On Tue, 15 Jul 2014, Paul E. McKenney wrote:

> On Tue, Jul 15, 2014 at 09:06:00AM -0500, Christoph Lameter wrote:
> > On Tue, 15 Jul 2014, Paul E. McKenney wrote:
> >
> > > If I understand your initialization procedure correctly, you need at least
> > > an smp_wmb() on the update side and at least an smp_read_barrier_depends()
> > > on the read side.
> >
> > A barrier for data that is not in the cache of the read side? That has
> > not been accessed yet (well there could have been a free_percpu before but
> > if so then the cache line was evicted by the initialization code).
>
> http://www.openvms.compaq.com/wizard/wiz_2637.html

Not sure what the intent of this link is?

> Besides which, if you don't have barriers on the initialization side,
> then both the CPU and the compiler are free to update the pointer before
> completing the initialization, which can leave old stuff still in other
> CPUs' caches for long enough to break you.

The cachelines will be evicted from the other processors at
initialization. alloc_percpu *itself* zeroes all data on each percpu areas
before returning the offset to the percpu data structure. See
pcpu_populate_chunk(). At that point *all* other processors have those
cachelines no longer in their caches. The initialization done with values
specific to the subsystem is not that important.

The return value of the function is only available after
pcpu_populate_chunk() returns.

Access to those cachelines is possible only after the other processors
have obtained the offset that was stored in some data structure. That
usually involves additional synchronization which implies barriers
anyway.

I do not think there is anything here.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-15 15:06                     ` Christoph Lameter
@ 2014-07-15 15:41                       ` Linus Torvalds
  2014-07-15 16:12                         ` Christoph Lameter
  2014-07-15 17:41                       ` Paul E. McKenney
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-07-15 15:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Rusty Russell, Tejun Heo, David Howells,
	Andrew Morton, Oleg Nesterov, Linux Kernel Mailing List

Christoph, stop arguing. Trust me, Paul knows memory ordering. You
clearly do *not*.

On Tue, Jul 15, 2014 at 8:06 AM, Christoph Lameter <cl@gentwo.org> wrote:
>
> The cachelines will be evicted from the other processors at
> initialization. alloc_percpu *itself* zeroes all data on each percpu areas
> before returning the offset to the percpu data structure. See
> pcpu_populate_chunk(). At that point *all* other processors have those
> cachelines no longer in their caches. The initialization done with values
> specific to the subsystem is not that important.

In practice, with enough instructions in the CPU queues and
sufficiently small write buffers etc (or with a sufficiently ordered
CPU core, like x86), that may often be true. But there is absolutely
zero reason to think it's always true.

On the writer side, if there isn't a write barrier, the actual writes
can be visible to other CPU's in arbitrary order. *Including* the
visibility of the offset before the zeroing. Really.

On the reader side, for all sane CPU's, reading the offset and then
reading data from that offset is an implicit barrier. But "all sane"
is not "all". On alpha, reading the offset does NOT guarantee that you
see later data when you use that offset to read data. In theory, it
could be due to value prediction, but in practice it's actually due to
segmented caches, so that one part of the cache has seen data that
arrived "later" (ie written _after_ the wmb on the writing CPU)
_before_ it sees data that arrived earlier. That's what the
"smp_read_barrier_depends()" protects against.
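The writer/reader pairing described above can be sketched in user-space C11 (illustrative names; memory_order_consume is C11's analogue of the dependency ordering that smp_read_barrier_depends() restores on Alpha, and compilers in practice promote it to acquire, which is safe):

```c
#include <stdatomic.h>
#include <stddef.h>

static int data_area[4];		/* stand-in for the percpu area */
static _Atomic(int *) published;	/* stand-in for the published offset */

/* Writer side: without the release (the kernel's smp_wmb()), the store
 * to the data and the publication of the pointer may become visible to
 * other CPUs in either order. */
void writer(void)
{
	data_area[0] = 42;
	atomic_store_explicit(&published, data_area, memory_order_release);
}

/* Reader side: consume ordering makes the dependent load (p[0]) see the
 * data written before the release store, even on Alpha-like hardware
 * where a plain dependent read would not suffice. */
int reader(void)
{
	int *p = atomic_load_explicit(&published, memory_order_consume);
	return p ? p[0] : -1;
}
```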

> The return value of the function is only available after
> pcpu_populate_chunk() returns.

Really, "before" and "after" have ABSOLUTELY NO MEANING unless you
have a barrier. And you're arguing against those barriers. So you
cannot use "before" as an argument, since in your world, no such thing
even exists!

There are other arguments, but they basically boil down to "no other
CPU ever accesses the per-cpu data of *this* CPU" (wrong) or "the
users will do their own barriers" (maybe true, maybe not). Your "value
is only available after" argument really isn't an argument. Not
without those barriers.

            Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-15 15:41                       ` Linus Torvalds
@ 2014-07-15 16:12                         ` Christoph Lameter
       [not found]                           ` <CA+55aFxU166V5-vH4vmK9OBdTZKyede=71RjjbOVSN9Qh+Se+A@mail.gmail.com>
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2014-07-15 16:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul E. McKenney, Rusty Russell, Tejun Heo, David Howells,
	Andrew Morton, Oleg Nesterov, Linux Kernel Mailing List

On Tue, 15 Jul 2014, Linus Torvalds wrote:

> Really, "before" and "after" have ABSOLUTELY NO MEANING unless you
> have a barrier. And you're arguing against those barriers. So you
> cannot use "before" as an argument, since in your world, no such thing
> even exists!

I mentioned that there is a barrier because the process of handing over
the offset to the other processors includes synchronization. In the slab
case this is a semaphore that is used to protect the structure and the
list of kmem_cache structures. The control struct containing the offset
must be entered somehow into something that tracks it for the future, and
thus there is synchronization by the subsystem.

> There are other arguments, but they basically boil down to "no other
> CPU ever accesses the per-cpu data of *this* CPU" (wrong) or "the
> users will do their own barriers" (maybe true, maybe not). Your "value
> is only available after" argument really isn't an argument. Not
> without those barriers.

Ok so what is happening is:

1. cacheline is zeroed on per_cpu_alloc but still exists in the remote
processor's cache.

(We could actually insert code in alloc_percpu to ensure that the remote
caches are cleaned and not proceed unless that is complete; alloc_percpu
is not performance critical.)

2. cacheline is initialized with new values by the subsystem looping over
all percpu instances. Other processor still keeps the old data.

3. mutex is taken, list modifications occur, mutex is released. Remote
processor still keeps the old cacheline data.

4. Subsystem makes the percpu offset available.

5. The remote processor uses its instance of the per cpu data for the
first time, using the offset to locate its percpu area. This typically
means it is updating the cacheline (and we hope that the cacheline will
then stay in exclusive state for good, for performance reasons).

And now we still see the old data. The cacheline changes of the initial
processor are ignored?

Ok if this is the case then we have another way of dealing with this in
alloc_percpu. Either zap the relevant remote cpu caches after the areas
were zeroed or do an IPI to make the remote processor run the percpu area
initialization.





^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-15 15:06                     ` Christoph Lameter
  2014-07-15 15:41                       ` Linus Torvalds
@ 2014-07-15 17:41                       ` Paul E. McKenney
  2014-07-16 14:40                         ` Christoph Lameter
  1 sibling, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2014-07-15 17:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rusty Russell, Tejun Heo, David Howells, Linus Torvalds,
	Andrew Morton, Oleg Nesterov, linux-kernel

On Tue, Jul 15, 2014 at 10:06:01AM -0500, Christoph Lameter wrote:
> On Tue, 15 Jul 2014, Paul E. McKenney wrote:
> 
> > On Tue, Jul 15, 2014 at 09:06:00AM -0500, Christoph Lameter wrote:
> > > On Tue, 15 Jul 2014, Paul E. McKenney wrote:
> > >
> > > > If I understand your initialization procedure correctly, you need at least
> > > > an smp_wmb() on the update side and at least an smp_read_barrier_depends()
> > > > on the read side.
> > >
> > > A barrier for data that is not in the cache of the read side? That has
> > > not been accessed yet (well there could have been a free_percpu before but
> > > if so then the cache line was evicted by the initialization code).
> >
> > http://www.openvms.compaq.com/wizard/wiz_2637.html
> 
> Not sure what the intent of this link is?

To demonstrate that at least one (mostly historical but nevertheless
very real) architecture can do this:

	p = ACCESS_ONCE(gp);
	r1 = p->a;

and see pre-initialized data in r1 -even- -if- the initialization made
full and careful use of memory barriers.  Aggressive (and mostly not
yet real-world) compiler optimizations can have the same effect.

> > Besides which, if you don't have barriers on the initialization side,
> > then both the CPU and the compiler are free to update the pointer before
> > completing the initialization, which can leave old stuff still in other
> > CPUs' caches for long enough to break you.
> 
> The cachelines will be evicted from the other processors at
> initialization. alloc_percpu *itself* zeroes all data in each percpu area
> before returning the offset to the percpu data structure. See
> pcpu_populate_chunk(). At that point *all* other processors no longer have
> those cachelines in their caches. The initialization done with values
> specific to the subsystem is not that important.
> 
> The return value of the function is only available after
> pcpu_populate_chunk() returns.
> 
> Access to those cachelines is possible only after the other processors
> have obtained the offset that was stored in some data structure. That
> usually involves additional synchronization which implies barriers
> anyways.
> 
> I do not think there is anything here.

Sorry, but whether you see it or not, there is a very real need for at
least an smp_wmb() from the initializing code and at least an
smp_read_barrier_depends() from the reading code.

							Thanx, Paul



* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
       [not found]                           ` <CA+55aFxU166V5-vH4vmK9OBdTZKyede=71RjjbOVSN9Qh+Se+A@mail.gmail.com>
@ 2014-07-15 17:45                             ` Paul E. McKenney
  0 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2014-07-15 17:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Kernel Mailing List, Andrew Morton,
	Rusty Russell, Tejun Heo, David Howells, Oleg Nesterov

On Tue, Jul 15, 2014 at 10:32:13AM -0700, Linus Torvalds wrote:
> On Jul 15, 2014 9:12 AM, "Christoph Lameter" <cl@gentwo.org> wrote:
> >
> > I mentioned that there is a barrier because the process of handing over
> > the offset to the other includes synchronization. In the slab case this is
> > a semaphore that is use to protect the structure and the list of
> > kmem_cache structures. The control struct containing the offset must be
> > entered somehow into something that tracks it for the future and thus
> there is synchronization by the subsystem.
> 
> Maybe. It's not at all obvious, though. That control structure may be a
> lockless percpu thing, after all. We have front end caches to the
> allocators etc..
> 
> So yes, if there is a lock+unlock, that can be a memory barrier - but even
> that is not necessarily true on all architectures.
> 
> Is there even always one, though? And what about the other CPU? Does it
> have any paired barrier?
> 
> > And now we still see the old data. The cacheline changes of the initial
> > processor are ignored?
> 
> They aren't "ignored". But without the proper barriers, the zeroing writes
> may still be buffered on the other CPU, and may not even have had a cache
> line allocated to them!
> 
> That's the kind of thing that makes a barrier possibly required.
> 
> But yes, the barrier may be implied by other synchronization if that
> exists. A lock doesn't necessarily help, though. A write before a lock can
> migrate down into the locked region, and a write after the lock may
> similarly migrate into the locked region, and then two such writes may be
> seen out of order on another CPU that doesn't take the lock itself to
> serialize.
> 
> See? It's those kinds of things that can cause really subtle memory
> ordering problems. We generally never see them on x86, since writes are
> always ordered against other writes, and reads are always ordered wrt other
> reads (but not reads against writes) and all locks are full memory barriers.
> 
> But on other architectures even a lock+unlock may not be a barrier, since
> things moving into the locked region is fine (the lock means that things
> had better not move *out* of a locked region).
> 
> So depending on serialization may not always be all that obvious. Even if
> it does happen to all work on x86.
> 
> And don't get me wrong. I despise weak memory models. I think they are a
> mistake. I absolutely detest the alpha model, and I think PowerPC and ARM
> are wrong too. It's just that we have to try to care about their broken
> crap )^:

Well, at least all the required barriers are free on TSO systems like
x86 and mainframe, and the read-side barriers are free even on ARM and
PowerPC (not so much on DEC Alpha, but you can't have everything).  OK,
OK, there is a volatile cast and/or a barrier() directive that might
constrain the compiler a bit in some cases.

So this might well be crap, but at least it is (nearly) zero-cost crap
from the viewpoint of TSO machines like x86.  ;-)

							Thanx, Paul



* Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations
  2014-07-15 17:41                       ` Paul E. McKenney
@ 2014-07-16 14:40                         ` Christoph Lameter
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Lameter @ 2014-07-16 14:40 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Rusty Russell, Tejun Heo, David Howells, Linus Torvalds,
	Andrew Morton, Oleg Nesterov, linux-kernel

On Tue, 15 Jul 2014, Paul E. McKenney wrote:

> Sorry, but whether you see it or not, there is a very real need for at
> least an smp_wmb() from the initializing code and at least an
> smp_read_barrier_depends() from the reading code.

Nope, this is the wrong approach. Adding synchronization to basic
primitives that are designed to be as fast as possible is not good. Even
if it does nothing, it tends to bloat over time and put the wrong ideas in
people's heads.

Fixing this is possible by making sure that the remote cachelines are
evicted on alloc_percpu. alloc_percpu should not be allowed to return
until it has made sure that the remote cachelines all have been evicted.
On some weakly ordered platforms that may require special instructions or
even an IPI to move the initialization done by alloc_percpu to the remote
cpu.



end of thread, other threads:[~2014-07-17  4:55 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-12 13:56 [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations Tejun Heo
2014-06-12 15:34 ` Paul E. McKenney
2014-06-12 15:52   ` Tejun Heo
2014-06-17 14:41     ` Paul E. McKenney
2014-06-17 15:27       ` Tejun Heo
2014-06-17 15:56         ` Christoph Lameter
2014-06-17 16:00           ` Tejun Heo
2014-06-17 16:05             ` Tejun Heo
2014-06-17 16:28               ` Christoph Lameter
     [not found]                 ` <CA+55aFxHr8JXwDR-4g4z1mkXvZRtY=OosYcUMPZRD2upfooS1w@mail.gmail.com>
2014-06-17 18:47                   ` Christoph Lameter
2014-06-17 18:55                     ` Paul E. McKenney
2014-06-17 19:39                       ` Christoph Lameter
2014-06-17 19:47                         ` Tejun Heo
2014-06-17 19:56                         ` Paul E. McKenney
2014-06-19 20:39                           ` Christoph Lameter
2014-06-17 16:57         ` Paul E. McKenney
2014-06-17 18:56           ` Tejun Heo
2014-06-17 19:42             ` Christoph Lameter
2014-06-17 20:44               ` Tejun Heo
2014-07-09  0:55         ` Rusty Russell
2014-07-14 11:39           ` Paul E. McKenney
2014-07-14 15:22             ` Christoph Lameter
2014-07-15 10:11               ` Paul E. McKenney
2014-07-15 14:06                 ` Christoph Lameter
2014-07-15 14:32                   ` Paul E. McKenney
2014-07-15 15:06                     ` Christoph Lameter
2014-07-15 15:41                       ` Linus Torvalds
2014-07-15 16:12                         ` Christoph Lameter
     [not found]                           ` <CA+55aFxU166V5-vH4vmK9OBdTZKyede=71RjjbOVSN9Qh+Se+A@mail.gmail.com>
2014-07-15 17:45                             ` Paul E. McKenney
2014-07-15 17:41                       ` Paul E. McKenney
2014-07-16 14:40                         ` Christoph Lameter
2014-07-15 11:50             ` Rusty Russell
2014-06-17 19:27 ` Christoph Lameter
2014-06-17 19:40   ` Paul E. McKenney
2014-06-19 20:42     ` Christoph Lameter
2014-06-19 20:46       ` Tejun Heo
2014-06-19 21:11         ` Christoph Lameter
2014-06-19 21:15           ` Tejun Heo
2014-06-20 15:23             ` Christoph Lameter
2014-06-20 15:52               ` Tejun Heo
2014-06-19 20:51       ` Paul E. McKenney
2014-06-20 15:29         ` Christoph Lameter
2014-06-20 15:50           ` Paul E. McKenney
