* [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
@ 2014-09-02 20:14 Christoph Lameter
  2014-09-02 20:37 ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2014-09-02 20:14 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel


Since dynticks_idle is only ever modified by the local cpu we do
not need to use an atomic there. The weak "atomicity" of this_cpu
ops is sufficient since there is no other cpu modifying the variable.

[This is a cautious patch that leaves the barriers in place]

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/kernel/rcu/tree.c
===================================================================
--- linux.orig/kernel/rcu/tree.c
+++ linux/kernel/rcu/tree.c
@@ -213,7 +213,7 @@ static DEFINE_PER_CPU(struct rcu_dyntick
 	.dynticks = ATOMIC_INIT(1),
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 	.dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
-	.dynticks_idle = ATOMIC_INIT(1),
+	.dynticks_idle = 1,
 #endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 };

Index: linux/kernel/rcu/tree.h
===================================================================
--- linux.orig/kernel/rcu/tree.h
+++ linux/kernel/rcu/tree.h
@@ -91,7 +91,7 @@ struct rcu_dynticks {
 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
 	long long dynticks_idle_nesting;
 				    /* irq/process nesting level from idle. */
-	atomic_t dynticks_idle;	    /* Even value for idle, else odd. */
+	long dynticks_idle;	    /* Even value for idle, else odd. */
 				    /*  "Idle" excludes userspace execution. */
 	unsigned long dynticks_idle_jiffies;
 				    /* End of last non-NMI non-idle period. */
Index: linux/kernel/rcu/tree_plugin.h
===================================================================
--- linux.orig/kernel/rcu/tree_plugin.h
+++ linux/kernel/rcu/tree_plugin.h
@@ -2644,9 +2644,9 @@ static void rcu_sysidle_enter(int irq)
 	j = jiffies;
 	ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
 	smp_mb__before_atomic();
-	atomic_inc(&rdtp->dynticks_idle);
+	this_cpu_inc(rcu_dynticks.dynticks_idle);
 	smp_mb__after_atomic();
-	WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
+	WARN_ON_ONCE(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1);
 }

 /*
@@ -2712,9 +2712,9 @@ static void rcu_sysidle_exit(int irq)

 	/* Record end of idle period. */
 	smp_mb__before_atomic();
-	atomic_inc(&rdtp->dynticks_idle);
+	this_cpu_inc(rcu_dynticks.dynticks_idle);
 	smp_mb__after_atomic();
-	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
+	WARN_ON_ONCE(!(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1));

 	/*
 	 * If we are the timekeeping CPU, we are permitted to be non-idle
@@ -2755,7 +2755,7 @@ static void rcu_sysidle_check_cpu(struct
 		WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);

 	/* Pick up current idle and NMI-nesting counter and check. */
-	cur = atomic_read(&rdtp->dynticks_idle);
+	cur = rdtp->dynticks_idle;
 	if (cur & 0x1) {
 		*isidle = false; /* We are not idle! */
 		return;
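
A minimal sketch of the access pattern at issue, with illustrative
demo_* names rather than the kernel's: the owning CPU is the only
writer, and remote CPUs only ever read.

	/* Sketch only: local-only updates, remote plain reads. */
	DEFINE_PER_CPU(long, demo_idle_count);

	static void demo_local_update(void)	/* owning CPU only */
	{
		/*
		 * this_cpu_inc() is an IRQ-safe read-modify-write of the
		 * local CPU's copy, but it is not a LOCKed RMW and it
		 * implies no memory barrier.
		 */
		this_cpu_inc(demo_idle_count);
	}

	static long demo_remote_read(int cpu)	/* any CPU */
	{
		/* Plain load; any ordering must come from explicit barriers. */
		return ACCESS_ONCE(per_cpu(demo_idle_count, cpu));
	}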


* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-02 20:14 [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops Christoph Lameter
@ 2014-09-02 20:37 ` Paul E. McKenney
  2014-09-02 20:47   ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2014-09-02 20:37 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Frederic Weisbecker, linux-kernel

On Tue, Sep 02, 2014 at 03:14:43PM -0500, Christoph Lameter wrote:
> 
> Since dynticks_idle is only ever modified by the local cpu we do
> not need to use an atomic there. The weak "atomicity" of this_cpu
> ops is sufficient since there is no other cpu modifying the variable.
> 
> [This is a cautious patch that leaves the barriers in place]

Actually, not so cautious.  On x86:

#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic()  barrier()

But yes, in theory, something like this can work if appropriate memory
barriers are put in place.  In practice, this sort of change needs
profound testing on multiple architectures.
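
Schematically, comparing the two sequences on x86 (a sketch using the
names from the patch; with the patch applied, the hardware fence that
the LOCKed increment used to provide is gone):

	static void ordered_via_atomic(struct rcu_dynticks *rdtp)
	{
		smp_mb__before_atomic();	  /* barrier(): compiler-only */
		atomic_inc(&rdtp->dynticks_idle); /* LOCK INC: full fence */
		smp_mb__after_atomic();		  /* barrier(): compiler-only */
	}

	static void unordered_via_this_cpu(void)
	{
		smp_mb__before_atomic();	  /* barrier(): compiler-only */
		this_cpu_inc(rcu_dynticks.dynticks_idle); /* INC %gs:..., no LOCK */
		smp_mb__after_atomic();		  /* no hardware fence remains */
	}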

							Thanx, Paul

> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/kernel/rcu/tree.c
> ===================================================================
> --- linux.orig/kernel/rcu/tree.c
> +++ linux/kernel/rcu/tree.c
> @@ -213,7 +213,7 @@ static DEFINE_PER_CPU(struct rcu_dyntick
>  	.dynticks = ATOMIC_INIT(1),
>  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>  	.dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
> -	.dynticks_idle = ATOMIC_INIT(1),
> +	.dynticks_idle = 1,
>  #endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
>  };
> 
> Index: linux/kernel/rcu/tree.h
> ===================================================================
> --- linux.orig/kernel/rcu/tree.h
> +++ linux/kernel/rcu/tree.h
> @@ -91,7 +91,7 @@ struct rcu_dynticks {
>  #ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>  	long long dynticks_idle_nesting;
>  				    /* irq/process nesting level from idle. */
> -	atomic_t dynticks_idle;	    /* Even value for idle, else odd. */
> +	long dynticks_idle;	    /* Even value for idle, else odd. */
>  				    /*  "Idle" excludes userspace execution. */
>  	unsigned long dynticks_idle_jiffies;
>  				    /* End of last non-NMI non-idle period. */
> Index: linux/kernel/rcu/tree_plugin.h
> ===================================================================
> --- linux.orig/kernel/rcu/tree_plugin.h
> +++ linux/kernel/rcu/tree_plugin.h
> @@ -2644,9 +2644,9 @@ static void rcu_sysidle_enter(int irq)
>  	j = jiffies;
>  	ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
>  	smp_mb__before_atomic();
> -	atomic_inc(&rdtp->dynticks_idle);
> +	this_cpu_inc(rcu_dynticks.dynticks_idle);
>  	smp_mb__after_atomic();
> -	WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
> +	WARN_ON_ONCE(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1);
>  }
> 
>  /*
> @@ -2712,9 +2712,9 @@ static void rcu_sysidle_exit(int irq)
> 
>  	/* Record end of idle period. */
>  	smp_mb__before_atomic();
> -	atomic_inc(&rdtp->dynticks_idle);
> +	this_cpu_inc(rcu_dynticks.dynticks_idle);
>  	smp_mb__after_atomic();
> -	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
> +	WARN_ON_ONCE(!(__this_cpu_read(rcu_dynticks.dynticks_idle) & 0x1));
> 
>  	/*
>  	 * If we are the timekeeping CPU, we are permitted to be non-idle
> @@ -2755,7 +2755,7 @@ static void rcu_sysidle_check_cpu(struct
>  		WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);
> 
>  	/* Pick up current idle and NMI-nesting counter and check. */
> -	cur = atomic_read(&rdtp->dynticks_idle);
> +	cur = rdtp->dynticks_idle;
>  	if (cur & 0x1) {
>  		*isidle = false; /* We are not idle! */
>  		return;
> 



* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-02 20:37 ` Paul E. McKenney
@ 2014-09-02 20:47   ` Christoph Lameter
  2014-09-02 21:15     ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2014-09-02 20:47 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Tue, 2 Sep 2014, Paul E. McKenney wrote:

> But yes, in theory, something like this can work if appropriate memory
> barriers are put in place.  In practice, this sort of change needs
> profound testing on multiple architectures.

Ok how can we move forward? I just looked further and it seems a similar
approach could perhaps work for the dynticks field.

If the first patch I send gets merged then a lot of other this_cpu related
optimizations become possible regardless of the ones in the RFC.




* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-02 20:47   ` Christoph Lameter
@ 2014-09-02 21:15     ` Paul E. McKenney
  2014-09-02 23:22       ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2014-09-02 21:15 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Frederic Weisbecker, linux-kernel

On Tue, Sep 02, 2014 at 03:47:55PM -0500, Christoph Lameter wrote:
> On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> 
> > But yes, in theory, something like this can work if appropriate memory
> > barriers are put in place.  In practice, this sort of change needs
> > profound testing on multiple architectures.
> 
> Ok how can we move forward? I just looked further and it seems a similar
> approach could perhaps work for the dynticks field.

Yep, these two have been on my "when I am feeling insanely gutsy" list
for quite some time.

But I have to ask...  On x86, is a pair of mfence instructions really
cheaper than an atomic increment?

> If the first patch I send gets merged then a lot of other this_cpu related
> optimizations become possible regardless of the ones in the RFC.

Yep, I am queuing that one.

But could you please do future patches against the rcu/dev branch of
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git?
I had to hand-apply due to conflicts.  Please see below for my version,
and please check to make sure that I didn't mess something up in the
translation.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Remove rcu_dynticks * parameters when they are always this_cpu_ptr(&rcu_dynticks)

For some functions in kernel/rcu/tree* the rdtp parameter is always
this_cpu_ptr(&rcu_dynticks).  Remove the parameter when it is constant
and calculate the pointer in the function.

This has the advantage that it is obvious that the addresses are
all per cpu offsets and thus it will enable the use of this_cpu_ops in
the future.

Signed-off-by: Christoph Lameter <cl@linux.com>
[ paulmck: Forward-ported to rcu/dev, whitespace adjustment. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 929ea9a6e0a1..65c42a6465eb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -497,11 +497,11 @@ cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
  * we really have entered idle, and must do the appropriate accounting.
  * The caller must have disabled interrupts.
  */
-static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
-				bool user)
+static void rcu_eqs_enter_common(long long oldval, bool user)
 {
 	struct rcu_state *rsp;
 	struct rcu_data *rdp;
+	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
 
 	trace_rcu_dyntick(TPS("Start"), oldval, rdtp->dynticks_nesting);
 	if (!user && !is_idle_task(current)) {
@@ -552,7 +552,7 @@ static void rcu_eqs_enter(bool user)
 	WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
 	if ((oldval & DYNTICK_TASK_NEST_MASK) == DYNTICK_TASK_NEST_VALUE) {
 		rdtp->dynticks_nesting = 0;
-		rcu_eqs_enter_common(rdtp, oldval, user);
+		rcu_eqs_enter_common(oldval, user);
 	} else {
 		rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
 	}
@@ -576,7 +576,7 @@ void rcu_idle_enter(void)
 
 	local_irq_save(flags);
 	rcu_eqs_enter(false);
-	rcu_sysidle_enter(this_cpu_ptr(&rcu_dynticks), 0);
+	rcu_sysidle_enter(0);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -626,8 +626,8 @@ void rcu_irq_exit(void)
 	if (rdtp->dynticks_nesting)
 		trace_rcu_dyntick(TPS("--="), oldval, rdtp->dynticks_nesting);
 	else
-		rcu_eqs_enter_common(rdtp, oldval, true);
-	rcu_sysidle_enter(rdtp, 1);
+		rcu_eqs_enter_common(oldval, true);
+	rcu_sysidle_enter(1);
 	local_irq_restore(flags);
 }
 
@@ -638,9 +638,10 @@ void rcu_irq_exit(void)
  * we really have exited idle, and must do the appropriate accounting.
  * The caller must have disabled interrupts.
  */
-static void rcu_eqs_exit_common(struct rcu_dynticks *rdtp, long long oldval,
-			       int user)
+static void rcu_eqs_exit_common(long long oldval, int user)
 {
+	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+
 	rcu_dynticks_task_exit();
 	smp_mb__before_atomic();  /* Force ordering w/previous sojourn. */
 	atomic_inc(&rdtp->dynticks);
@@ -678,7 +679,7 @@ static void rcu_eqs_exit(bool user)
 		rdtp->dynticks_nesting += DYNTICK_TASK_NEST_VALUE;
 	} else {
 		rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
-		rcu_eqs_exit_common(rdtp, oldval, user);
+		rcu_eqs_exit_common(oldval, user);
 	}
 }
 
@@ -699,7 +700,7 @@ void rcu_idle_exit(void)
 
 	local_irq_save(flags);
 	rcu_eqs_exit(false);
-	rcu_sysidle_exit(this_cpu_ptr(&rcu_dynticks), 0);
+	rcu_sysidle_exit(0);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
@@ -750,8 +751,8 @@ void rcu_irq_enter(void)
 	if (oldval)
 		trace_rcu_dyntick(TPS("++="), oldval, rdtp->dynticks_nesting);
 	else
-		rcu_eqs_exit_common(rdtp, oldval, true);
-	rcu_sysidle_exit(rdtp, 1);
+		rcu_eqs_exit_common(oldval, true);
+	rcu_sysidle_exit(1);
 	local_irq_restore(flags);
 }
 
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 06d615b1eaa9..f37f77c13c86 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -605,8 +605,8 @@ static void __init rcu_organize_nocb_kthreads(struct rcu_state *rsp);
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 static void __maybe_unused rcu_kick_nohz_cpu(int cpu);
 static bool init_nocb_callback_list(struct rcu_data *rdp);
-static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
-static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
+static void rcu_sysidle_enter(int irq);
+static void rcu_sysidle_exit(int irq);
 static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
 				  unsigned long *maxj);
 static bool is_sysidle_rcu_state(struct rcu_state *rsp);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b705bd3de4d8..10e2424ae4ab 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2728,9 +2728,10 @@ static int full_sysidle_state;		/* Current system-idle state. */
  * to detect full-system idle states, not RCU quiescent states and grace
  * periods.  The caller must have disabled interrupts.
  */
-static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
+static void rcu_sysidle_enter(int irq)
 {
 	unsigned long j;
+	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
 
 	/* If there are no nohz_full= CPUs, no need to track this. */
 	if (!tick_nohz_full_enabled())
@@ -2799,8 +2800,10 @@ void rcu_sysidle_force_exit(void)
  * usermode execution does -not- count as idle here!  The caller must
  * have disabled interrupts.
  */
-static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
+static void rcu_sysidle_exit(int irq)
 {
+	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+
 	/* If there are no nohz_full= CPUs, no need to track this. */
 	if (!tick_nohz_full_enabled())
 		return;
@@ -3094,11 +3097,11 @@ static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
 
 #else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
 
-static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
+static void rcu_sysidle_enter(int irq)
 {
 }
 
-static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
+static void rcu_sysidle_exit(int irq)
 {
 }
 



* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-02 21:15     ` Paul E. McKenney
@ 2014-09-02 23:22       ` Christoph Lameter
  2014-09-03  2:11         ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2014-09-02 23:22 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Tue, 2 Sep 2014, Paul E. McKenney wrote:

> Yep, these two have been on my "when I am feeling insanely gutsy" list
> for quite some time.
>
> But I have to ask...  On x86, is a pair of mfence instructions really
> cheaper than an atomic increment?

Not sure why you would need an mfence instruction?

> > If the first patch I send gets merged then a lot of other this_cpu related
> > optimizations become possible regardless of the ones in the RFC.
>
> Yep, I am queuing that one.

Great.

> But could you please do future patches against the rcu/dev branch of
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git?
> I had to hand-apply due to conflicts.  Please see below for my version,
> and please check to make sure that I didn't mess something up in the
> translation.

Looks ok. Will use the correct tree next time.




* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-02 23:22       ` Christoph Lameter
@ 2014-09-03  2:11         ` Paul E. McKenney
  2014-09-03 14:10           ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2014-09-03  2:11 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Frederic Weisbecker, linux-kernel

On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
> On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> 
> > Yep, these two have been on my "when I am feeling insanely gutsy" list
> > for quite some time.
> >
> > But I have to ask...  On x86, is a pair of mfence instructions really
> > cheaper than an atomic increment?
> 
> Not sure why you would need an mfence instruction?

Because otherwise RCU can break.  As soon as the grace-period machinery
sees that the value of this variable is even, it assumes a quiescent
state.  If there are no memory barriers, the non-quiescent code might
not have completed executing, and your kernel's actuarial statistics
become sub-optimal.

							Thanx, Paul

> > > If the first patch I send gets merged then a lot of other this_cpu related
> > > optimizations become possible regardless of the ones in the RFC.
> >
> > Yep, I am queuing that one.
> 
> Great.
> 
> > But could you please do future patches against the rcu/dev branch of
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git?
> > I had to hand-apply due to conflicts.  Please see below for my version,
> > and please check to make sure that I didn't mess something up in the
> > translation.
> 
> Looks ok. Will use the correct tree next time.
> 
> 



* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-03  2:11         ` Paul E. McKenney
@ 2014-09-03 14:10           ` Christoph Lameter
  2014-09-03 14:43             ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2014-09-03 14:10 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Tue, 2 Sep 2014, Paul E. McKenney wrote:

> On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
> > On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> >
> > > Yep, these two have been on my "when I am feeling insanely gutsy" list
> > > for quite some time.
> > >
> > > But I have to ask...  On x86, is a pair of mfence instructions really
> > > cheaper than an atomic increment?
> >
> > Not sure why you would need an mfence instruction?
>
> Because otherwise RCU can break.  As soon as the grace-period machinery
> sees that the value of this variable is even, it assumes a quiescent
> state.  If there are no memory barriers, the non-quiescent code might
> not have completed executing, and your kernel's actuarial statistics
> become sub-optimal.

Synchronization using per cpu variables is bound to be problematic since
they are simply not made for that. The per cpu variable usually can change
without notice to the other cpu since typically per cpu processing is
ongoing. The improved performance of per cpu instructions is
possible only because we exploit the fact that there is no need for
synchronization.

Kernel statistics *are* suboptimal for that very reason because they
typically sum up individual counters from multiple processors without
regard to complete accuracy. The manipulation of the VM counters is very
low overhead due to the lack of concern for synchronization. This is a
tradeoff vs. performance. We actually can tune the fuzziness of
statistics in the VM which allows us to control the overhead generated by
the need for more or less accurate statistics.

Memory barriers ensure that the code has completed executing? I think what
is meant is that they ensure that all modifications to cachelines before the
change of state are visible and the other processor does not have stale
cachelines around?

If the state variable is odd how does the other processor see a state
change to even before processing is complete if the state is updated only
at the end of processing?



* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-03 14:10           ` Christoph Lameter
@ 2014-09-03 14:43             ` Paul E. McKenney
  2014-09-03 15:51               ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2014-09-03 14:43 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Frederic Weisbecker, linux-kernel

On Wed, Sep 03, 2014 at 09:10:24AM -0500, Christoph Lameter wrote:
> On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> 
> > On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
> > > On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> > >
> > > > Yep, these two have been on my "when I am feeling insanely gutsy" list
> > > > for quite some time.
> > > >
> > > > But I have to ask...  On x86, is a pair of mfence instructions really
> > > > cheaper than an atomic increment?
> > >
> > > Not sure why you would need an mfence instruction?
> >
> > Because otherwise RCU can break.  As soon as the grace-period machinery
> > sees that the value of this variable is even, it assumes a quiescent
> > state.  If there are no memory barriers, the non-quiescent code might
> > not have completed executing, and your kernel's actuarial statistics
> > become sub-optimal.
> 
> Synchronization using per cpu variables is bound to be problematic since
> they are simply not made for that. The per cpu variable usually can change
> without notice to the other cpu since typically per cpu processing is
> ongoing. The improved performance of per cpu instructions is
> possible only because we exploit the fact that there is no need for
> synchronization.

Christoph, per-CPU variables are memory.  If I do the correct operations
on them, they work just like any other memory.  And yes, I typically
cannot use the per-CPU operations if I need coherent results visible to
all CPUs (but see your statistical-counters example below).  This is of
course exactly why I use atomic operations and memory barriers on
the dynticks counters.

You would prefer that I instead allocated an NR_CPUS-sized array?

> Kernel statistics *are* suboptimal for that very reason because they
> typically sum up individual counters from multiple processors without
> regard to complete accuracy. The manipulation of the VM counters is very
> low overhead due to the lack of concern for synchronization. This is a
> tradeoff vs. performance. We actually can tune the fuzziness of
> statistics in the VM which allows us to control the overhead generated by
> the need for more or less accurate statistics.

Yes, statistics is one of the cross-CPU uses of per-CPU variables where
you can get away without tight synchronization.

> Memory barriers ensure that the code has completed executing? I think what
> is meant is that they ensure that all modifications to cachelines before the
> change of state are visible and the other processor does not have stale
> cachelines around?

No, memory barriers simply enforce ordering, and ordering is all
that RCU's dyntick-idle code relies on.  In other words, if the RCU
grace-period kthread sees a value indicating that a CPU is idle, that
kthread needs assurance that all the memory operations that took place
before that CPU went idle have completed.  All that is required for this
is ordering:  I saw that the ->dynticks counter had an even value, and
I know that it was preceded by a memory barrier; therefore, all subsequent
code after a memory barrier will see the effects of all pre-idle code.

> If the state variable is odd how does the other processor see a state
> change to even before processing is complete if the state is updated only
> at the end of processing?

I am having some difficulty parsing that question.

However, suppose that the grace-period kthread sees ->dynticks with an
odd value, say three.  Suppose that this kthread later sees another
odd value, say eleven.  Then because of the atomic operations and memory
barriers, the kthread can be sure that the CPU corresponding to that
->dynticks has passed through an idle state since the time it saw the
value three, and therefore that that CPU has passed through a quiescent
state since the start of the grace period.

Similarly, if the grace-period kthread sees ->dynticks with an even value,
it knows that after a subsequent memory barrier, all the pre-idle effects
will be visible, as required.

As a diagram:

	CPU 0				CPU 1

	/* Access some RCU-protected struct. */
	smp_mb__before_atomic()
	atomic_inc()->even value
	/* Now idle. */
					atomic_add_return(0, ...)-> odd value
					/* implied memory barrier. */
					/* post-GP changes won't interfere */
					/*   with pre-idle accesses. */

In other words, if CPU 0 accessed some RCU-protected memory before going
idle, that memory is guaranteed not to be freed until after those pre-idle
accesses have completed.

Of course, it also needs to work the other way around:

	CPU 0				CPU 1

					/* Remove some RCU-protected struct. */
					/* implied memory barrier. */
					atomic_add_return(0, ...)-> even value
	/* Was idle. */
	atomic_inc()->odd value
	smp_mb__after_atomic()
	/* post-idle accesses. */
	/* Will see pre-GP changes. */
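
The same guarantee written as code, a schematic sketch of the first
diagram (CPU 0 enters idle, CPU 1 is the grace-period kthread):

	static void cpu0_enter_idle(struct rcu_dynticks *rdtp)
	{
		/* ... accesses to an RCU-protected structure ... */
		smp_mb__before_atomic();
		atomic_inc(&rdtp->dynticks);	/* now even: idle */
	}

	static bool cpu1_saw_idle(struct rcu_dynticks *rdtp)
	{
		/* Value-returning atomic: full barriers on both sides,
		   so work done after a "true" return cannot interfere
		   with CPU 0's pre-idle accesses. */
		return !(atomic_add_return(0, &rdtp->dynticks) & 0x1);
	}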

								Thanx, Paul



* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-03 14:43             ` Paul E. McKenney
@ 2014-09-03 15:51               ` Christoph Lameter
  2014-09-03 17:14                 ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2014-09-03 15:51 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Wed, 3 Sep 2014, Paul E. McKenney wrote:

> You would prefer that I instead allocated an NR_CPUS-sized array?

Well, a shared data structure would be cleaner in general but there are
certainly other approaches.

But let's focus on the dynticks_idle case we are discussing here rather
than tackle the more difficult other atomics. What is checked in the loop
over the remote cpus is the dynticks_idle value plus
dynticks_idle_jiffies. So it seems that memory ordering is only used to
ensure that the jiffies are seen correctly.

In that case both the dynticks_idle and dynticks_idle_jiffies could be
placed in one 64 bit value. If this is stored and retrieved as one then
there is no issue with ordering anymore and the barriers would no longer
be needed.
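
A hypothetical sketch of that packing, with illustrative names and
field widths, assuming an architecture where an aligned 64-bit store
and load are atomic:

	union demo_idle_state {
		u64 packed;
		struct {
			u32 count;	/* even = idle, odd = non-idle */
			u32 jiffies32;	/* low bits of jiffies at transition */
		};
	};

	static DEFINE_PER_CPU(union demo_idle_state, demo_idle);

	static void demo_idle_transition(void)	/* owning CPU only */
	{
		union demo_idle_state s;

		s.packed = __this_cpu_read(demo_idle.packed);
		s.count++;
		s.jiffies32 = (u32)jiffies;
		/* One naturally aligned 64-bit store: a remote reader sees
		   both fields change together, or neither. */
		__this_cpu_write(demo_idle.packed, s.packed);
	}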


* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-03 15:51               ` Christoph Lameter
@ 2014-09-03 17:14                 ` Paul E. McKenney
  2014-09-03 17:43                   ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2014-09-03 17:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Frederic Weisbecker, linux-kernel

On Wed, Sep 03, 2014 at 10:51:13AM -0500, Christoph Lameter wrote:
> On Wed, 3 Sep 2014, Paul E. McKenney wrote:
> 
> > You would prefer that I instead allocated an NR_CPUS-sized array?
> 
> Well, a shared data structure would be cleaner in general but there are
> certainly other approaches.

Per-CPU variables -are- a shared data structure.

> But let's focus on the dynticks_idle case we are discussing here rather
> than tackle the more difficult other atomics. What is checked in the loop
> over the remote cpus is the dynticks_idle value plus
> dynticks_idle_jiffies. So it seems that memory ordering is only used to
> ensure that the jiffies are seen correctly.
> 
> In that case both the dynticks_idle and dynticks_idle_jiffies could be
> placed in one 64 bit value. If this is stored and retrieved as one then
> there is no issue with ordering anymore and the barriers would no longer
> be needed.

If there was an upper bound on the propagation of values through a system,
I could buy this.

But Mike Galbraith checked the overhead of ->dynticks_idle and found
it to be too small to measure.  So it doesn't seem to be a problem worth
extraordinary efforts, especially given that many systems can avoid
it simply by leaving CONFIG_NO_HZ_FULL_SYSIDLE=n.

							Thanx, Paul



* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-03 17:14                 ` Paul E. McKenney
@ 2014-09-03 17:43                   ` Christoph Lameter
  2014-09-03 18:19                     ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2014-09-03 17:43 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Wed, 3 Sep 2014, Paul E. McKenney wrote:

> > Well, a shared data structure would be cleaner in general but there are
> > certainly other approaches.
>
> Per-CPU variables -are- a shared data structure.

No, the intent is for them to be for the particular cpu, and therefore
there is limited support for sharing. It's not a shared data structure in
the classic sense.

The code in the rcu subsystem operates like other percpu code: There is a
modification of local variables by the current processor and other
processors inspect the state once in a while. The other percpu code does
not need atomics and barriers. RCU for some reason (that is not clear to
me) does.

> > But let's focus on the dynticks_idle case we are discussing here rather
> > than tackle the more difficult other atomics. What is checked in the loop
> > over the remote cpus is the dynticks_idle value plus
> > dynticks_idle_jiffies. So it seems that memory ordering is only used to
> > ensure that the jiffies are seen correctly.
> >
> > In that case both the dynticks_idle and dynticks_idle_jiffies could be
> > placed in one 64 bit value. If this is stored and retrieved as one then
> > there is no issue with ordering anymore and the barriers would no longer
> > be needed.
>
> If there was an upper bound on the propagation of values through a system,
> I could buy this.

What is different in propagation speeds? The atomic read in the function
that checks for the quiescent period having passed is a regular read
anyway. The atomic_inc makes the cacheline propagate faster through the
system? I understand that it evicts the cachelines from other processors
caches (containing other percpu data by the way). That is the desired
effect?

> But Mike Galbraith checked the overhead of ->dynticks_idle and found
> it to be too small to measure.  So it doesn't seem to be a problem worth
> extraordinary efforts, especially given that many systems can avoid
> it simply by leaving CONFIG_NO_HZ_FULL_SYSIDLE=n.

The code looks fragile and bound to have issues in the future given the
barriers/atomics etc. Its going to be cleaner without that.

And we are right now focusing on the simplest case. The atomics scheme is
used multiple times in the RCU subsystem. There is more weird looking code
there like atomic_add using zero etc.






* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-03 17:43                   ` Christoph Lameter
@ 2014-09-03 18:19                     ` Paul E. McKenney
  2014-09-04 15:04                       ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2014-09-03 18:19 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Frederic Weisbecker, linux-kernel

On Wed, Sep 03, 2014 at 12:43:15PM -0500, Christoph Lameter wrote:
> On Wed, 3 Sep 2014, Paul E. McKenney wrote:
> 
> > > Well, a shared data structure would be cleaner in general but there are
> > > certainly other approaches.
> >
> > Per-CPU variables -are- a shared data structure.
> 
> No, the intent is for them to be for the particular cpu, and therefore
> there is limited support for sharing. It's not a shared data structure in
> the classic sense.

It really is shared.  Normal memory at various addresses and all that.

So what is your real concern here, anyway?

> The code in the rcu subsystem operates like other percpu code: There is a
> modification of local variables by the current processor and other
> processors inspect the state once in a while. The other percpu code does
> not need atomics and barriers. RCU for some reason (that is not clear to
> me) does.

Like I have said several times before in multiple email threads, including
this one, RCU needs to reliably detect quiescent states.  To do this, it
needs do know that the accesses prior to the quiescent state will reliably
be seen by all as having happened before the quiescent state started.
Similarly, it needs to know that the accesses after the quiescent state
will reliably be seen by all as having happened after the quiescent
state ended.  The memory barriers are required to enforce this ordering.

As noted earlier, in theory, the atomic operations could be nonatomic,
but on x86 this would require adding explicit memory-barrier instructions.
Not clear to me which would be faster.
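
For concreteness, the in-theory nonatomic form would look something
like this (sketch only, using the plain-long dynticks_idle from the
patch at the top of the thread):

	static void demo_sysidle_flip(void)
	{
		smp_mb();	/* explicit fence replacing the LOCKed RMW */
		__this_cpu_inc(rcu_dynticks.dynticks_idle);
		smp_mb();	/* order the flip before later accesses */
	}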

> > > But let's focus on the dynticks_idle case we are discussing here rather
> > > than tackle the more difficult other atomics. What is checked in the loop
> > > over the remote cpus is the dynticks_idle value plus
> > > dynticks_idle_jiffies. So it seems that memory ordering is only used to
> > > ensure that the jiffies are seen correctly.
> > >
> > > In that case both the dynticks_idle and dynticks_idle_jiffies could be
> > > placed in one 64 bit value. If this is stored and retrieved as one then
> > > there is no issue with ordering anymore and the barriers would no longer
> > > be needed.
> >
> > If there was an upper bound on the propagation of values through a system,
> > I could buy this.
> 
> What is different in propagation speeds? The atomic read in the function
> that checks for the quiescent period having passed is a regular read
> anyway. The atomic_inc makes the cacheline propagate faster through the
> system? I understand that it evicts the cachelines from other processors
> caches (containing other percpu data by the way). That is the desired
> effect?

The smp_mb__{before,after}_atomic() guarantee the ordering.

> > But Mike Galbraith checked the overhead of ->dynticks_idle and found
> > it to be too small to measure.  So it doesn't seem to be a problem worth
> > extraordinary efforts, especially given that many systems can avoid
> > it simply by leaving CONFIG_NO_HZ_FULL_SYSIDLE=n.
> 
> The code looks fragile and bound to have issues in the future given the
> barriers/atomics etc. Its going to be cleaner without that.

What exactly looks fragile about it, and exactly what issues do you
anticipate?

> And we are right now focusing on the simplest case. The atomics scheme is
> used multiple times in the RCU subsystem. There is more weird looking code
> there like atomic_add using zero etc.

The atomic_add_return(0,...) reads the value out, forcing full ordering.
Again, in theory, this could be a volatile read with explicit memory-barrier
instructions on either side, but it is not clear which wins.  (Keep in
mind that almost all of the atomic_add_return(0,...) calls for a given
dynticks counter are executed from a single kthread.)
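
That is, the two candidate forms for the snapshot are roughly as
follows (a sketch; which is faster is exactly the open question):

	static int snap_today(struct rcu_dynticks *rdtp)
	{
		/* Full memory barrier on each side of the load. */
		return atomic_add_return(0, &rdtp->dynticks);
	}

	static int snap_in_theory(struct rcu_dynticks *rdtp)
	{
		int snap;

		smp_mb();				/* explicit barrier */
		snap = atomic_read(&rdtp->dynticks);	/* volatile read */
		smp_mb();				/* explicit barrier */
		return snap;
	}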

If systems continue to add CPUs like they have over the past decade or
so, I expect that you will be seeing more code like RCU's, not less.

							Thanx, Paul



* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-03 18:19                     ` Paul E. McKenney
@ 2014-09-04 15:04                       ` Christoph Lameter
  2014-09-04 16:16                         ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2014-09-04 15:04 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel


On Wed, 3 Sep 2014, Paul E. McKenney wrote:

> As noted earlier, in theory, the atomic operations could be nonatomic,

Well as demonstrated by the patch earlier: The atomic operations are only
used on the local cpu. There is no synchronization in that sense needed
between processors because there is never a remote atomic operation.

> > The code looks fragile and bound to have issues in the future given the
> > barriers/atomics etc. Its going to be cleaner without that.
>
> What exactly looks fragile about it, and exactly what issues do you
> anticipate?

I am concerned about the creation of unnecessary synchronization issues. In
this case we have already discovered that the atomic operations on per
cpu variables are only used to modify the contents from the local cpu.

This means at minimum we can give up on the use of atomics and keep the
barriers to enforce visibility.

> > And we are right now focusing on the simplest case. The atomics scheme is
> > used multiple times in the RCU subsystem. There is more weird looking code
> > there like atomic_add using zero etc.
>
> The atomic_add_return(0,...) reads the value out, forcing full ordering.
> Again, in theory, this could be a volatile read with explicit memory-barrier
> instructions on either side, but it is not clear which wins.  (Keep in
> mind that almost all of the atomic_add_return(0,...) calls for a given
> dynticks counter are executed from a single kthread.)
>
> If systems continue to add CPUs like they have over the past decade or
> so, I expect that you will be seeing more code like RCU's, not less.

We have other code like this in multiple subsystems, but it does not have
the barrier issues: per cpu variables are always updated without the use
of atomics, and the inspection of the per cpu state from remote cpus works
just fine without them.

I'd like to simplify this as much as possible and make it consistent
throughout the kernel.





* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-04 15:04                       ` Christoph Lameter
@ 2014-09-04 16:16                         ` Paul E. McKenney
  2014-09-04 18:19                           ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2014-09-04 16:16 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Frederic Weisbecker, linux-kernel

On Thu, Sep 04, 2014 at 10:04:17AM -0500, Christoph Lameter wrote:
> 
> On Wed, 3 Sep 2014, Paul E. McKenney wrote:
> 
> > As noted earlier, in theory, the atomic operations could be nonatomic,
> 
> Well as demonstrated by the patch earlier: The atomic operations are only
> used on the local cpu. There is no synchronization in that sense needed
> between processors because there is never a remote atomic operation.

Easy to say!  ;-)

> > > The code looks fragile and bound to have issues in the future given the
> > > barriers/atomics etc. Its going to be cleaner without that.
> >
> > What exactly looks fragile about it, and exactly what issues do you
> > anticipate?
> 
> I am concerned about the creation of unnecessary synchronization issues. In
> this case we have already discovered that the atomic operations on per
> cpu variables are only used to modify the contents from the local cpu.
> 
> This means at minimum we can give up on the use of atomics and keep the
> barriers to enforce visibility.

Sounds like a desire for a potential optimization rather than any
sort of fragility.  And in this case, it is not clear that your desire
of replacing a value-returning atomic operation with a normal memory
reference and a pair of memory barriers actually makes anything go faster.

So in short, you don't see the potential for this use case actually
breaking anything, correct?

Besides RCU is not the only place where atomics are used on per-CPU
variables.  For one thing, there are a number of per-CPU spinlocks in use
in various places throughout the kernel.  For another thing, there is also
a large number of per-CPU structures (not pointers to structures, actual
structures), and I bet that a fair number of these feature cross-CPU
writes and cross-CPU atomics.  RCU's rcu_data structures certainly do.

> > > And we are right now focusing on the simplest case. The atomics scheme is
> > > used multiple times in the RCU subsystem. There is more weird looking code
> > > there like atomic_add using zero etc.
> >
> > The atomic_add_return(0,...) reads the value out, forcing full ordering.
> > Again, in theory, this could be a volatile read with explicit memory-barrier
> > instructions on either side, but it is not clear which wins.  (Keep in
> > mind that almost all of the atomic_add_return(0,...) calls for a given
> > dynticks counter are executed from a single kthread.)
> >
> > If systems continue to add CPUs like they have over the past decade or
> > so, I expect that you will be seeing more code like RCU's, not less.
> 
> We have other code like this in multiple subsystems, but it does not have
> the barrier issues: per cpu variables are always updated without the use
> of atomics, and the inspection of the per cpu state from remote cpus works
> just fine without them.

Including the per-CPU spinlocks?  That seems a bit unlikely.  And again,
I expect that a fair number of the per-CPU structures involve cross-CPU
synchronization.

> I'd like to simplify this as much as possible and make it consistent
> throughout the kernel.

It already is consistent, just not in the manner that you want.  ;-)

But -why- do you want these restrictions?  How does it help anything?

							Thanx, Paul



* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-04 16:16                         ` Paul E. McKenney
@ 2014-09-04 18:19                           ` Christoph Lameter
  2014-09-04 19:32                             ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2014-09-04 18:19 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Thu, 4 Sep 2014, Paul E. McKenney wrote:

> So in short, you don't see the potential for this use case actually
> breaking anything, correct?

In general it's a performance impact, but depending on how this_cpu_ops may
be implemented in a particular platform there may also be correctness
issues since the assumption there is that no remote writes occur.

There is a slight issue in the RCU code. It uses DEFINE_PER_CPU for
per cpu data which is used for true per cpu data where the
cachelines are not evicted. False aliasing with RCU structures that are
remotely handled can cause issues for code that expects the per cpu data
to be uncontended. I think it would be better to go to

DEFINE_PER_CPU_SHARED_ALIGNED

for your definitions in particular since there are still code pieces where
we are not sure if there are remote write accesses or not. This will give
you separate cachelines so that the false aliasing effect does not
occur.

> Besides RCU is not the only place where atomics are used on per-CPU
> variables.  For one thing, there are a number of per-CPU spinlocks in use
> in various places throughout the kernel.  For another thing, there is also
> a large number of per-CPU structures (not pointers to structures, actual
> structures), and I bet that a fair number of these feature cross-CPU
> writes and cross-CPU atomics.  RCU's rcu_data structures certainly do.

Would be interested to see where that occurs.

> > the barrier issues, per cpu variables are updated always without the use
> > of atomics and the inspection of the per cpu state from remote cpus works
> > just fine also without them.
>
> Including the per-CPU spinlocks?  That seems a bit unlikely.  And again,
> I expect that a fair number of the per-CPU structures involve cross-CPU
> synchronization.

Where are those per cpu spinlocks? Cross cpu synchronization can be done
in a number of ways that often allow avoiding remote writes to percpu
areas.


> It already is consistent, just not in the manner that you want.  ;-)
>
> But -why- do you want these restrictions?  How does it help anything?

1. It allows potentially faster operations that rely on the
assumption that no remote writes occur. The design of deterministic low
latency code often needs some assurances that another cpu is not simply
kicking the cacheline out which will then require off chip memory access
and remote cacheline eviction once the cacheline is touched again.

2. The use of atomic without a rationale is something that I frown upon
and it seems very likely that we have such a case here. People make
assumptions that the use of atomic has some reason, like a remote access
or contention, which is not occurring here.

3. this_cpu operations create instructions with reduced latency due to the
lack of lock prefix. Remote operations at the same time could create
inconsistent results.

See also

linux/Documentation/this_cpu_ops.txt



* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-04 18:19                           ` Christoph Lameter
@ 2014-09-04 19:32                             ` Paul E. McKenney
  2014-09-08 18:20                               ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2014-09-04 19:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Frederic Weisbecker, linux-kernel

On Thu, Sep 04, 2014 at 01:19:29PM -0500, Christoph Lameter wrote:
> On Thu, 4 Sep 2014, Paul E. McKenney wrote:
> 
> > So in short, you don't see the potential for this use case actually
> > breaking anything, correct?
> 
> In general it's a performance impact, but depending on how this_cpu_ops may
> be implemented in a particular platform there may also be correctness
> issues since the assumption there is that no remote writes occur.

This assumption of yours does not seem to be shared by a fair amount
of the code using per-CPU variables.

> There is a slight issue in the RCU code. It uses DEFINE_PER_CPU for
> per cpu data which is used for true per cpu data where the
> cachelines are not evicted. False aliasing with RCU structures that are
> remotely handled can cause issues for code that expects the per cpu data
> to be uncontended. I think it would be better to go to
> 
> DEFINE_PER_CPU_SHARED_ALIGNED
> 
> for your definitions in particular since there are still code pieces where
> we are not sure if there are remote write accesses or not. This will give
> you separate cachelines so that the false aliasing effect does not
> occur.

Fair enough, I have queued a commit that makes this change for the
rcu_data per-CPU structures with your Reported-by.  (See below for patch.)

> > Besides RCU is not the only place where atomics are used on per-CPU
> > variables.  For one thing, there are a number of per-CPU spinlocks in use
> > in various places throughout the kernel.  For another thing, there is also
> > a large number of per-CPU structures (not pointers to structures, actual
> > structures), and I bet that a fair number of these feature cross-CPU
> > writes and cross-CPU atomics.  RCU's rcu_data structures certainly do.
> 
> Would be interested to see where that occurs.

Something like the following will find them for you:

git grep "DEFINE_PER_CPU.*spinlock"
git grep "DEFINE_PER_CPU.*(struct[^*]*$"

> > > the barrier issues: per cpu variables are always updated without the use
> > > of atomics, and the inspection of the per cpu state from remote cpus works
> > > just fine without them.
> >
> > Including the per-CPU spinlocks?  That seems a bit unlikely.  And again,
> > I expect that a fair number of the per-CPU structures involve cross-CPU
> > synchronization.
> 
> Where are those per cpu spinlocks? Cross cpu synchronization can be done
> in a number of ways that often allow avoiding remote writes to percpu
> areas.

The lglocks use should be of particular interest to you.  See above to
find the others.

> > It already is consistent, just not in the manner that you want.  ;-)
> >
> > But -why- do you want these restrictions?  How does it help anything?
> 
> 1. It allows potentially faster operations that rely on the
> assumption that no remote writes occur. The design of deterministic low
> latency code often needs some assurances that another cpu is not simply
> kicking the cacheline out which will then require off chip memory access
> and remote cacheline eviction once the cacheline is touched again.

OK, DEFINE_PER_CPU_SHARED_ALIGNED() should avoid the false sharing.

> 2. The use of atomic without a rationale is something that I frown upon
> and it seems very likely that we have such a case here. People make
> assumptions that the use of atomic has some reason, like a remote access
> or contention, which is not occurring here.

Understood, but no hasty changes to that part of RCU.

> 3. this_cpu operations create instructions with reduced latency due to the
> lack of lock prefix. Remote operations at the same time could create
> inconsistent results.
> 
> See also
> 
> linux/Documentation/this_cpu_ops.txt

Yep.  If you use atomic operations, you need to be very careful about
mixing in non-atomic operations, which in this case includes the
per-CPU atomic operations.  Normally the atomic_t and atomic_long_t
types make it difficult to mess up.  Of course, you can work around
this type checking, and you can also use xchg() and cmpxchg(), which
don't provide this type-checking misuse-avoidance help.

Just as it has always been.
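
As a toy illustration of that type-checking help, with a hypothetical
counter:

	static atomic_t demo_count = ATOMIC_INIT(0);

	static void demo(void)
	{
		atomic_inc(&demo_count);	/* fine */
		/*
		 * demo_count++; would not compile: atomic_t is opaque,
		 * so a stray plain read-modify-write cannot slip in
		 * unnoticed.
		 */
	}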

							Thanx, Paul

------------------------------------------------------------------------

rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data

The rcu_data per-CPU variable has a number of fields that are atomically
manipulated, potentially by any CPU.  This situation can result in false
sharing with per-CPU variables that have the misfortune of being allocated
adjacent to rcu_data in memory.  This commit therefore changes the
DEFINE_PER_CPU() to DEFINE_PER_CPU_SHARED_ALIGNED() in order to avoid
this false sharing.

Reported-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1b7eba915dbe..e4c6d604694e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -105,7 +105,7 @@ struct rcu_state sname##_state = { \
 	.name = RCU_STATE_NAME(sname), \
 	.abbr = sabbr, \
 }; \
-DEFINE_PER_CPU(struct rcu_data, sname##_data)
+DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, sname##_data)
 
 RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
 RCU_STATE_INITIALIZER(rcu_bh, 'b', call_rcu_bh);



* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-04 19:32                             ` Paul E. McKenney
@ 2014-09-08 18:20                               ` Christoph Lameter
  2014-09-10 22:18                                 ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2014-09-08 18:20 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel


On Thu, 4 Sep 2014, Paul E. McKenney wrote:

> The lglocks use should be of particular interest to you.  See above to
> find the others.

Well, the others seem to be in drivers.

> Understood, but no hasty changes to that part of RCU.

At our age I think we have learned to be very deliberate and slow.

> rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data

Reviewed-by: Christoph Lameter <cl@linux.com>


* Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
  2014-09-08 18:20                               ` Christoph Lameter
@ 2014-09-10 22:18                                 ` Paul E. McKenney
  0 siblings, 0 replies; 18+ messages in thread
From: Paul E. McKenney @ 2014-09-10 22:18 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Frederic Weisbecker, linux-kernel

On Mon, Sep 08, 2014 at 01:20:30PM -0500, Christoph Lameter wrote:
> 
> On Thu, 4 Sep 2014, Paul E. McKenney wrote:
> 
> > The lglocks use should be of particular interest to you.  See above to
> > find the others.
> 
> Well, the others seem to be in drivers.

OK...

> > Understood, but no hasty changes to that part of RCU.
> 
> At our age I think we have learned to be very deliberate and slow.

And rcutorture needs some upgrades for this case.

> > rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data
> 
> Reviewed-by: Christoph Lameter <cl@linux.com>

Thank you, recorded.

							Thanx, Paul


