All of lore.kernel.org
* [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
@ 2014-07-08 22:55 Pranith Kumar
  2014-07-09  0:07 ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Pranith Kumar @ 2014-07-08 22:55 UTC (permalink / raw)
  To: Paul E. McKenney, Josh Triplett, open list:READ-COPY UPDATE...

atomic_add_return() invalidates the cache line on other processors, whereas
atomic_read() does not. I don't see why we would need invalidation in this case.
If it is indeed needed, a comment would be helpful for readers. Otherwise, doesn't
using atomic_read() make more sense here? RFC!

Replace atomic_add_return(0, v) with atomic_read(v), as the latter is cheaper.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 kernel/rcu/tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dac6d20..a4a8f5f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -891,7 +891,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 static int dyntick_save_progress_counter(struct rcu_data *rdp,
 					 bool *isidle, unsigned long *maxj)
 {
-	rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
+	rdp->dynticks_snap = atomic_read(&rdp->dynticks->dynticks);
 	rcu_sysidle_check_cpu(rdp, isidle, maxj);
 	if ((rdp->dynticks_snap & 0x1) == 0) {
 		trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("dti"));
@@ -920,7 +920,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
 	int *rcrmp;
 	unsigned int snap;
 
-	curr = (unsigned int)atomic_add_return(0, &rdp->dynticks->dynticks);
+	curr = (unsigned int)atomic_read(&rdp->dynticks->dynticks);
 	snap = (unsigned int)rdp->dynticks_snap;
 
 	/*
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
  2014-07-08 22:55 [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v) Pranith Kumar
@ 2014-07-09  0:07 ` Paul E. McKenney
  2014-07-09  4:00   ` Pranith Kumar
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2014-07-09  0:07 UTC (permalink / raw)
  To: Pranith Kumar; +Cc: Josh Triplett, open list:READ-COPY UPDATE...

On Tue, Jul 08, 2014 at 06:55:45PM -0400, Pranith Kumar wrote:
> atomic_add_return() invalidates the cache line on other processors, whereas
> atomic_read() does not. I don't see why we would need invalidation in this case.
> If it is indeed needed, a comment would be helpful for readers. Otherwise, doesn't
> using atomic_read() make more sense here? RFC!
> 
> Replace atomic_add_return(0, v) with atomic_read(v), as the latter is cheaper.
> 
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>

This will break RCU -- the full memory barriers implied both before
and after atomic_add_return() are needed in order for RCU to be able to
avoid death due to memory reordering.

That said, I have considered replacing the atomic_add_return() with:

	smp_mb();
	... = atomic_read(...);
	smp_mb();

However, this is a very ticklish change, and would need serious thought
and even more serious testing.

							Thanx, Paul

> ---
>  kernel/rcu/tree.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index dac6d20..a4a8f5f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -891,7 +891,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
>  static int dyntick_save_progress_counter(struct rcu_data *rdp,
>  					 bool *isidle, unsigned long *maxj)
>  {
> -	rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
> +	rdp->dynticks_snap = atomic_read(&rdp->dynticks->dynticks);
>  	rcu_sysidle_check_cpu(rdp, isidle, maxj);
>  	if ((rdp->dynticks_snap & 0x1) == 0) {
>  		trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("dti"));
> @@ -920,7 +920,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
>  	int *rcrmp;
>  	unsigned int snap;
> 
> -	curr = (unsigned int)atomic_add_return(0, &rdp->dynticks->dynticks);
> +	curr = (unsigned int)atomic_read(&rdp->dynticks->dynticks);
>  	snap = (unsigned int)rdp->dynticks_snap;
> 
>  	/*
> -- 
> 1.9.1
> 



* Re: [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
  2014-07-09  0:07 ` Paul E. McKenney
@ 2014-07-09  4:00   ` Pranith Kumar
  2014-07-09 19:56     ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Pranith Kumar @ 2014-07-09  4:00 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Josh Triplett, open list:READ-COPY UPDATE...

On Tue, Jul 8, 2014 at 8:07 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Tue, Jul 08, 2014 at 06:55:45PM -0400, Pranith Kumar wrote:
>> atomic_add_return() invalidates the cache line on other processors, whereas
>> atomic_read() does not. I don't see why we would need invalidation in this case.
>> If it is indeed needed, a comment would be helpful for readers. Otherwise, doesn't
>> using atomic_read() make more sense here? RFC!
>>
>> Replace atomic_add_return(0, v) with atomic_read(v), as the latter is cheaper.
>>
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>
> This will break RCU -- the full memory barriers implied both before
> and after atomic_add_return() are needed in order for RCU to be able to
> avoid death due to memory reordering.
>
> That said, I have considered replacing the atomic_add_return() with:
>
>         smp_mb();
>         ... = atomic_read(...);
>         smp_mb();
>
> However, this is a very ticklish change, and would need serious thought
> and even more serious testing.
>

Thank you for looking at the RFC. I tried to understand the code more
deeply and found that the ordering needed here is actually on
dynticks_snap. The ordering currently provided (by way of
atomic_add_return()) is on rdp->dynticks->dynticks, which I think is
not right.

Looking at the history of the code led me to rev. 23b5c8fa01b723, which
makes me think that the above statement is true. I think providing
ordering guarantees on dynticks_snap should be enough.

I have added an updated patch below. However, it is really hard (and
error-prone) to convince oneself that this is right, so I will not
pursue this further if you think it is wrong. You definitely know
better than me :)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1b70cb6..bbccd0a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -891,7 +891,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 static int dyntick_save_progress_counter(struct rcu_data *rdp,
                                         bool *isidle, unsigned long *maxj)
 {
-       rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
+       smp_store_release(&rdp->dynticks_snap,
+                         atomic_read(&rdp->dynticks->dynticks));
        rcu_sysidle_check_cpu(rdp, isidle, maxj);
        if ((rdp->dynticks_snap & 0x1) == 0) {
                trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("dti"));
@@ -920,8 +920,8 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
        int *rcrmp;
        unsigned int snap;

-       curr = (unsigned int)atomic_add_return(0, &rdp->dynticks->dynticks);
-       snap = (unsigned int)rdp->dynticks_snap;
+       curr = (unsigned int)atomic_read(&rdp->dynticks->dynticks);
+       snap = (unsigned int)smp_load_acquire(&rdp->dynticks_snap);

        /*
         * If the CPU passed through or entered a dynticks idle phase with


-- 
Pranith


* Re: [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
  2014-07-09  4:00   ` Pranith Kumar
@ 2014-07-09 19:56     ` Paul E. McKenney
  2014-07-11  1:17       ` Pranith Kumar
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2014-07-09 19:56 UTC (permalink / raw)
  To: Pranith Kumar; +Cc: Josh Triplett, open list:READ-COPY UPDATE...

On Wed, Jul 09, 2014 at 12:00:10AM -0400, Pranith Kumar wrote:
> On Tue, Jul 8, 2014 at 8:07 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Tue, Jul 08, 2014 at 06:55:45PM -0400, Pranith Kumar wrote:
> >> atomic_add_return() invalidates the cache line on other processors, whereas
> >> atomic_read() does not. I don't see why we would need invalidation in this case.
> >> If it is indeed needed, a comment would be helpful for readers. Otherwise, doesn't
> >> using atomic_read() make more sense here? RFC!
> >>
> >> Replace atomic_add_return(0, v) with atomic_read(v), as the latter is cheaper.
> >>
> >> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> >
> > This will break RCU -- the full memory barriers implied both before
> > and after atomic_add_return() are needed in order for RCU to be able to
> > avoid death due to memory reordering.
> >
> > That said, I have considered replacing the atomic_add_return() with:
> >
> >         smp_mb();
> >         ... = atomic_read(...);
> >         smp_mb();
> >
> > However, this is a very ticklish change, and would need serious thought
> > and even more serious testing.
> >
> 
> Thank you for looking at the RFC. I tried to understand the code more
> deeply and found that the ordering needed here is actually on
> dynticks_snap. The ordering currently provided (by way of
> atomic_add_return()) is on rdp->dynticks->dynticks, which I think is
> not right.
> 
> Looking at the history of the code led me to rev. 23b5c8fa01b723, which
> makes me think that the above statement is true. I think providing
> ordering guarantees on dynticks_snap should be enough.
> 
> I have added an updated patch below. However, it is really hard (and
> error-prone) to convince oneself that this is right, so I will not
> pursue this further if you think it is wrong. You definitely know
> better than me :)

OK, so ->dynticks_snap is accessed by only one task, namely the
corresponding RCU grace-period kthread.  So it can be accessed without
any atomic instructions or memory barriers, since all accesses to it are
single-threaded.  On the other hand, ->dynticks is written by one CPU
and potentially accessed from any CPU.  Therefore, accesses to it must
take concurrency into account.  Especially given that any confusion can
fool RCU into thinking that a CPU is idle when it really is not, which
could result in too-short grace periods, which could in turn result in
random memory corruption.

There are a huge number of ways to get this code wrong and very few ways
to get it right.

I cannot accept this patch.

							Thanx, Paul

> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 1b70cb6..bbccd0a 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -891,7 +891,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
>  static int dyntick_save_progress_counter(struct rcu_data *rdp,
>                                          bool *isidle, unsigned long *maxj)
>  {
> -       rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
> +       smp_store_release(&rdp->dynticks_snap,
> +                         atomic_read(&rdp->dynticks->dynticks));
>         rcu_sysidle_check_cpu(rdp, isidle, maxj);
>         if ((rdp->dynticks_snap & 0x1) == 0) {
>                 trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("dti"));
> @@ -920,8 +920,8 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
>         int *rcrmp;
>         unsigned int snap;
> 
> -       curr = (unsigned int)atomic_add_return(0, &rdp->dynticks->dynticks);
> -       snap = (unsigned int)rdp->dynticks_snap;
> +       curr = (unsigned int)atomic_read(&rdp->dynticks->dynticks);
> +       snap = (unsigned int)smp_load_acquire(&rdp->dynticks_snap);
> 
>         /*
>          * If the CPU passed through or entered a dynticks idle phase with
> 
> 
> -- 
> Pranith
> 



* Re: [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
  2014-07-09 19:56     ` Paul E. McKenney
@ 2014-07-11  1:17       ` Pranith Kumar
  2014-07-11  9:43         ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Pranith Kumar @ 2014-07-11  1:17 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Josh Triplett, open list:READ-COPY UPDATE...

On Wed, Jul 9, 2014 at 3:56 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
<snip>
> OK, so ->dynticks_snap is accessed by only one task, namely the
> corresponding RCU grace-period kthread.  So it can be accessed without
> any atomic instructions or memory barriers, since all accesses to it are
> single-threaded.  On the other hand, ->dynticks is written by one CPU
> and potentially accessed from any CPU.  Therefore, accesses to it must
> take concurrency into account.  Especially given that any confusion can
> fool RCU into thinking that a CPU is idle when it really is not, which
> could result in too-short grace periods, which could in turn result in
> random memory corruption.
>

Yes, I missed tracing the call chain that accesses dynticks_snap. It
does not need any synchronization or barriers.

Here, since we are reading ->dynticks, doesn't having one barrier
before the read make sense? Like:

smp_mb();
dynticks_snap = atomic_read(...->dynticks);

instead of having two barriers with atomic_add_return()? That is,
why is the second barrier necessary?

On a related note, I see that dynticks is a per-cpu variable _and_
atomic. Is there really a need for both?

Digging into history I see that the change to an atomic variable was
made in the commit 23b5c8fa01b723c70a which says:

<quote>

In addition, the old dyntick-idle synchronization depended on the fact
that grace periods were many milliseconds in duration, so that it could
be assumed that no dyntick-idle CPU could reorder a memory reference
across an entire grace period.  Unfortunately for this design, the
addition of expedited grace periods breaks this assumption, which has
the unfortunate side-effect of requiring atomic operations in the
functions that track dyntick-idle state for RCU.

</quote>

Sorry to ask you about such an old change, but I am not able to see
why we need atomic_t for dynticks here, since per-cpu operations are
guaranteed to be atomic.

It gets twisted pretty fast trying to understand the RCU code. No
wonder people say that rcu is scary black magic :)

-- 
Pranith


* Re: [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
  2014-07-11  1:17       ` Pranith Kumar
@ 2014-07-11  9:43         ` Paul E. McKenney
  2014-07-11 22:32           ` Pranith Kumar
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2014-07-11  9:43 UTC (permalink / raw)
  To: Pranith Kumar; +Cc: Josh Triplett, open list:READ-COPY UPDATE...

On Thu, Jul 10, 2014 at 09:17:33PM -0400, Pranith Kumar wrote:
> On Wed, Jul 9, 2014 at 3:56 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> <snip>
> > OK, so ->dynticks_snap is accessed by only one task, namely the
> > corresponding RCU grace-period kthread.  So it can be accessed without
> > any atomic instructions or memory barriers, since all accesses to it are
> > single-threaded.  On the other hand, ->dynticks is written by one CPU
> > and potentially accessed from any CPU.  Therefore, accesses to it must
> > take concurrency into account.  Especially given that any confusion can
> > fool RCU into thinking that a CPU is idle when it really is not, which
> > could result in too-short grace periods, which could in turn result in
> > random memory corruption.
> 
> Yes, I missed reading the call-chain for accessing dynticks_snap. It
> does not need any synchronization/barriers.
> 
> Here since we are reading ->dynticks, doesn't having one barrier
> before reading make sense? like
> 
> smp_mb();
> dynticks_snap = atomic_read(...->dynticks);
> 
> instead of having two barriers with atomic_add_return()? i.e.,
> why is the second barrier necessary?

I suggest looking at the docbook comment headers for call_rcu() and
synchronize_rcu() and thinking about the memory-barrier guarantees.
Those guarantees are quite strong, and so if you remove any one of a
large number of memory barriers (either explicit or, as in this case,
implicit in some other operation), you will break RCU.

Now, there might well be some redundant memory barriers somewhere in
RCU, but the second memory barrier implied by this atomic_add_return()
most certainly is not one of them.

> On a related note, I see that dynticks is a per-cpu variable _and_
> atomic. Is there such a need for that?
> 
> Digging into history I see that the change to an atomic variable was
> made in the commit 23b5c8fa01b723c70a which says:
> 
> <quote>
> 
> In addition, the old dyntick-idle synchronization depended on the fact
> that grace periods were many milliseconds in duration, so that it could
> be assumed that no dyntick-idle CPU could reorder a memory reference
> across an entire grace period.  Unfortunately for this design, the
> addition of expedited grace periods breaks this assumption, which has
> the unfortunate side-effect of requiring atomic operations in the
> functions that track dyntick-idle state for RCU.
> 
> </quote>
> 
> Sorry to ask you about such an old change. But I am not able to see
> why we need atomic_t for dynticks here since per-cpu operations are
> guaranteed to be atomic.

Per-CPU operations are guaranteed to be atomic?  When one CPU is accessing
another CPU's per-CPU variable, as is the case here?  Can't say that I
believe you.  ;-)

> It gets twisted pretty fast trying to understand the RCU code. No
> wonder people say that rcu is scary black magic :)

Well, let's just say that this isn't one of the parts of the RCU code
that should be randomly hacked.  Lots and lots of ways to get it wrong,
and very few ways to get it right.  And most of the ways of getting it
right are too slow or too non-scalable to be useful.

Speaking of portions of RCU that are more susceptible to local-knowledge
hacking, how are things going on that rcutorture printing fixup?

							Thanx, Paul



* Re: [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
  2014-07-11  9:43         ` Paul E. McKenney
@ 2014-07-11 22:32           ` Pranith Kumar
  2014-07-12 12:08             ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Pranith Kumar @ 2014-07-11 22:32 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Josh Triplett, open list:READ-COPY UPDATE...

On Fri, Jul 11, 2014 at 5:43 AM, Paul E. McKenney  wrote:
> On Thu, Jul 10, 2014 at 09:17:33PM -0400, Pranith Kumar wrote:
>> On Wed, Jul 9, 2014 at 3:56 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> <snip>
>> > OK, so ->dynticks_snap is accessed by only one task, namely the
>> > corresponding RCU grace-period kthread.  So it can be accessed without
>> > any atomic instructions or memory barriers, since all accesses to it are
>> > single-threaded.  On the other hand, ->dynticks is written by one CPU
>> > and potentially accessed from any CPU.  Therefore, accesses to it must
>> > take concurrency into account.  Especially given that any confusion can
>> > fool RCU into thinking that a CPU is idle when it really is not, which
>> > could result in too-short grace periods, which could in turn result in
>> > random memory corruption.
>>
>> Yes, I missed reading the call-chain for accessing dynticks_snap. It
>> does not need any synchronization/barriers.
>>
>> Here since we are reading ->dynticks, doesn't having one barrier
>> before reading make sense? like
>>
>> smp_mb();
>> dynticks_snap = atomic_read(...->dynticks);
>>
>> instead of having two barriers with atomic_add_return()? i.e.,
>> why is the second barrier necessary?
>
> I suggest looking at the docbook comment headers for call_rcu() and
> synchronize_rcu() and thinking about the memory-barrier guarantees.
> Those guarantees are quite strong, and so if you remove any one of a
> large number of memory barriers (either explicit or, as in this case,
> implicit in some other operation), you will break RCU.
>
> Now, there might well be some redundant memory barriers somewhere in
> RCU, but the second memory barrier implied by this atomic_add_return()
> most certainly is not one of them.

One reason I can think of for having barriers on both sides here is
to prevent the read from floating around. But in these two particular
cases that does not seem to be necessary.

Could you please clarify why a barrier after this atomic read is
required? Almost all the barriers are commented with why they are
necessary; it would be good to have that here too.

<snip>
>>
>> Sorry to ask you about such an old change. But I am not able to see
>> why we need atomic_t for dynticks here since per-cpu operations are
>> guaranteed to be atomic.
>
> Per-CPU operations are guaranteed to be atomic?  When one CPU is accessing
> another CPU's per-CPU variable, as is the case here?  Can't say that I
> believe you.  ;-)

this_cpu_ops() are guaranteed to be atomic when operating on local
per-cpu variables. When we operate on other CPUs' per-cpu variables
directly, this does not hold.

dynticks here is a per-cpu variable. I don't understand why one CPU
needs to access another CPU's dynticks variable.

>
>> It gets twisted pretty fast trying to understand the RCU code. No
>> wonder people say that rcu is scary black magic :)
>
> Well, let's just say that this isn't one of the parts of the RCU code
> that should be randomly hacked.  Lots and lots of ways to get it wrong,
> and very few ways to get it right.  And most of the ways of getting it
> right are too slow or too non-scalable to be useful.

I am definitely not trying to hack randomly. Reading the code is very
educational. I tried looking up why this was being done, and it was
not clear from the code or its history. I was thinking of getting hold
of you on IRC, but was not sure whether that was such a good idea.
I'll ask questions instead of sending RFCs from now on.

>
> Speaking of portions of RCU that are more susceptible to local-knowledge
> hacking, how are things going on that rcutorture printing fixup?
>
>                                                         Thanx, Paul
>

It must have reached your inbox by now :)

-- 
Pranith


* Re: [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
  2014-07-11 22:32           ` Pranith Kumar
@ 2014-07-12 12:08             ` Paul E. McKenney
  2014-07-14 13:27               ` Pranith Kumar
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2014-07-12 12:08 UTC (permalink / raw)
  To: Pranith Kumar; +Cc: Josh Triplett, open list:READ-COPY UPDATE...

On Fri, Jul 11, 2014 at 06:32:17PM -0400, Pranith Kumar wrote:
> On Fri, Jul 11, 2014 at 5:43 AM, Paul E. McKenney  wrote:
> > On Thu, Jul 10, 2014 at 09:17:33PM -0400, Pranith Kumar wrote:
> >> On Wed, Jul 9, 2014 at 3:56 PM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> <snip>
> >> > OK, so ->dynticks_snap is accessed by only one task, namely the
> >> > corresponding RCU grace-period kthread.  So it can be accessed without
> >> > any atomic instructions or memory barriers, since all accesses to it are
> >> > single-threaded.  On the other hand, ->dynticks is written by one CPU
> >> > and potentially accessed from any CPU.  Therefore, accesses to it must
> >> > take concurrency into account.  Especially given that any confusion can
> >> > fool RCU into thinking that a CPU is idle when it really is not, which
> >> > could result in too-short grace periods, which could in turn result in
> >> > random memory corruption.
> >>
> >> Yes, I missed reading the call-chain for accessing dynticks_snap. It
> >> does not need any synchronization/barriers.
> >>
> >> Here since we are reading ->dynticks, doesn't having one barrier
> >> before reading make sense? like
> >>
> >> smp_mb();
> >> dynticks_snap = atomic_read(...->dynticks);
> >>
> >> instead of having two barriers with atomic_add_return()? i.e.,
> >> why is the second barrier necessary?
> >
> > I suggest looking at the docbook comment headers for call_rcu() and
> > synchronize_rcu() and thinking about the memory-barrier guarantees.
> > Those guarantees are quite strong, and so if you remove any one of a
> > large number of memory barriers (either explicit or, as in this case,
> > implicit in some other operation), you will break RCU.
> >
> > Now, there might well be some redundant memory barriers somewhere in
> > RCU, but the second memory barrier implied by this atomic_add_return()
> > most certainly is not one of them.
> 
> One reason I could think of having barriers on both sides here is to
> disable the read to float around. But in these two particular cases
> that does not seem to be necessary.
> 
> Could you please clarify why a barrier after this atomic read is
> required? Almost all the barriers are commented about why they are
> necessary. Would be good to have that here too.

They ensure that any RCU read-side critical sections that took place before
the current (or previous) idle/userspace period on the remote CPU in
question are seen as having completed before the completion of the current
grace period.  It also ensures that any RCU read-side critical sections
that extend beyond the end of the current grace period (thus starting
after the current (or previous) idle/userspace period) see any updates
that were carried out before the beginning of the current grace period.

Of course, that is also the purpose of many of RCU's memory barriers,
so this probably doesn't help much.  An alternative explanation is that
it ensures a coherent view of the ->dynticks counter, but I am quite
sure that this helps even less.

So here is the problem we are having:

The dyntick_save_progress_counter() and rcu_implicit_dynticks_qs()
functions are really bad places to start reading the RCU code.  I would
say that starting there is like learning to swim by diving into the deep
end of a swimming pool, but that doesn't really capture it.  Instead,
it is more like learning to swim by diving from the top of this waterfall:

http://blog.pacificnorthwestphotography.com/wp-content/uploads/2011/03/54.jpg

To understand these functions, you will first need to understand how
the rest of RCU works.  These functions are tiny cogs in RCU's energy
efficiency optimization mechanism, which fits into the larger grace-period
detection mechanism.  The purpose of the two atomic operations is to
preserve the memory-ordering guarantees called out in the docbook header
comments for call_rcu() and synchronize_rcu(), and I must confess that
it is not clear to me that you actually read these header comments.
Even so, these two functions interact with lots of other accesses to
implement these guarantees -- so again, it is really really difficult
to understand these two functions in isolation.

Please see the end of this message for my suggested order of learning
the RCU code.  A study plan, if you will.

> <snip>
> >>
> >> Sorry to ask you about such an old change. But I am not able to see
> >> why we need atomic_t for dynticks here since per-cpu operations are
> >> guaranteed to be atomic.
> >
> > Per-CPU operations are guaranteed to be atomic?  When one CPU is accessing
> > another CPU's per-CPU variable, as is the case here?  Can't say that I
> > believe you.  ;-)
> 
> this_cpu_ops() are guaranteed to be atomic when operating on local
> per-cpu variables. When we are operating on other CPU's per-cpu
> variables directly this does not hold.

Exactly!!!

> dynticks here is a per-cpu variable. I don't understand why one CPU
> needs to access another CPU's dynticks variable.

Because if RCU wakes up an idle CPU to determine that it was idle,
the guys that care about battery lifetime will get very angry with me.
This means that the CPU running the grace-period kthread needs to access
these idle CPUs' dyntick-idle state.  Because this state is recorded
in per-CPU variables, this means one CPU accessing another CPU's
per-CPU variables.

And yes, there has been talk about restricting cross-CPU access to per-CPU
variables.  Some people have been insisting that you should use IPIs in
these cases, but if I use IPIs, battery lifetime becomes a big problem.

So what to do?  Well, if people really do impose those sorts of
restrictions, RCU will simply move some of its state from per-CPU
variables to the old-school NR_CPUS-element arrays.  We should always use
the right tool for the job, so if some tool suddenly becomes the wrong
tool, then it is necessary to switch to some other tool.  Pretty simple!

> >> It gets twisted pretty fast trying to understand the RCU code. No
> >> wonder people say that rcu is scary black magic :)
> >
> > Well, let's just say that this isn't one of the parts of the RCU code
> > that should be randomly hacked.  Lots and lots of ways to get it wrong,
> > and very few ways to get it right.  And most of the ways of getting it
> > right are too slow or too non-scalable to be useful.
> 
> I am definitely not trying to hack randomly. Reading the code is very
> educating. I tried looking up why this was being done and it was not
> clear from the code and history. I was thinking of getting hold of you
> on IRC, but was not sure if that is such a good idea. I'll ask
> questions instead of sending RFCs from now on.

Perhaps instead of saying "hack randomly" I should have said "hack
locally on portions of RCU requiring global knowledge."  I am not trying
to insult you or to discourage you.  Instead, I am simply pointing out
that some parts of RCU are more intertwined than others, and suggesting
that you start hacking on the parts that are less intertwined.  Over time,
you will learn more about RCU and hopefully become able to take on some
of the more intertwined parts of RCU.

> > Speaking of portions of RCU that are more susceptible to local-knowledge
> > hacking, how are things going on that rcutorture printing fixup?
> 
> It must have reached your inbox by now :)

Indeed it did, along with a review.  ;-)

							Thanx, Paul

------------------------------------------------------------------------

In a perfect world, there would be up-to-date design documents describing
RCU.  This being the real world, what is available is incomplete and
outdated, but see http://www2.rdrop.com/users/paulmck/RCU/ for a long
list of writeups.  Of these, the most important are Documentation/RCU/*
and http://lwn.net/Articles/262464/, http://lwn.net/Articles/263130/,
and http://lwn.net/Articles/418853/.  But you probably have already
read these.

If you really want to understand RCU strictly from the source code,
that can be done, but you will need to choose your starting point very
very carefully.  I suggest the following approach:


1.	Start with TINY_RCU as a warmup exercise.  This does everything
	that RCU needs to do, but on a uniprocessor system.  Therefore,
	this is a good starting point to see how RCU interacts with
	the rest of the kernel.

2.	Move to userspace RCU, first reading the paper:
	http://www.rdrop.com/users/paulmck/RCU/urcu-main-accepted.2011.08.30a.pdf
	http://www.computer.org/cms/Computer.org/dl/trans/td/2012/02/extras/ttd2012020375s.pdf
	Then moving on to the code, which has changed a bit since the
	paper was written.

	This paper is highly recommended as it also gives a good overview
	of what RCU is trying to accomplish.

3.	Move to SRCU in include/linux/srcu.h and kernel/rcu/srcu.c.
	This gives a compact view of some of the memory-barrier tricks
	that RCU uses to provide its memory-ordering guarantees.

4.	Move to TREE_RCU, but with restricted Kconfig and workload:

	o	Start at call_rcu().

	o	Assume CONFIG_TREE_RCU=y and that most of the other Kconfig
		variables are deselected.  This will allow you to ignore the
		bulk of kernel/rcu/tree_plugin.h initially.

	o	Assume that CPUs never enter dyntick-idle state, so that
		rdp->dynticks->dynticks always has an odd-numbered value.

	o	Assume that all RCU grace periods end quickly, allowing
		you to ignore the stall-warning code.

5.	At this point, a quick review of the recent LWN articles on
	RCU features would help: http://lwn.net/Kernel/Index/#Read-copy-update

6.	This might be a good time to look at the dyntick-idle and
	CPU stall-warning mechanisms.

7.	Add in Kconfig variables one at a time, thus incrementally including
	code from kernel/rcu/tree_plugin.h.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
  2014-07-12 12:08             ` Paul E. McKenney
@ 2014-07-14 13:27               ` Pranith Kumar
  2014-07-16 13:14                 ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Pranith Kumar @ 2014-07-14 13:27 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Josh Triplett, open list:READ-COPY UPDATE...

On Sat, Jul 12, 2014 at 8:08 AM, Paul E. McKenney wrote:
>
> They ensure that any RCU read-side critical sections that took place before
> the current (or previous) idle/userspace period on the remote CPU in
> question are seen as having completed before the completion of the current
> grace period.  It also ensures that any RCU read-side critical sections
> that extend beyond the end of the current grace period (thus starting
> after the current (or previous) idle/userspace period) see any updates
> that were carried out before the beginning of the current grace period.
>
> Of course, that is also the purpose of many of RCU's memory barriers,
> so this probably doesn't help much.  An alternative explanation is that
> it ensures a coherent view of the ->dynticks counter, but I am quite
> sure that this helps even less.
>
> So here is the problem we are having:
>
> The dyntick_save_progress_counter() and rcu_implicit_dynticks_qs()
> functions are really bad places to start reading the RCU code.  I would
> say that starting there is like learning to swim by diving into the deep
> end of a swimming pool, but that doesn't really capture it.  Instead,
> it is more like learning to swim by diving from the top of this waterfall:
>
> http://blog.pacificnorthwestphotography.com/wp-content/uploads/2011/03/54.jpg
>
> To understand these functions, you will first need to understand how
> the rest of RCU works.  These functions are tiny cogs in RCU's energy
> efficiency optimization mechanism, which fits into the larger grace-period
> detection mechanism.  The purpose of the two atomic operations is to
> preserve the memory-ordering guarantees called out in the docbook header
> comments for call_rcu() and synchronize_rcu(), and I must confess that
> it is not clear to me that you actually read these header comments.
> Even so, these two functions interact with lots of other accesses to
> implement these guarantees -- so again, it is really really difficult
> to understand these two functions in isolation.
>
> Please see the end of this message for my suggested order of learning
> the RCU code.  A study plan, if you will.
>

This guide helps a lot, thank you for the detailed study plan. I will
make sure to go slow and steady. :)

I believe my question was about a local issue, let me try to explain.

My question stems from my understanding of why barriers are needed:

(i) to prevent compiler re-ordering of memory accesses
(ii) to ensure a partial ordering of memory accesses (global visibility)

Barriers are also costly and hence must be used sparingly, if at all.

I understand the need to use a barrier before/after an update to
a shared variable. And using a barrier before a read access to a
shared variable makes sense, as it ensures that we order this read
with a previous write from the same CPU, if any.

The question here is this: why do we need a barrier after a read
access to a shared variable?

The only reason I could think of is reason (i) above, but looking at
the code, we can see that the read cannot really float around. Did I
miss something or is there something fundamentally wrong with my
thinking?

Please note that all updates to dynticks are already strongly
ordered as the updates are wrapped with barriers both before and
after. So _reading_ the dynticks variable on other CPUs should be
consistent i.e., an update should be immediately visible.

-- 
Pranith


* Re: [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
  2014-07-14 13:27               ` Pranith Kumar
@ 2014-07-16 13:14                 ` Paul E. McKenney
  2014-07-17  1:50                   ` Pranith Kumar
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2014-07-16 13:14 UTC (permalink / raw)
  To: Pranith Kumar; +Cc: Josh Triplett, open list:READ-COPY UPDATE...

On Mon, Jul 14, 2014 at 09:27:00AM -0400, Pranith Kumar wrote:
> On Sat, Jul 12, 2014 at 8:08 AM, Paul E. McKenney wrote:
> >
> > They ensure that any RCU read-side critical sections that took place before
> > the current (or previous) idle/userspace period on the remote CPU in
> > question are seen as having completed before the completion of the current
> > grace period.  It also ensures that any RCU read-side critical sections
> > that extend beyond the end of the current grace period (thus starting
> > after the current (or previous) idle/userspace period) see any updates
> > that were carried out before the beginning of the current grace period.
> >
> > Of course, that is also the purpose of many of RCU's memory barriers,
> > so this probably doesn't help much.  An alternative explanation is that
> > it ensures a coherent view of the ->dynticks counter, but I am quite
> > sure that this helps even less.
> >
> > So here is the problem we are having:
> >
> > The dyntick_save_progress_counter() and rcu_implicit_dynticks_qs()
> > functions are really bad places to start reading the RCU code.  I would
> > say that starting there is like learning to swim by diving into the deep
> > end of a swimming pool, but that doesn't really capture it.  Instead,
> > it is more like learning to swim by diving from the top of this waterfall:
> >
> > http://blog.pacificnorthwestphotography.com/wp-content/uploads/2011/03/54.jpg
> >
> > To understand these functions, you will first need to understand how
> > the rest of RCU works.  These functions are tiny cogs in RCU's energy
> > efficiency optimization mechanism, which fits into the larger grace-period
> > detection mechanism.  The purpose of the two atomic operations is to
> > preserve the memory-ordering guarantees called out in the docbook header
> > comments for call_rcu() and synchronize_rcu(), and I must confess that
> > it is not clear to me that you actually read these header comments.
> > Even so, these two functions interact with lots of other accesses to
> > implement these guarantees -- so again, it is really really difficult
> > to understand these two functions in isolation.
> >
> > Please see the end of this message for my suggested order of learning
> > the RCU code.  A study plan, if you will.
> 
> This guide helps a lot, thank you for the detailed study plan. I will
> make sure to go slow and steady. :)

Best of everything with it!

> I believe my question was about a local issue, let me try to explain.
> 
> My question stems from my understanding of why barriers are needed:
> 
> (i) to prevent compiler re-ordering of memory accesses
> (ii) to ensure a partial ordering of memory accesses (global visibility)

Another way to put this is that barriers prevent both the compiler
(your i) and the CPU (your ii) from re-ordering memory references.

> Barriers are also costly and hence must be used sparingly, if at all.

Sometimes they are costly, as in smp_mb(), and sometimes they are almost
free, as in smp_read_barrier_depends() on most architectures.

> I understand the need to use a barrier before/after an update to
> a shared variable. And using a barrier before a read access to a
> shared variable makes sense, as it ensures that we order this read
> with a previous write from the same CPU, if any.
> 
> The question here is this: why do we need a barrier after a read
> access to a shared variable?

I suggest studying Scenario 3 (message and flag) in the LWN article
entitled "User-space RCU: Memory-barrier menagerie".  Here if CPU 0
fails to have a memory barrier after the read from "x", the BUG_ON()
expression can trigger.

> The only reason I could think of is reason (i) above, but looking at
> the code, we can see that the read cannot really float around. Did I
> miss something or is there something fundamentally wrong with my
> thinking?

The idea is to keep the things following the read, for example, any
code executed after the end of the grace period, from being reordered
with that read.  But as stated several times before, you really need to
know quite a bit more about how RCU works for this to make much sense.
This particular access is one small piece of a larger group that allows
RCU to correctly implement its memory ordering properties.  Trying to
understand this access in isolation is sort of like trying to understand
a piston without knowing about the cylinder, piston rod, and crankshaft.

> Please note that all updates to dynticks are already strongly
> ordered as the updates are wrapped with barriers both before and
> after. So _reading_ the dynticks variable on other CPUs should be
> consistent i.e., an update should be immediately visible.

Please see above.

Memory ordering is not about consistent access to a single variable,
but rather about ordering accesses to multiple variables.

							Thanx, Paul



* Re: [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v)
  2014-07-16 13:14                 ` Paul E. McKenney
@ 2014-07-17  1:50                   ` Pranith Kumar
  0 siblings, 0 replies; 11+ messages in thread
From: Pranith Kumar @ 2014-07-17  1:50 UTC (permalink / raw)
  To: paulmck; +Cc: Josh Triplett, open list:READ-COPY UPDATE...

On 07/16/2014 09:14 AM, Paul E. McKenney wrote:
> On Mon, Jul 14, 2014 at 09:27:00AM -0400, Pranith Kumar wrote:
>> On Sat, Jul 12, 2014 at 8:08 AM, Paul E. McKenney wrote:
>>>
>>> They ensure that any RCU read-side critical sections that took place before
>>> the current (or previous) idle/userspace period on the remote CPU in
>>> question are seen as having completed before the completion of the current
>>> grace period.  It also ensures that any RCU read-side critical sections
>>> that extend beyond the end of the current grace period (thus starting
>>> after the current (or previous) idle/userspace period) see any updates
>>> that were carried out before the beginning of the current grace period.
>>>
>>> Of course, that is also the purpose of many of RCU's memory barriers,
>>> so this probably doesn't help much.  An alternative explanation is that
>>> it ensures a coherent view of the ->dynticks counter, but I am quite
>>> sure that this helps even less.
>>>
>>> So here is the problem we are having:
>>>
>>> The dyntick_save_progress_counter() and rcu_implicit_dynticks_qs()
>>> functions are really bad places to start reading the RCU code.  I would
>>> say that starting there is like learning to swim by diving into the deep
>>> end of a swimming pool, but that doesn't really capture it.  Instead,
>>> it is more like learning to swim by diving from the top of this waterfall:
>>>
>>> http://blog.pacificnorthwestphotography.com/wp-content/uploads/2011/03/54.jpg
>>>
>>> To understand these functions, you will first need to understand how
>>> the rest of RCU works.  These functions are tiny cogs in RCU's energy
>>> efficiency optimization mechanism, which fits into the larger grace-period
>>> detection mechanism.  The purpose of the two atomic operations is to
>>> preserve the memory-ordering guarantees called out in the docbook header
>>> comments for call_rcu() and synchronize_rcu(), and I must confess that
>>> it is not clear to me that you actually read these header comments.
>>> Even so, these two functions interact with lots of other accesses to
>>> implement these guarantees -- so again, it is really really difficult
>>> to understand these two functions in isolation.
>>>
>>> Please see the end of this message for my suggested order of learning
>>> the RCU code.  A study plan, if you will.
>>
>> This guide helps a lot, thank you for the detailed study plan. I will
>> make sure to go slow and steady. :)
> 
> Best of everything with it!
> 

Thanks a lot, Paul, for taking the time to help me understand. Hopefully one of
these days, after reading the study list, I actually will :)

--
Pranith.


end of thread, other threads:[~2014-07-17  1:50 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-08 22:55 [RFC PATCH 1/1] rcu: use atomic_read(v) instead of atomic_add_return(0, v) Pranith Kumar
2014-07-09  0:07 ` Paul E. McKenney
2014-07-09  4:00   ` Pranith Kumar
2014-07-09 19:56     ` Paul E. McKenney
2014-07-11  1:17       ` Pranith Kumar
2014-07-11  9:43         ` Paul E. McKenney
2014-07-11 22:32           ` Pranith Kumar
2014-07-12 12:08             ` Paul E. McKenney
2014-07-14 13:27               ` Pranith Kumar
2014-07-16 13:14                 ` Paul E. McKenney
2014-07-17  1:50                   ` Pranith Kumar
