All of lore.kernel.org
 help / color / mirror / Atom feed
* [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch
@ 2011-05-21 14:06 Paul E. McKenney
  2011-05-21 14:28 ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2011-05-21 14:06 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel, randy.dunlap, Valdis.Kletnieks, a.p.zijlstra

Hello, Ingo,

This pull requests covers some RCU bug fixes and one patch rework.

The first group breaks up the infamous now-reverted (but ultimately
vindicated) "Decrease memory-barrier usage based on semi-formal proof"
commit into five commits.  These five commits immediately follow the
revert, and the diff across all six of these commits is empty, so that
the effect of the five commits is to revert the revert.

Another commit, "Avoid build error for third-party modules", fixes
a build error reported by Randy Dunlap.

Another pair of commits, "Add atomic_or()" and "Avoid acquiring rcu_node
locks in timer functions", fix a lockdep splat reported by Valdis Kletnieks.

Finally, "Remove waitqueue usage for cpu, node, and boost kthreads", from
Peter Zijlstra, simplifies the RCU kthread wakeup logic and that also fixes a
bug that resulted in a crash.

These changes are available in the -rcu git repository at:

 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu.git rcu/next

------------------>

Paul E. McKenney (8):
      rcu: Add memory barriers
      rcu: Remove old memory barriers from rcu_process_callbacks()
      rcu: Don't do reschedule unless in irq
      rcu: Make rcu_enter_nohz() pay attention to nesting
      rcu: Decrease memory-barrier usage based on semi-formal proof
      rcu: Avoid build error for third-party modules
      atomic: Add atomic_or()
      rcu: Avoid acquiring rcu_node locks in timer functions

Peter Zijlstra (1):
      rcu: Remove waitqueue usage for cpu, node, and boost kthreads

 Documentation/RCU/trace.txt |   17 ++---
 include/linux/atomic.h      |   13 ++++
 include/linux/rcupdate.h    |    5 +-
 kernel/rcutree.c            |  162 +++++++++++++++++--------------------------
 kernel/rcutree.h            |   30 ++++----
 kernel/rcutree_plugin.h     |   23 +-----
 kernel/rcutree_trace.c      |   12 ++--
 7 files changed, 112 insertions(+), 150 deletions(-)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch
  2011-05-21 14:06 [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch Paul E. McKenney
@ 2011-05-21 14:28 ` Ingo Molnar
  2011-05-21 19:08   ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2011-05-21 14:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, randy.dunlap, Valdis.Kletnieks, a.p.zijlstra


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> Hello, Ingo,
> 
> This pull requests covers some RCU bug fixes and one patch rework.
> 
> The first group breaks up the infamous now-reverted (but ultimately
> vindicated) "Decrease memory-barrier usage based on semi-formal proof"
> commit into five commits.  These five commits immediately follow the
> revert, and the diff across all six of these commits is empty, so that
> the effect of the five commits is to revert the revert.

But ... the regression that was observed with that commit needs to be fixed 
first, or not? In what way was the barrier commit vindicated?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch
  2011-05-21 14:28 ` Ingo Molnar
@ 2011-05-21 19:08   ` Paul E. McKenney
  2011-05-21 19:14     ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2011-05-21 19:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, randy.dunlap, Valdis.Kletnieks, a.p.zijlstra

On Sat, May 21, 2011 at 04:28:44PM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > Hello, Ingo,
> > 
> > This pull requests covers some RCU bug fixes and one patch rework.
> > 
> > The first group breaks up the infamous now-reverted (but ultimately
> > vindicated) "Decrease memory-barrier usage based on semi-formal proof"
> > commit into five commits.  These five commits immediately follow the
> > revert, and the diff across all six of these commits is empty, so that
> > the effect of the five commits is to revert the revert.
> 
> But ... the regression that was observed with that commit needs to be fixed 
> first, or not? In what way was the barrier commit vindicated?

>From what I can see, the hang was fixed by Frederic's patch at
https://lkml.org/lkml/2011/5/19/753.  I was interpreting that as
vindication, perhaps ill-advisedly.

Yinghai said that he was still seeing a delay, adn that he was seeing
it even with the "Decrease memory-barrier usage based on semi-formal
proof" reverted: https://lkml.org/lkml/2011/5/20/427.  This hang seems
to happen when he uses gcc 4.5.0, but not when using gcc 4.5.1, assuming
I understood his sequence of emails.  So I was interpreting that as
meaning that the delay was unlikely to be caused by that commit, probably
by one of the later commits.

I clearly need to figure out what is causing this delay.  I asked Yinghai
to apply c7a378603 (Remove waitqueue usage for cpu, node, and boost kthreads)
from Peter Zijlstra because the long delays that Yinghai is seeing
(93 seconds for memory_dev_init() rather than 3 or 4 seconds) might be
due to my less-efficient method of awakening the RCU kthreads, so that
Peter's approache might help.

If that doesn't speed things up for Yinghai, then I will work out some
tracing to help localize the slowdown that he is seeing.

Of course, if you would rather that I get to the bottom of this before
pulling, fair enough!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch
  2011-05-21 19:08   ` Paul E. McKenney
@ 2011-05-21 19:14     ` Ingo Molnar
  2011-05-21 20:39       ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2011-05-21 19:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, randy.dunlap, Valdis.Kletnieks, a.p.zijlstra


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> On Sat, May 21, 2011 at 04:28:44PM +0200, Ingo Molnar wrote:
> > 
> > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > Hello, Ingo,
> > > 
> > > This pull requests covers some RCU bug fixes and one patch rework.
> > > 
> > > The first group breaks up the infamous now-reverted (but ultimately
> > > vindicated) "Decrease memory-barrier usage based on semi-formal proof"
> > > commit into five commits.  These five commits immediately follow the
> > > revert, and the diff across all six of these commits is empty, so that
> > > the effect of the five commits is to revert the revert.
> > 
> > But ... the regression that was observed with that commit needs to be fixed 
> > first, or not? In what way was the barrier commit vindicated?
> 
> From what I can see, the hang was fixed by Frederic's patch at 
> https://lkml.org/lkml/2011/5/19/753.  I was interpreting that as vindication, 
> perhaps ill-advisedly.

I mean, without Frederic's patch we are getting very long hangs due to the 
barrier patch, right?

Even if the barrier patch is not to blame - somehow it still managed to produce 
these hangs - and we do not understand it yet.

> Yinghai said that he was still seeing a delay, adn that he was seeing it even 
> with the "Decrease memory-barrier usage based on semi-formal proof" reverted: 
> https://lkml.org/lkml/2011/5/20/427.  This hang seems to happen when he uses 
> gcc 4.5.0, but not when using gcc 4.5.1, assuming I understood his sequence 
> of emails.  So I was interpreting that as meaning that the delay was unlikely 
> to be caused by that commit, probably by one of the later commits.
> 
> I clearly need to figure out what is causing this delay.  I asked Yinghai to 
> apply c7a378603 (Remove waitqueue usage for cpu, node, and boost kthreads) 
> from Peter Zijlstra because the long delays that Yinghai is seeing (93 
> seconds for memory_dev_init() rather than 3 or 4 seconds) might be due to my 
> less-efficient method of awakening the RCU kthreads, so that Peter's 
> approache might help.
> 
> If that doesn't speed things up for Yinghai, then I will work out some 
> tracing to help localize the slowdown that he is seeing.
> 
> Of course, if you would rather that I get to the bottom of this before 
> pulling, fair enough!

We should fix the delay regression i suspect - do we have to revert more stuff 
perhaps?

Would it be possible to figure out what caused that other delay for Yinghai?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch
  2011-05-21 19:14     ` Ingo Molnar
@ 2011-05-21 20:39       ` Paul E. McKenney
  2011-05-22  9:04         ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2011-05-21 20:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, randy.dunlap, Valdis.Kletnieks, a.p.zijlstra

On Sat, May 21, 2011 at 09:14:18PM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Sat, May 21, 2011 at 04:28:44PM +0200, Ingo Molnar wrote:
> > > 
> > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > > Hello, Ingo,
> > > > 
> > > > This pull requests covers some RCU bug fixes and one patch rework.
> > > > 
> > > > The first group breaks up the infamous now-reverted (but ultimately
> > > > vindicated) "Decrease memory-barrier usage based on semi-formal proof"
> > > > commit into five commits.  These five commits immediately follow the
> > > > revert, and the diff across all six of these commits is empty, so that
> > > > the effect of the five commits is to revert the revert.
> > > 
> > > But ... the regression that was observed with that commit needs to be fixed 
> > > first, or not? In what way was the barrier commit vindicated?
> > 
> > From what I can see, the hang was fixed by Frederic's patch at 
> > https://lkml.org/lkml/2011/5/19/753.  I was interpreting that as vindication, 
> > perhaps ill-advisedly.
> 
> I mean, without Frederic's patch we are getting very long hangs due to the 
> barrier patch, right?

Yes.  The reason we are seeing these hangs is that HARDIRQ_ENTER()
invoked irq_enter(), which calls rcu_irq_enter() but that the matching
HARDIRQ_EXIT() invoked __irq_exit(), which does not call rcu_irq_exit().
This resulted in calls to rcu_irq_enter() that were not balanced by
matching calls to rcu_irq_exit().  Therefore, after these tests completed,
RCU's dyntick-idle nesting count was a large number, which caused RCU
to conclude that the affected CPU was not in dyntick-idle mode when in
fact it was.

RCU would therefore incorrectly wait for this dyntick-idle CPU.

With Frederic's patch, these tests don't ever call either rcu_irq_enter()
or rcu_irq_exit(), which works because the CPU running the test is
already marked as not being in dyntick-idle mode.

So, with Frederic's patch, the rcu_irq_enter() and rcu_irq_exit() calls
are balanced and things work.

The reason that the imbalance was not noticed before the barrier patch
was applied is that the old implementation of rcu_enter_nohz() ignored
the nesting depth.  This could still result in delays, but much shorter
ones.  Whenever there was a delay, RCU would IPI the CPU with the
unbalanced nesting level, which would eventually result in rcu_enter_nohz()
being called, which in turn would force RCU to see that the CPU was in
dyntick-idle mode.

Hmmm...  I should add this line of reasoning to one of the commit logs,
shouldn't I?  (Added it.  Which of course invalidates my pull request.)

> Even if the barrier patch is not to blame - somehow it still managed to produce 
> these hangs - and we do not understand it yet.

>From Yinghai's message https://lkml.org/lkml/2011/5/12/465, I believe
that the residual delay he is seeing is not due to the barrier patch,
but rather due to a26ac2455 (move TREE_RCU from softirq to kthrea).

More on this below.

> > Yinghai said that he was still seeing a delay, adn that he was seeing it even 
> > with the "Decrease memory-barrier usage based on semi-formal proof" reverted: 
> > https://lkml.org/lkml/2011/5/20/427.  This hang seems to happen when he uses 
> > gcc 4.5.0, but not when using gcc 4.5.1, assuming I understood his sequence 
> > of emails.  So I was interpreting that as meaning that the delay was unlikely 
> > to be caused by that commit, probably by one of the later commits.
> > 
> > I clearly need to figure out what is causing this delay.  I asked Yinghai to 
> > apply c7a378603 (Remove waitqueue usage for cpu, node, and boost kthreads) 
> > from Peter Zijlstra because the long delays that Yinghai is seeing (93 
> > seconds for memory_dev_init() rather than 3 or 4 seconds) might be due to my 
> > less-efficient method of awakening the RCU kthreads, so that Peter's 
> > approache might help.
> > 
> > If that doesn't speed things up for Yinghai, then I will work out some 
> > tracing to help localize the slowdown that he is seeing.
> > 
> > Of course, if you would rather that I get to the bottom of this before 
> > pulling, fair enough!
> 
> We should fix the delay regression i suspect - do we have to revert more stuff 
> perhaps?
> 
> Would it be possible to figure out what caused that other delay for Yinghai?

Earlier, Yinghai reported that reverting a26ac2455ffc (move TREE_RCU
from softirq to kthread) and everything after it made what appears to be
the same sort of delay go away (https://lkml.org/lkml/2011/5/12/465).
This commit replaced raise_softirq() with wait queues, flags, and
wake_up().  Later, Yinghai said that the delay shows up in kernels
built using opensuse 11.3, but not in kernels build using fedora 14
(https://lkml.org/lkml/2011/5/20/469).  Still later, he said that opensuse
11.3 has gcc 4.5.0 and fedora 14 has gcc 4.5.1.

Differences in compilers usually don't produce 20-to-1 latency differences
without something amplifying them.  In this case, that something
is likely to be the wait/wakeup coordination.  Peter's recent patch
(https://lkml.org/lkml/2011/5/19/133) to fix some CPU-hotplug-related
issues in the scheduler (https://lkml.org/lkml/2011/5/13/22) changed
RCU's kthread wait/wakeup coordination.

So I asked that Yinghai try c7a3786030 (Remove waitqueue usage for cpu,
node, and boost kthreads) from Peter currently queued on -rcu.

If that doesn't help, I will probably provide Yinghai some tracing
patches.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch
  2011-05-21 20:39       ` Paul E. McKenney
@ 2011-05-22  9:04         ` Ingo Molnar
  2011-05-22 16:17           ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2011-05-22  9:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, randy.dunlap, Valdis.Kletnieks, a.p.zijlstra


* Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

> > I mean, without Frederic's patch we are getting very long hangs due to the 
> > barrier patch, right?
> 
> Yes.  The reason we are seeing these hangs is that HARDIRQ_ENTER() invoked 
> irq_enter(), which calls rcu_irq_enter() but that the matching HARDIRQ_EXIT() 
> invoked __irq_exit(), which does not call rcu_irq_exit(). This resulted in 
> calls to rcu_irq_enter() that were not balanced by matching calls to 
> rcu_irq_exit().  Therefore, after these tests completed, RCU's dyntick-idle 
> nesting count was a large number, which caused RCU to conclude that the 
> affected CPU was not in dyntick-idle mode when in fact it was.
> 
> RCU would therefore incorrectly wait for this dyntick-idle CPU.
> 
> With Frederic's patch, these tests don't ever call either rcu_irq_enter() or 
> rcu_irq_exit(), which works because the CPU running the test is already 
> marked as not being in dyntick-idle mode.
> 
> So, with Frederic's patch, the rcu_irq_enter() and rcu_irq_exit() calls are 
> balanced and things work.
> 
> The reason that the imbalance was not noticed before the barrier patch was 
> applied is that the old implementation of rcu_enter_nohz() ignored the 
> nesting depth.  This could still result in delays, but much shorter ones.  
> Whenever there was a delay, RCU would IPI the CPU with the unbalanced nesting 
> level, which would eventually result in rcu_enter_nohz() being called, which 
> in turn would force RCU to see that the CPU was in dyntick-idle mode.
> 
> Hmmm...  I should add this line of reasoning to one of the commit logs, 
> shouldn't I?  (Added it.  Which of course invalidates my pull request.)

Well, the thing i was missing from the tree was Frederic's fix patch. Or was 
that included in one of the commits?

I mean, if we just revert the revert, we reintroduce the delay, no matter who 
is to blame - not good! :-)

> > Even if the barrier patch is not to blame - somehow it still managed to 
> > produce these hangs - and we do not understand it yet.
> 
> >From Yinghai's message https://lkml.org/lkml/2011/5/12/465, I believe
> that the residual delay he is seeing is not due to the barrier patch,
> but rather due to a26ac2455 (move TREE_RCU from softirq to kthrea).
> 
> More on this below.

Ok - we can treat that regression differently. Also, that seems like a much 
shorter delay, correct? The delays fixed by Frederic's patch were huge (i think 
i saw a 1 hour delay once) - they were essentially not delays but hangs.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch
  2011-05-22  9:04         ` Ingo Molnar
@ 2011-05-22 16:17           ` Paul E. McKenney
  0 siblings, 0 replies; 7+ messages in thread
From: Paul E. McKenney @ 2011-05-22 16:17 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, randy.dunlap, Valdis.Kletnieks, a.p.zijlstra

On Sun, May 22, 2011 at 11:04:40AM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > > I mean, without Frederic's patch we are getting very long hangs due to the 
> > > barrier patch, right?
> > 
> > Yes.  The reason we are seeing these hangs is that HARDIRQ_ENTER() invoked 
> > irq_enter(), which calls rcu_irq_enter() but that the matching HARDIRQ_EXIT() 
> > invoked __irq_exit(), which does not call rcu_irq_exit(). This resulted in 
> > calls to rcu_irq_enter() that were not balanced by matching calls to 
> > rcu_irq_exit().  Therefore, after these tests completed, RCU's dyntick-idle 
> > nesting count was a large number, which caused RCU to conclude that the 
> > affected CPU was not in dyntick-idle mode when in fact it was.
> > 
> > RCU would therefore incorrectly wait for this dyntick-idle CPU.
> > 
> > With Frederic's patch, these tests don't ever call either rcu_irq_enter() or 
> > rcu_irq_exit(), which works because the CPU running the test is already 
> > marked as not being in dyntick-idle mode.
> > 
> > So, with Frederic's patch, the rcu_irq_enter() and rcu_irq_exit() calls are 
> > balanced and things work.
> > 
> > The reason that the imbalance was not noticed before the barrier patch was 
> > applied is that the old implementation of rcu_enter_nohz() ignored the 
> > nesting depth.  This could still result in delays, but much shorter ones.  
> > Whenever there was a delay, RCU would IPI the CPU with the unbalanced nesting 
> > level, which would eventually result in rcu_enter_nohz() being called, which 
> > in turn would force RCU to see that the CPU was in dyntick-idle mode.
> > 
> > Hmmm...  I should add this line of reasoning to one of the commit logs, 
> > shouldn't I?  (Added it.  Which of course invalidates my pull request.)
> 
> Well, the thing i was missing from the tree was Frederic's fix patch. Or was 
> that included in one of the commits?

Ah!  I don't see any evidence of anyone else having taken it, so I just
now queued it.

> I mean, if we just revert the revert, we reintroduce the delay, no matter who 
> is to blame - not good! :-)

Good point!  ;-)

> > > Even if the barrier patch is not to blame - somehow it still managed to 
> > > produce these hangs - and we do not understand it yet.
> > 
> > >From Yinghai's message https://lkml.org/lkml/2011/5/12/465, I believe
> > that the residual delay he is seeing is not due to the barrier patch,
> > but rather due to a26ac2455 (move TREE_RCU from softirq to kthrea).
> > 
> > More on this below.
> 
> Ok - we can treat that regression differently. Also, that seems like a much 
> shorter delay, correct? The delays fixed by Frederic's patch were huge (i think 
> i saw a 1 hour delay once) - they were essentially not delays but hangs.

Yes, the delays fixed by Frederic's patch were hours in length, while
the remaining delays measure in seconds.  And I am looking at the code
and at how grace-period duration has varied, so hope to get to the
bottom of it in a few days.  Hopefully sooner.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2011-05-22 16:17 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-21 14:06 [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch Paul E. McKenney
2011-05-21 14:28 ` Ingo Molnar
2011-05-21 19:08   ` Paul E. McKenney
2011-05-21 19:14     ` Ingo Molnar
2011-05-21 20:39       ` Paul E. McKenney
2011-05-22  9:04         ` Ingo Molnar
2011-05-22 16:17           ` Paul E. McKenney

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.