linux-kernel.vger.kernel.org archive mirror
* Possible sandybridge livelock issue
@ 2011-05-13 16:12 James Bottomley
  2011-05-13 16:36 ` Andi Kleen
  2011-05-16  6:29 ` Ingo Molnar
  0 siblings, 2 replies; 7+ messages in thread
From: James Bottomley @ 2011-05-13 16:12 UTC
  To: x86; +Cc: linux-mm, linux-kernel, Mel Gorman

We've just come off a large round of debugging a kswapd problem over on
linux-mm:

http://marc.info/?t=130392066000001

The upshot was that kswapd wasn't being allowed to sleep (which we're
now fixing).  However, in spite of intensive efforts, the actual hang
was only reproducible on sandybridge laptops.

When the hang occurred, kswapd basically pegged one core in 100% system
time.  This looks like there's something specific to sandybridge that
causes this type of bad interaction.  I was wondering if it could be
something to do with a scheduling problem in turbo mode?  Once kswapd
goes flat out, the core it's on will kick into turbo mode, which causes
it to get preferentially scheduled there, leading to the live lock.

The only evidence I have to support this theory is that when I reproduce
the problem with PREEMPT, the core pegs at 100% system time and stays
there even if I turn off the load.  However, if I can execute work that
causes kswapd to be kicked off the core it's running on, it will calm
back down and go to sleep.

James




* Re: Possible sandybridge livelock issue
  2011-05-13 16:12 Possible sandybridge livelock issue James Bottomley
@ 2011-05-13 16:36 ` Andi Kleen
  2011-05-13 17:08   ` Christoph Lameter
  2011-05-16  6:29 ` Ingo Molnar
  1 sibling, 1 reply; 7+ messages in thread
From: Andi Kleen @ 2011-05-13 16:36 UTC
  To: James Bottomley; +Cc: x86, linux-mm, linux-kernel, Mel Gorman

James Bottomley <James.Bottomley@HansenPartnership.com> writes:
>
> When the hang occurred, kswapd basically pegged one core in 100% system
> time.  This looks like there's something specific to sandybridge that
> causes this type of bad interaction.  I was wondering if it could be
> something to do with a scheduling problem in turbo mode?  Once kswapd
> goes flat out, the core it's on will kick into turbo mode, which causes
> it to get preferentially scheduled there, leading to the live lock.

Sounds unlikely to me.

Turbo mode does not affect the scheduler and the cores are (reasonably) 
independent.


> The only evidence I have to support this theory is that when I reproduce
> the problem with PREEMPT, the core pegs at 100% system time and stays
> there even if I turn off the load.  However, if I can execute work that
> causes kswapd to be kicked off the core it's running on, it will calm
> back down and go to sleep.

Turbo mode just makes the CPU faster, but it should not change 
the scheduler decisions.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: Possible sandybridge livelock issue
  2011-05-13 16:36 ` Andi Kleen
@ 2011-05-13 17:08   ` Christoph Lameter
  2011-05-13 18:23     ` Andi Kleen
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Lameter @ 2011-05-13 17:08 UTC
  To: Andi Kleen; +Cc: James Bottomley, x86, linux-mm, linux-kernel, Mel Gorman

On Fri, 13 May 2011, Andi Kleen wrote:

> Turbo mode just makes the CPU faster, but it should not change
> the scheduler decisions.

I also have similar issues with Sandybridge on Ubuntu 11.04 and kernels
2.6.38 as well as 2.6.39 (standard ubuntu kernel configs).


* Re: Possible sandybridge livelock issue
  2011-05-13 17:08   ` Christoph Lameter
@ 2011-05-13 18:23     ` Andi Kleen
  2011-05-13 18:49       ` James Bottomley
  0 siblings, 1 reply; 7+ messages in thread
From: Andi Kleen @ 2011-05-13 18:23 UTC
  To: Christoph Lameter
  Cc: James Bottomley, x86, linux-mm, linux-kernel, Mel Gorman

Christoph Lameter <cl@linux.com> writes:

> On Fri, 13 May 2011, Andi Kleen wrote:
>
>> Turbo mode just makes the CPU faster, but it should not change
>> the scheduler decisions.
>
> I also have similar issues with Sandybridge on Ubuntu 11.04 and kernels
> 2.6.38 as well as 2.6.39 (standard ubuntu kernel configs).

It still doesn't make a lot of sense to blame the CPU for this.
This is just not the level at which CPU problems would likely appear.

Can you figure out better what the kswapd is doing?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: Possible sandybridge livelock issue
  2011-05-13 18:23     ` Andi Kleen
@ 2011-05-13 18:49       ` James Bottomley
  2011-05-16  6:52         ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: James Bottomley @ 2011-05-13 18:49 UTC
  To: Andi Kleen; +Cc: Christoph Lameter, x86, linux-mm, linux-kernel, Mel Gorman

On Fri, 2011-05-13 at 11:23 -0700, Andi Kleen wrote:
> Christoph Lameter <cl@linux.com> writes:
> 
> > On Fri, 13 May 2011, Andi Kleen wrote:
> >
> >> Turbo mode just makes the CPU faster, but it should not change
> >> the scheduler decisions.
> >
> > I also have similar issues with Sandybridge on Ubuntu 11.04 and kernels
> > 2.6.38 as well as 2.6.39 (standard ubuntu kernel configs).
> 
> It still doesn't make a lot of sense to blame the CPU for this.
> This is just not the level at which CPU problems would likely appear.
> 
> Can you figure out better what the kswapd is doing?

We have ... it was the thread in the first email.  We don't need a fix
for the kswapd issue; what we're warning about is a potential
sandybridge problem.

The facts are that only sandybridge systems livelocked in the kswapd
problem ... no other systems could reproduce it, although they did see
heavy CPU time accumulate to kswapd.  And this is with a gang of mm
people trying to reproduce the problem on non-sandybridge systems.

On the sandybridge systems that livelocked, it was sometimes possible to
release the lock by pushing kswapd off the cpu it was hogging.

If you think the theory about why this happened to be wrong, fine ...
come up with another one.  The facts are as above and only sandybridge
systems seem to be affected.

James




* Re: Possible sandybridge livelock issue
  2011-05-13 16:12 Possible sandybridge livelock issue James Bottomley
  2011-05-13 16:36 ` Andi Kleen
@ 2011-05-16  6:29 ` Ingo Molnar
  1 sibling, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2011-05-16  6:29 UTC
  To: James Bottomley
  Cc: x86, linux-mm, linux-kernel, Mel Gorman, Peter Zijlstra,
	Mike Galbraith, Thomas Gleixner, H. Peter Anvin


* James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> We've just come off a large round of debugging a kswapd problem over on
> linux-mm:
> 
> http://marc.info/?t=130392066000001
> 
> The upshot was that kswapd wasn't being allowed to sleep (which we're
> now fixing).  However, in spite of intensive efforts, the actual hang
> was only reproducible on sandybridge laptops.
> 
> When the hang occurred, kswapd basically pegged one core in 100% system
> time.  This looks like there's something specific to sandybridge that
> causes this type of bad interaction.  I was wondering if it could be
> something to do with a scheduling problem in turbo mode?  Once kswapd
> goes flat out, the core it's on will kick into turbo mode, which causes
> it to get preferentially scheduled there, leading to the live lock.

There's no explicit 'schedule Sandybridge differently' logic in the scheduler.

Thus turbo mode can only affect scheduling by executing code faster. Executing 
faster *does* mean more scheduling on that CPU: it's faster to do work so it's 
faster back to idle again.

I.e. i can see Sandybridge being special only due to timing and performance 
differences.

> The only evidence I have to support this theory is that when I reproduce the 
> problem with PREEMPT, the core pegs at 100% system time and stays there even 
> if I turn off the load.  However, if I can execute work that causes kswapd to 
> be kicked off the core it's running on, it will calm back down and go to 
> sleep.

At first sight this looks like some sort of kswapd problem: if you put kswapd 
into TASK_*INTERRUPTIBLE and schedule() it then the scheduler won't keep it 
running, on Sandybridge or elsewhere. The scheduler can't magically make kswapd 
runnable unless there's some big bug in it. So you first need to examine why 
kswapd never schedules to idle.
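
As a rough illustration of that point (a toy model, not kernel code): the
sketch below treats a kswapd-like thread as runnable until it decides it may
sleep, after which it consumes no ticks until memory pressure wakes it again.
`ToyDaemon`, `never_sleeps` and `run` are hypothetical names invented for this
example; the `never_sleeps` flag stands in for whatever defect keeps the sleep
check from ever succeeding, and nothing here models the real scheduler or the
real kswapd conditions.

```python
from dataclasses import dataclass


@dataclass
class ToyDaemon:
    """Hypothetical stand-in for a kswapd-like reclaim thread."""
    never_sleeps: bool        # assumed defect: the sleep check never succeeds
    runnable: bool = True
    busy_ticks: int = 0

    def tick(self, pressure):
        """Advance one tick; return True if the daemon used the CPU."""
        if not self.runnable:
            if pressure == 0:
                return False          # still asleep, nothing woke it
            self.runnable = True      # a waker saw memory pressure
        self.busy_ticks += 1
        if pressure == 0 and not self.never_sleeps:
            # Analogous to setting a sleeping task state and calling
            # schedule(): the daemon stays off the CPU until woken.
            self.runnable = False
        return True


def run(daemon, pressures):
    """Feed a sequence of per-tick pressure values; return ticks burned."""
    for pressure in pressures:
        daemon.tick(pressure)
    return daemon.busy_ticks


load = [1] * 5 + [0] * 20             # five ticks of load, then idle
healthy = run(ToyDaemon(never_sleeps=False), load)
stuck = run(ToyDaemon(never_sleeps=True), load)
print(healthy, stuck)
```

With five ticks of pressure followed by twenty idle ticks, the healthy daemon
burns 6 ticks and the one that is never allowed to sleep burns all 25, which
matches the reported symptom of the core staying pegged after the load is
turned off. In this model the scheduler plays no part in the difference.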

Thanks,

	Ingo


* Re: Possible sandybridge livelock issue
  2011-05-13 18:49       ` James Bottomley
@ 2011-05-16  6:52         ` Ingo Molnar
  0 siblings, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2011-05-16  6:52 UTC
  To: James Bottomley
  Cc: Andi Kleen, Christoph Lameter, x86, linux-mm, linux-kernel, Mel Gorman


* James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> > Can you figure out better what the kswapd is doing?
> 
> We have ... it was the thread in the first email.  We don't need a fix for 
> the kswapd issue; what we're warning about is a potential sandybridge 
> problem.
> 
> The facts are that only sandybridge systems livelocked in the kswapd problem 
> ... no other systems could reproduce it, although they did see heavy CPU time 
> accumulate to kswapd.  And this is with a gang of mm people trying to 
> reproduce the problem on non-sandybridge systems.
> 
> On the sandybridge systems that livelocked, it was sometimes possible to 
> release the lock by pushing kswapd off the cpu it was hogging.

It's not uncommon at all to see certain races (or even livelocks) only with the 
latest and greatest CPUs.

I have a first-gen CPU system that, when i got it a couple of years ago, 
triggered about a dozen Linux kernel races and bugs that were theoretically 
possible on all other CPUs but had not been reported on any other Linux system 
up to that point, *ever* - and some of those bugs were many years old.

> If you think the theory about why this happened to be wrong, fine ... come up 
> with another one.  The facts are as above and only sandybridge systems seem 
> to be affected.

I can see at least four other plausible hypotheses, all matching the facts as 
you laid them out:

 - it could be a bug/race in the kswapd code.

 - it could be that the race window needs a certain level of instruction 
   parallelism - which occurs with a higher likelihood on Sandybridge.

 - it could be that Sandybridge CPUs keep dirty cachelines owned a bit longer 
   than other CPUs, making an existing livelock bug in the kernel code easier 
   to trigger.

 - a hardware bug: cacheline ownership might not be arbitrated between 
   nodes/cpus/cores fairly (enough), so that a specific CPU can monopolize a 
   cacheline for a very long time, as long as it keeps modifying it in an 
   aggressive enough kswapd loop.
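
The last hypothesis can be made concrete with a toy arbitration model. This
is only a sketch under invented assumptions: two cores both want the same
cacheline all the time, `fair=True` serves requests in FIFO order, and
`fair=False` lets the current owner's re-request win because it already holds
the line. The function name `grants` and both policies are hypothetical and
say nothing about how Sandybridge actually arbitrates ownership.

```python
from collections import deque


def grants(n_grants, fair):
    """Count how often each of two cores is granted the contended line."""
    counts = [0, 0]
    queue = deque([0, 1])      # pending requests; core 0 asked first
    owner = None
    for _ in range(n_grants):
        if fair or owner is None:
            owner = queue.popleft()    # FIFO: the oldest request wins
        else:
            # Unfair policy (assumed): the owner's own re-request is
            # served ahead of the waiting core.
            queue.remove(owner)
        counts[owner] += 1
        queue.append(owner)            # the owner immediately asks again
    return counts


print(grants(10, fair=True))
print(grants(10, fair=False))
```

Under the FIFO policy the two cores split the grants evenly; under the unfair
policy core 0 receives every grant and core 1 is starved for as long as core 0
keeps asking, which is the shape of livelock the hypothesis describes.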

Note, since each of these hypotheses has a specific non-zero chance of being 
the objective truth, your hypothesis might in the end turn out to be the right 
one and might turn into a proven scientific theory: CPU and scheduler bugs do 
happen after all.

The other hypotheses i outlined have non-zero chances as well: kswapd bugs do 
happen as well and various CPU timing differences do tend to occur as well.

But above you seem to be confused about how supporting facts and hypotheses 
relate to each other: you seemed to imply that because your facts support your 
hypothesis the ball is somehow on the other side. As things stand now we 
clearly need more facts, to exclude more of the many possibilities.

So i wanted to clear up these basics of science first, before any of us wastes 
too much time on writing mails and such. Oh ... never mind ;-)

Thanks,

	Ingo


