* Commit cb83b62 fails to boot with a divide by zero error.
@ 2012-05-11 13:39 Robin Holt
  2012-05-11 14:33 ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Robin Holt @ 2012-05-11 13:39 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel

Ingo, there is a breakage in the x86/master branch.

While testing some of our configurations, we found they would not boot.
The following got things working:

--- linux.orig/kernel/sched/fair.c      2012-05-11 06:29:44.000000000 -0500
+++ linux/kernel/sched/fair.c   2012-05-11 06:31:52.217156410 -0500
@@ -3835,7 +3835,7 @@ static inline void update_sg_lb_stats(st
        }
 
        /* Adjust by relative CPU power of the group */
-       sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
+       sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / max(group->sgp->power, 1);
 
        /*
         * Consider the group unbalanced when the imbalance is larger


We found that reverting the commit:
cb83b62 (x86/sched/core) sched/numa: Rewrite the CONFIG_NUMA sched domain support

also got things working.

Thanks,
Robin Holt
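
The workaround in the diff above can be illustrated outside the kernel. What follows is a minimal user-space sketch, not kernel code: the value of SCHED_POWER_SCALE is an assumption made for this example, and clamp_power() and avg_load() are hypothetical helpers that only mirror the shape of the patched expression. It shows that clamping the divisor to at least 1 avoids the trap while leaving the normal case unchanged.

```c
/* Assumed scale factor for this sketch; stands in for the kernel's SCHED_POWER_SCALE. */
#define SCHED_POWER_SCALE 1024UL

/* Hypothetical helper: never hand back a divisor of zero. */
static unsigned long clamp_power(unsigned long power)
{
	return power ? power : 1UL;
}

/* Scale a group's load by its relative CPU power, with the division guarded. */
static unsigned long avg_load(unsigned long group_load, unsigned long power)
{
	return (group_load * SCHED_POWER_SCALE) / clamp_power(power);
}
```

A guard like this only hides the symptom. As noted later in the thread, ->power is not supposed to reach 0 at all, so a zero divisor points at a problem in how the groups were built.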


* Re: Commit cb83b62 fails to boot with a divide by zero error.
  2012-05-11 13:39 Commit cb83b62 fails to boot with a divide by zero error Robin Holt
@ 2012-05-11 14:33 ` Peter Zijlstra
  2012-05-11 15:05   ` Robin Holt
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2012-05-11 14:33 UTC (permalink / raw)
  To: Robin Holt; +Cc: Ingo Molnar, linux-kernel

On Fri, 2012-05-11 at 08:39 -0500, Robin Holt wrote:

> We found that reverting the commit:
> cb83b62 (x86/sched/core) sched/numa: Rewrite the CONFIG_NUMA sched domain support
> 
> also got things working.

There's a particularly stupid bug in that code:



---
Subject: sched, numa: Fix the new NUMA topology bits
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri May 11 00:56:20 CEST 2012

There's no need to convert a node number to a node number by
pretending it's a CPU number.

Reported-by: Yinghai Lu <yinghai@kernel.org>
Reported-and-Tested-by: Greg Pearson <greg.pearson@hp.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched/core.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6378,8 +6378,7 @@ static void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;
 
 			for (k = 0; k < nr_node_ids; k++) {
-				if (node_distance(cpu_to_node(j), k) >
-						sched_domains_numa_distance[i])
+				if (node_distance(j, k) > sched_domains_numa_distance[i])
 					continue;
 
 				cpumask_or(mask, mask, cpumask_of_node(k));
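
The effect of the one-line change above can be seen in a toy model. The sketch below is a user-space illustration under stated assumptions, not the kernel implementation: the two-node topology, the distance values, the bitmask representation of CPU masks, and the numa_mask() function with its buggy flag are all invented for this example. Only the loop body follows the shape of the patched code. In the loop being patched, j is already a node index, so passing it through cpu_to_node() looks up the node of an unrelated CPU.

```c
#define NR_NODES      2
#define CPUS_PER_NODE 2

/* Toy distance table (assumed values): 10 = local, 20 = remote. */
static const int distance[NR_NODES][NR_NODES] = { {10, 20}, {20, 10} };

static int node_distance(int a, int b)
{
	return distance[a][b];
}

/* In this toy topology CPUs 0-1 sit on node 0 and CPUs 2-3 on node 1. */
static int cpu_to_node(int cpu)
{
	return cpu / CPUS_PER_NODE;
}

/* Bitmask of the CPUs on a node: node 0 -> 0x3, node 1 -> 0xC. */
static unsigned int cpumask_of_node(int node)
{
	return ((1u << CPUS_PER_NODE) - 1u) << (node * CPUS_PER_NODE);
}

/*
 * Build the mask of CPUs within max_distance of node j.
 * With buggy != 0, j is first (wrongly) treated as a CPU number.
 */
static unsigned int numa_mask(int j, int max_distance, int buggy)
{
	int src = buggy ? cpu_to_node(j) : j;
	unsigned int mask = 0;

	for (int k = 0; k < NR_NODES; k++) {
		if (node_distance(src, k) > max_distance)
			continue;
		mask |= cpumask_of_node(k);
	}
	return mask;
}
```

In this model the buggy variant gives node 1 the local mask of node 0, because CPU 1 lives on node 0. The model says nothing about whether this bug is related to the divide by zero reported in this thread.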



* Re: Commit cb83b62 fails to boot with a divide by zero error.
  2012-05-11 14:33 ` Peter Zijlstra
@ 2012-05-11 15:05   ` Robin Holt
  2012-05-11 15:36     ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Robin Holt @ 2012-05-11 15:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Robin Holt, Ingo Molnar, linux-kernel

On Fri, May 11, 2012 at 04:33:10PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-11 at 08:39 -0500, Robin Holt wrote:
> 
> > We found that reverting the commit:
> > cb83b62 (x86/sched/core) sched/numa: Rewrite the CONFIG_NUMA sched domain support
> > 
> > also got things working.
> 
> there's a particularly stupid bug in that code

Even with that applied, I still get the divide by zero.

Robin



* Re: Commit cb83b62 fails to boot with a divide by zero error.
  2012-05-11 15:05   ` Robin Holt
@ 2012-05-11 15:36     ` Peter Zijlstra
  2012-05-11 15:55       ` Robin Holt
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2012-05-11 15:36 UTC (permalink / raw)
  To: Robin Holt; +Cc: Ingo Molnar, linux-kernel

On Fri, 2012-05-11 at 10:05 -0500, Robin Holt wrote:
> On Fri, May 11, 2012 at 04:33:10PM +0200, Peter Zijlstra wrote:
> > On Fri, 2012-05-11 at 08:39 -0500, Robin Holt wrote:
> > 
> > > We found that reverting the commit:
> > > cb83b62 (x86/sched/core) sched/numa: Rewrite the CONFIG_NUMA sched domain support
> > > 
> > > also got things working.
> > 
> > there's a particularly stupid bug in that code
> 
> Even with that applied, I still get the divide by zero.

Humm.. what kind of machine is this? And how far along does it get in
booting? ->power isn't supposed to get to 0.


* Re: Commit cb83b62 fails to boot with a divide by zero error.
  2012-05-11 15:36     ` Peter Zijlstra
@ 2012-05-11 15:55       ` Robin Holt
  2012-05-11 16:01         ` Peter Zijlstra
  2012-05-14 10:48         ` Ingo Molnar
  0 siblings, 2 replies; 8+ messages in thread
From: Robin Holt @ 2012-05-11 15:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Robin Holt, Ingo Molnar, linux-kernel

On Fri, May 11, 2012 at 05:36:13PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-11 at 10:05 -0500, Robin Holt wrote:
> > On Fri, May 11, 2012 at 04:33:10PM +0200, Peter Zijlstra wrote:
> > > On Fri, 2012-05-11 at 08:39 -0500, Robin Holt wrote:
> > > 
> > > > We found that reverting the commit:
> > > > cb83b62 (x86/sched/core) sched/numa: Rewrite the CONFIG_NUMA sched domain support
> > > > 
> > > > also got things working.
> > > 
> > > there's a particularly stupid bug in that code
> > 
> > Even with that applied, I still get the divide by zero.
> 
> Humm.. what kind of machine is this? And how far along does it get in
> booting? ->power isn't supposed to get to 0.

It is a four-blade machine (8 sockets, 80 cores, 160 hyperthreads) with
40 GB of RAM.

Looking at the earlier kernel messages, I am wondering if I don't have a
BIOS that is giving me crud.  I have messages about hyperthreads being
on different nodes.  That had not been happening in the past.  I don't
have access to the machine now, but the BIOS string that had printed
out is from a developer's debug version.

When I get access to the machine again (likely not until Monday), I
will flash a release BIOS and retest.  Until then, please feel free to
ignore me.

Thanks,
Robin


* Re: Commit cb83b62 fails to boot with a divide by zero error.
  2012-05-11 15:55       ` Robin Holt
@ 2012-05-11 16:01         ` Peter Zijlstra
  2012-05-14 10:48         ` Ingo Molnar
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2012-05-11 16:01 UTC (permalink / raw)
  To: Robin Holt; +Cc: Ingo Molnar, linux-kernel

On Fri, 2012-05-11 at 10:55 -0500, Robin Holt wrote:
> 
> It is a four blade (8 socket 80 core 160 hyper-thread machine) with 40
> GB of RAM.

Ok, big but not ridiculously so.. I'll try and see if I can find
anything.

> Looking at the earlier kernel messages, I am wondering if I don't have a
> BIOS that is giving me crud.  I have messages about hyperthreads being
> on different nodes.  That had not been happening in the past.  I don't
> have access to the machine now, but the BIOS string that had printed
> out is from a developer's debug version.
> 
> When I get access to the machine again (likely not until Monday), I
> will flash a release BIOS and retest.  Until then, please feel free to
> ignore me. 

No need, that's another (known) bug.. 

  http://marc.info/?l=linux-kernel&m=133673488311294&w=2




* Re: Commit cb83b62 fails to boot with a divide by zero error.
  2012-05-11 15:55       ` Robin Holt
  2012-05-11 16:01         ` Peter Zijlstra
@ 2012-05-14 10:48         ` Ingo Molnar
  2012-05-14 12:40           ` Robin Holt
  1 sibling, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2012-05-14 10:48 UTC (permalink / raw)
  To: Robin Holt; +Cc: Peter Zijlstra, linux-kernel


* Robin Holt <holt@sgi.com> wrote:

> On Fri, May 11, 2012 at 05:36:13PM +0200, Peter Zijlstra wrote:
> > On Fri, 2012-05-11 at 10:05 -0500, Robin Holt wrote:
> > > On Fri, May 11, 2012 at 04:33:10PM +0200, Peter Zijlstra wrote:
> > > > On Fri, 2012-05-11 at 08:39 -0500, Robin Holt wrote:
> > > > 
> > > > > We found that reverting the commit:
> > > > > cb83b62 (x86/sched/core) sched/numa: Rewrite the CONFIG_NUMA sched domain support
> > > > > 
> > > > > also got things working.
> > > > 
> > > > there's a particularly stupid bug in that code
> > > 
> > > Even with that applied, I still get the divide by zero.
> > 
> > Humm.. what kind of machine is this? And how far along does it get in
> > booting? ->power isn't supposed to get to 0.
> 
> It is a four blade (8 socket 80 core 160 hyper-thread machine) 
> with 40 GB of RAM.
> 
> Looking at the earlier kernel messages, I am wondering if I 
> don't have a BIOS that is giving me crud.  I have messages 
> about hyperthreads being on different nodes.  That had not 
> been happening in the past.  I don't have access to the 
> machine now, but the BIOS string that had printed out is from 
> a developer's debug version.
> 
> When I get access to the machine again (likely not until 
> Monday), I will flash a release BIOS and retest.  Until then, 
> please feel free to ignore me.

Please don't re-flash the BIOS! We want to fix this bug - the 
kernel should never crash on whatever topology data the BIOS 
passes.

We can sanitize it or ignore it, but crashing is not an option.
So let's figure this out, ok?

Thanks,

	Ingo


* Re: Commit cb83b62 fails to boot with a divide by zero error.
  2012-05-14 10:48         ` Ingo Molnar
@ 2012-05-14 12:40           ` Robin Holt
  0 siblings, 0 replies; 8+ messages in thread
From: Robin Holt @ 2012-05-14 12:40 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Robin Holt, Peter Zijlstra, linux-kernel

On Mon, May 14, 2012 at 12:48:29PM +0200, Ingo Molnar wrote:
> 
> * Robin Holt <holt@sgi.com> wrote:
> 
> > On Fri, May 11, 2012 at 05:36:13PM +0200, Peter Zijlstra wrote:
> > > On Fri, 2012-05-11 at 10:05 -0500, Robin Holt wrote:
> > > > On Fri, May 11, 2012 at 04:33:10PM +0200, Peter Zijlstra wrote:
> > > > > On Fri, 2012-05-11 at 08:39 -0500, Robin Holt wrote:
> > > > > 
> > > > > > We found that reverting the commit:
> > > > > > cb83b62 (x86/sched/core) sched/numa: Rewrite the CONFIG_NUMA sched domain support
> > > > > > 
> > > > > > also got things working.
> > > > > 
> > > > > there's a particularly stupid bug in that code
> > > > 
> > > > Even with that applied, I still get the divide by zero.
> > > 
> > > Humm.. what kind of machine is this? And how far along does it get in
> > > booting? ->power isn't supposed to get to 0.
> > 
> > It is a four blade (8 socket 80 core 160 hyper-thread machine) 
> > with 40 GB of RAM.
> > 
> > Looking at the earlier kernel messages, I am wondering if I 
> > don't have a BIOS that is giving me crud.  I have messages 
> > about hyperthreads being on different nodes.  That had not 
> > been happening in the past.  I don't have access to the 
> > machine now, but the BIOS string that had printed out is from 
> > a developer's debug version.
> > 
> > When I get access to the machine again (likely not until 
> > Monday), I will flash a release BIOS and retest.  Until then, 
> > please feel free to ignore me.
> 
> Please don't re-flash the BIOS! We want to fix this bug - the 
> kernel should never crash on whatever topology data the BIOS 
> passes.
> 
> We can sanitize it or ignore it, but crashing is not an option. 
> So lets figure this out, ok?

I have the old BIOS as well so I can flash back.  Plus, I have the
BIOS developer's description of his changes and he has saved his
workarea.  Toggling back and forth should not be a problem to help
us determine the source and "correct" fix.

Robin

