* divide by zero bug in find_busiest_group
@ 2010-08-26  1:17 Chetan Ahuja
  2010-08-26 19:19 ` Venkatesh Pallipadi
  0 siblings, 1 reply; 8+ messages in thread
From: Chetan Ahuja @ 2010-08-26  1:17 UTC (permalink / raw)
  To: linux-kernel

This has been filed as a bug in the kernel bugzilla
(https://bugzilla.kernel.org/show_bug.cgi?id=16991), but visibility
on bugzilla seems low (and the bugzilla server seems to get overly
"stressed" during certain parts of the day), so here's my "summary"
of the discussion so far. If for nothing else, so it gets indexed by
search engines.

We've seen a divide-by-zero crash in the function update_sg_lb_stats
(inlined into find_busiest_group) at the following location:

 /usr/src/linux/kernel/sched.c:3769

                *balance = 0;
                return;
        }

        /* Adjust by relative CPU power of the group */
        sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) /
                                group->cpu_power;

    aff5:  48 c1 e0 0a    shl    $0xa,%rax
    aff9:  48 f7 f6       div    %rsi

Apparently group->cpu_power can be zero under some conditions.
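
(Purely for illustration, the defensive guard under discussion would
have roughly this shape at the crash site; a sketch, not a proposed
patch:)

        /* Illustrative guard only: bail out of the stats update
         * instead of dividing by a zero cpu_power. */
        if (unlikely(!group->cpu_power))
                return;

        /* Adjust by relative CPU power of the group */
        sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) /
                                group->cpu_power;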

    I noticed (what I thought was) a race condition between cpu_power
being initialized to zero (in the build_*_groups functions in
sched.c) and its use as a denominator in the
find_busiest/idlest_group functions. PeterZ replied that there's a
safe codepath from the build_*_groups functions to the crash location
which guarantees a non-zero value. I did express concern that in the
absence of explicit synchronization/memory barriers we're at the
mercy of the compiler and hardware doing us favors (by not reordering
instructions in an adverse way) for that guarantee. But I don't think
we got hit by the initial zeroes, because all the crashes I saw
happened after many months of uptime.

  There's also another place where the group->cpu_power value gets
updated without any synchronization: in the update_cpu_power
function. Though the only way this could result in a bad value for
cpu_power is by core A reading an in-transit value of a
non-atomically-updated 64-bit value from core B :-). Unlikely?
Very! Should we make that update explicitly atomic? It would be
prudent.
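
(For illustration, with made-up helper names, a tear-free update/read
could look like the sketch below; ACCESS_ONCE() is the primitive
available in kernels of this era, and it only forces the compiler to
emit a single access to the word, it adds no ordering:)

/*
 * Hypothetical helpers, not in-tree code: make the single-word
 * update and read of cpu_power explicit.
 */
static inline void sg_set_power(struct sched_group *group,
                                unsigned int power)
{
        ACCESS_ONCE(group->cpu_power) = power;
}

static inline unsigned int sg_get_power(struct sched_group *group)
{
        return ACCESS_ONCE(group->cpu_power);
}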

We do need more ideas on how the zero could have gotten there. The two
paths I mentioned above don't
provide that warm, fuzzy feeling yet.

Thanks
Chetan

P.S.

a) Kernel version: 2.6.32 release from kernel.org, though a similar
   divide-by-zero has been reported as recently as 2.6.35 in an
   Ubuntu distribution kernel here:
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/615135
b) Hardware: 8-core Nehalem (Intel E5520); /proc/cpuinfo shows 16
   "hyperthreaded" cores.

some relevant CONFIG settings:
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_ACPI_NUMA=y
.
.
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y

CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_HRTICK=y
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_SCHED_TRACER is not set


* Re: divide by zero bug in find_busiest_group
  2010-08-26  1:17 divide by zero bug in find_busiest_group Chetan Ahuja
@ 2010-08-26 19:19 ` Venkatesh Pallipadi
  2010-08-26 23:52   ` Chetan Ahuja
  2010-08-27  8:08   ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Venkatesh Pallipadi @ 2010-08-26 19:19 UTC (permalink / raw)
  To: Chetan Ahuja; +Cc: linux-kernel, Suresh Siddha, Peter Zijlstra

On Wed, Aug 25, 2010 at 6:17 PM, Chetan Ahuja <chetan.ahuja@gmail.com> wrote:
> This has been filed as a bug in the kernel bugzilla
> (https://bugzilla.kernel.org/show_bug.cgi?id=16991), but visibility
> on bugzilla seems low (and the bugzilla server seems to get overly
> "stressed" during certain parts of the day), so here's my "summary"
> of the discussion so far. If for nothing else, so it gets indexed by
> search engines.
>
> We've seen a divide-by-zero crash in the function update_sg_lb_stats
> (inlined into find_busiest_group) at the following location:
>
>  /usr/src/linux/kernel/sched.c:3769
>
>                 *balance = 0;
>                 return;
>         }
>
>         /* Adjust by relative CPU power of the group */
>         sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) /
>                                 group->cpu_power;
>
>     aff5:  48 c1 e0 0a    shl    $0xa,%rax
>     aff9:  48 f7 f6       div    %rsi
>
> Apparently group->cpu_power can be zero under some conditions.
>
>     I noticed (what I thought was) a race condition between
> cpu_power being initialized to zero (in the build_*_groups functions
> in sched.c) and its use as a denominator in the
> find_busiest/idlest_group functions. PeterZ replied that there's a
> safe codepath from the build_*_groups functions to the crash
> location which guarantees a non-zero value. I did express concern
> that in the absence of explicit synchronization/memory barriers
> we're at the mercy of the compiler and hardware doing us favors (by
> not reordering instructions in an adverse way) for that guarantee.
> But I don't think we got hit by the initial zeroes, because all the
> crashes I saw happened after many months of uptime.
>
>   There's also another place where the group->cpu_power value gets
> updated without any synchronization: in the update_cpu_power
> function. Though the only way this could result in a bad value for
> cpu_power is by core A reading an in-transit value of a
> non-atomically-updated 64-bit value from core B :-). Unlikely?
> Very! Should we make that update explicitly atomic? It would be
> prudent.
>

There is another potential bug that may cause zero power.

I actually saw a similar problem a while back, while I was playing
around with the rt_avg stuff. But I have only seen the problem with
my modified kernel (playing with the softirq accounting part) and not
with the vanilla kernel. Maybe your workload includes some RT tasks
that trigger this in the vanilla kernel as well.

This is what I saw happening in my case:
scale_rt_power()
- In some corner case, total ends up being less than rq->rt_avg.
- As a result, available is negative.
- scale_rt_power() returns a negative value.
update_cpu_power()
- returns negative power

When we later sum up power across groups, we can end up with a group
having negative power, making the summed power zero for a parent
group and causing the divide-by-zero.

I now have this change in my tree along with my softirq+rt_avg changes
and haven't seen the failure since.

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ae2a225..8c31e38 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2858,6 +2858,9 @@ unsigned long scale_rt_power(int cpu)
 	u64 total, available;

 	total = sched_avg_period() + (rq->clock - rq->age_stamp);
+	if (unlikely(total < rq->rt_avg))
+		return 0;
+
 	available = total - rq->rt_avg;

 	if (unlikely((s64)total < SCHED_LOAD_SCALE))



The theory I had on why total can be less than rq->rt_avg:
- (Time T0) update_curr calls sched_rt_avg_update() and delta
accumulation starts.
- (Time T1) lb's scale_rt_power calls sched_avg_update(), and say
aging happens at this point (while rt_avg doesn't have the delta
added yet).
- (Time T2) timer interrupt -> update_curr calls sched_rt_avg_update()
with no aging of the timestamp, but the delta is added to rt_avg.
- Time passes.
- (Time T3) lb calls scale_rt_power again, and at this point
  total  = period + (T3 - T1)
  rt_avg = value halved at T1 (which can be ~period) + (T3 - T0)
which can be slightly greater than total, spoiling the math from
there on.
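
To put rough, purely illustrative numbers on it (assuming a
sched_avg_period() of about 0.5 s, i.e. 5e8 ns): with T0 = 0,
T1 = 1 ms and T3 = 100 ms,

  total  = 5e8 ns + (T3 - T1)                        = 5.99e8 ns
  rt_avg ~ 5e8 ns (near-period value at T1) + (T3 - T0) = 6.0e8 ns

so rt_avg exceeds total by roughly (T1 - T0), and the "available"
subtraction underflows.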

I tried reproducing the problem with the vanilla kernel and couldn't
reproduce it. Maybe your workload is triggering this somehow? Also,
Suresh did recent fixes to rt_avg, and this may not be a problem
after that change; I haven't looked at this case closely since.

Thanks,
Venki


* Re: divide by zero bug in find_busiest_group
  2010-08-26 19:19 ` Venkatesh Pallipadi
@ 2010-08-26 23:52   ` Chetan Ahuja
  2010-08-27  7:51     ` Peter Zijlstra
  2010-08-27 17:39     ` Venkatesh Pallipadi
  2010-08-27  8:08   ` Peter Zijlstra
  1 sibling, 2 replies; 8+ messages in thread
From: Chetan Ahuja @ 2010-08-26 23:52 UTC (permalink / raw)
  To: Venkatesh Pallipadi; +Cc: linux-kernel, Suresh Siddha, Peter Zijlstra

On Thu, Aug 26, 2010 at 12:19 PM, Venkatesh Pallipadi <venki@google.com> wrote:
> On Wed, Aug 25, 2010 at 6:17 PM, Chetan Ahuja <chetan.ahuja@gmail.com> wrote:

> I tried reproducing the problem with the vanilla kernel and couldn't
> reproduce it. Maybe your workload is triggering this somehow? Also,
> Suresh did recent fixes to rt_avg, and this may not be a problem
> after that change; I haven't looked at this case closely since.
>
> Thanks,
> Venki
>

Venkatesh,

     Thanks for your detailed response. I admit complete ignorance of
the actual logic of the complicated cpu_power computations (or
indeed, of any other part of the scheduler code); I'm just looking at
it from the perspective of bug-hunting in an unfamiliar piece of
code. I think the three ways we've identified so far by which
cpu_power can be zero (all three admittedly extremely low-probability
events) are warning signs that the code is trying to do a bit too
much. It seems rather difficult to prove the correctness of the math
or the synchronization, for my taste. Is there a clear and present
use case driving all these complicated computations?

 For one thing, I don't understand the reason for recomputing
cpu_power dynamically. How does the cpu_power of a group (which I
believe refers to the group of "hyper-threaded" cores on one physical
core) change while the system is in operation? Is this in support of
hot-swappable CPUs or something?

  To be honest, for my use case I'd like to just disable the whole
work-stealing machinery (referred to as load balancing in the
kernel). There seems to be no config option that isolates just the
dynamic load-balancing parts (or is there?), so maybe I'll just hack
away the load-balancing functions by hand, so as never to worry about
those denominators again.

Chetan


* Re: divide by zero bug in find_busiest_group
  2010-08-26 23:52   ` Chetan Ahuja
@ 2010-08-27  7:51     ` Peter Zijlstra
  2010-08-27  8:13       ` Peter Zijlstra
  2010-08-27 17:39     ` Venkatesh Pallipadi
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2010-08-27  7:51 UTC (permalink / raw)
  To: Chetan Ahuja; +Cc: Venkatesh Pallipadi, linux-kernel, Suresh Siddha

On Thu, 2010-08-26 at 16:52 -0700, Chetan Ahuja wrote:
>  For one thing, I don't understand the reason for recomputing
> cpu_power dynamically. How does the cpu_power of a group (which I
> believe refers to the group of "hyper-threaded" cores on one
> physical core) change while the system is in operation? Is this in
> support of hot-swappable CPUs or something?

No, the groups are used for increasingly larger accumulations of
cpus: the first group is the singleton group, which contains only the
one cpu in question; after that we start following the machine
topology (hyper-threads, multi-core, package, node, etc.) until the
last group spans the whole machine.
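
Illustratively, on a 16-thread box like the one reported (two
packages, four cores each, two threads per core; the cpu numbering
here is made up):

  SMT groups:      {0,1} {2,3} ... {14,15}   (hyper-thread siblings)
  MC groups:       {0..7} {8..15}            (cores in one package)
  top-level group: {0..15}                   (the whole machine)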

The dynamic cpu_power computation comes from RT tasks: when we
load-balance SCHED_OTHER tasks, we want each cpu to have an equal
amount of pending work, but when one cpu is, say, 50% busy with RT
tasks (and hopefully soon IRQ load too), it cannot perform the same
amount of work as another, unloaded cpu, so we need to normalize the
relative load levels between cpus. We use cpu_power as this load
normalization factor.
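
For example (illustrative numbers): with SCHED_LOAD_SCALE = 1024, a
cpu that loses half its time to RT work ends up with cpu_power around
512, so the avg_load computation from the crash site,
group_load * 1024 / 512, makes the same group_load look twice as
heavy as it would on a fully available cpu.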

Load balancing works by balancing this group hierarchy; a number of
groups are contained in a domain. Each cpu does a load-balance pull
from the busiest group in that domain to his local domain. Since
there are ever-smaller groups/domains, you eventually end up with a
nicely spread load.

>   To be honest, for my use case I'd like to just disable the whole
> work-stealing machinery (referred to as load balancing in the
> kernel). There seems to be no config option that isolates just the
> dynamic load-balancing parts (or is there?), so maybe I'll just hack
> away the load-balancing functions by hand, so as never to worry
> about those denominators again.

You can do that at runtime, read: Documentation/cgroups/cpusets.txt


* Re: divide by zero bug in find_busiest_group
  2010-08-26 19:19 ` Venkatesh Pallipadi
  2010-08-26 23:52   ` Chetan Ahuja
@ 2010-08-27  8:08   ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2010-08-27  8:08 UTC (permalink / raw)
  To: Venkatesh Pallipadi; +Cc: Chetan Ahuja, linux-kernel, Suresh Siddha

On Thu, 2010-08-26 at 12:19 -0700, Venkatesh Pallipadi wrote:
> >   There's also another place where the group->cpu_power value gets
> > updated without any synchronization: in the update_cpu_power
> > function. Though the only way this could result in a bad value for
> > cpu_power is by core A reading an in-transit value of a
> > non-atomically-updated 64-bit value from core B :-). Unlikely?
> > Very! Should we make that update explicitly atomic? It would be
> > prudent.

struct sched_group {
	...
        unsigned int cpu_power, cpu_power_orig;
	...
}

We assume that things are naturally aligned and that reads/writes to
word-sized entities are 'atomic' -- lots of kernel code makes this
assumption.

So the worst thing that can happen with the unsynchronized update of
cpu_power is that a reader sees the old value; no problem, we don't
care, it's all statistics anyway. Racing writers (which could happen
between the regular and nohz load-balancers) we don't care about
either, since the result is either one value or the other, not a
mixture of both.




* Re: divide by zero bug in find_busiest_group
  2010-08-27  7:51     ` Peter Zijlstra
@ 2010-08-27  8:13       ` Peter Zijlstra
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2010-08-27  8:13 UTC (permalink / raw)
  To: Chetan Ahuja; +Cc: Venkatesh Pallipadi, linux-kernel, Suresh Siddha

On Fri, 2010-08-27 at 09:51 +0200, Peter Zijlstra wrote:
> 
> Load balancing works by balancing this group hierarchy; a number of
> groups are contained in a domain. Each cpu does a load-balance pull
> from the busiest group in that domain to his local domain.

s/local domain/local group/

> Since
> there are ever-smaller groups/domains, you eventually end up with a
> nicely spread load.


* Re: divide by zero bug in find_busiest_group
  2010-08-26 23:52   ` Chetan Ahuja
  2010-08-27  7:51     ` Peter Zijlstra
@ 2010-08-27 17:39     ` Venkatesh Pallipadi
  1 sibling, 0 replies; 8+ messages in thread
From: Venkatesh Pallipadi @ 2010-08-27 17:39 UTC (permalink / raw)
  To: Chetan Ahuja; +Cc: linux-kernel, Suresh Siddha, Peter Zijlstra

On Thu, Aug 26, 2010 at 4:52 PM, Chetan Ahuja <chetan.ahuja@gmail.com> wrote:
> On Thu, Aug 26, 2010 at 12:19 PM, Venkatesh Pallipadi <venki@google.com> wrote:
>> On Wed, Aug 25, 2010 at 6:17 PM, Chetan Ahuja <chetan.ahuja@gmail.com> wrote:
>
>> I tried reproducing the problem with the vanilla kernel and
>> couldn't reproduce it. Maybe your workload is triggering this
>> somehow? Also, Suresh did recent fixes to rt_avg, and this may not
>> be a problem after that change; I haven't looked at this case
>> closely since.
>>
>> Thanks,
>> Venki
>>
>
> Venkatesh,
>
>     Thanks for your detailed response. I admit complete ignorance of
> the actual logic of the complicated cpu_power computations (or
> indeed, of any other part of the scheduler code); I'm just looking
> at it from the perspective of bug-hunting in an unfamiliar piece of
> code. I think the three ways we've identified so far by which
> cpu_power can be zero (all three admittedly extremely
> low-probability events) are warning signs that the code is trying
> to do a bit too much. It seems rather difficult to prove the
> correctness of the math or the synchronization, for my taste. Is
> there a clear and present use case driving all these complicated
> computations?
>
>  For one thing, I don't understand the reason for recomputing
> cpu_power dynamically. How does the cpu_power of a group (which I
> believe refers to the group of "hyper-threaded" cores on one
> physical core) change while the system is in operation? Is this in
> support of hot-swappable CPUs or something?
>
>   To be honest, for my use case I'd like to just disable the whole
> work-stealing machinery (referred to as load balancing in the
> kernel). There seems to be no config option that isolates just the
> dynamic load-balancing parts (or is there?), so maybe I'll just hack
> away the load-balancing functions by hand, so as never to worry
> about those denominators again.
>

Chetan,

For the purpose of this bug (assuming you can run a modified kernel
in your environment), you could add the above change and/or some
printk's or WARN_ON_ONCE()s to see whether power goes negative or
zero at any point. That should give some more clue about what may be
happening here.
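
Something along these lines, say, inside update_cpu_power() (an
illustrative sketch with made-up message text, not a tested patch):

        /* Illustrative instrumentation: catch a non-positive power
         * before it is stored and summed into parent groups. */
        if (unlikely((long)power <= 0)) {
                WARN_ON_ONCE(1);
                printk(KERN_WARNING
                       "sched: cpu %d power went non-positive (%ld)\n",
                       cpu, (long)power);
                power = 1;      /* illustrative clamp */
        }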

Thanks,
Venki


* Re: divide by zero bug in find_busiest_group
@ 2010-09-15 20:29 Joe Williams
  0 siblings, 0 replies; 8+ messages in thread
From: Joe Williams @ 2010-09-15 20:29 UTC (permalink / raw)
  To: linux-kernel

> This has been filed as a bug in the kernel bugzilla (https://bugzilla.kernel.org/show_bug.cgi?id=16991), but visibility on bugzilla seems low (and the bugzilla server seems to get overly "stressed" during certain parts of the day), so here's my "summary" of the discussion so far. If for nothing else, so it gets indexed by search engines.

Chetan,

Thanks for starting this bug ticket; I have been running into it as well. A fix would be much appreciated.

-Joe


Name: Joseph A. Williams
Email: joe@joetify.com
Blog: http://www.joeandmotorboat.com/
Twitter: http://twitter.com/williamsjoe


