* Latest Linus tree oopses on Nehalem box
@ 2009-08-21 10:53 Jes Sorensen
  2009-08-21 11:46 ` Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: Jes Sorensen @ 2009-08-21 10:53 UTC
  To: linux-kernel; +Cc: Ingo Molnar, Linus Torvalds

Hi,

I am seeing this one with the latest Linus' git tree as of this morning
on a Nehalem box. Using the defconfig + megaraid driver.

Not sure if this is already fixed, or if someone already knows what's
wrong? Smells like yet another BIOS bug - yes, the BIOS on this thing
is rubbish.

Cheers,
Jes

Starting Bluetooth services: [  OK  ]
Starting sshd: [  OK  ]
[    1.380099] pci 0000:01:01.0: BAR 6: address space collision on of 
device [0xfbbc0000-0xfbbdffff]

                Welcome to Fedora 

                 Press 'I' to enter interactive startup. 
          Starting udev: [    6.468279] BUG: unable to handle kernel 
NULL pointer dereference at 0000000000000008
[    6.491835] IP: [<ffffffff810391e7>] find_busiest_group+0x620/0x6fd
[    6.499207] usb usb8: uevent
[    6.499220] usb usb3: uevent
[    6.499232] usb usb6: uevent
[    6.499249] usb usb1: uevent
[    6.499373] usb usb2: uevent
[    6.499408] usb usb7: uevent
[    6.499602] usb usb4: uevent
[    6.499949] usb usb5: uevent
[    6.501821] usb 1-5: uevent
[    6.588040] PGD 0
[    6.594124] Oops: 0000 [#1] SMP
[    6.603870] last sysfs file: /sys/devices/virtual/vc/vcsa1/dev
[    6.621339] CPU 1
[    6.627420] Modules linked in: [last unloaded: scsi_wait_scan]
[    6.644994] Pid: 0, comm: swapper Not tainted 2.6.31-rc6 #15 
AltixXE270
[    6.664800] RIP: 0010:[<ffffffff810391e7>]  [<ffffffff810391e7>] 
find_busiest_group+0x620/0x6fd
[    6.690897] RSP: 0018:ffffc90000203c50  EFLAGS: 00010216
[    6.706805] RAX: 0000000000000000 RBX: 0000000000000716 RCX: 
ffffc9000000e160
[    6.728173] RDX: 00000000000009c5 RSI: 0000000000000000 RDI: 
0000000000000040
[    6.749540] RBP: ffffc90000203dc0 R08: 0000000000000000 R09: 
00000000000002af
[    6.770908] R10: ffffc90001613980 R11: 0000000000000000 R12: 
ffffc9000080e160
[    6.792274] R13: 0000000000013980 R14: 0000000000000001 R15: 
ffffc90000203e58
[    6.813643] FS:  0000000000000000(0000) GS:ffffc90000200000(0000) 
knlGS:0000000000000000
[    6.837869] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    6.855076] CR2: 0000000000000008 CR3: 0000000001001000 CR4: 
00000000000006e0
[    6.876445] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[    6.897812] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[    6.919179] Process swapper (pid: 0, threadinfo ffff88063e458000, 
task ffff88063e452d60)
[    6.943404] Stack:
[    6.949436]  0000000000013980 0000000000000000 ffffc90000000000 
ffffc9000000e160
[    6.971165] <0> 0000000000013990 0000000000013980 ffffc9000020de18 
ffffc90000203e58
[    6.994302] <0> 0000000000000010 0000000000013980 0000000000013980 
ffffc90000203e64
[    7.017982] Call Trace:
[    7.025312]  <IRQ>
[    7.031657]  [<ffffffff8103c66b>] rebalance_domains+0x173/0x513
[    7.049388]  [<ffffffff8105e1fa>] ? clocksource_read+0xa/0xc
[    7.066333]  [<ffffffff8103ca8d>] run_rebalance_domains+0x82/0xcc
[    7.084582]  [<ffffffff8101ea54>] ? apic_write+0x11/0x13
[    7.100491]  [<ffffffff81047c55>] __do_softirq+0xd2/0x19c
[    7.116658]  [<ffffffff8100cb9c>] call_softirq+0x1c/0x28
[    7.132566]  [<ffffffff8100df68>] do_softirq+0x34/0x72
[    7.147954]  [<ffffffff8104799a>] irq_exit+0x3f/0x81
[    7.162823]  [<ffffffff8101f28b>] smp_apic_timer_interrupt+0x81/0x8f
[    7.181852]  [<ffffffff8100c573>] apic_timer_interrupt+0x13/0x20
[    7.199837]  <EOI>
[    7.206185]  [<ffffffff81226c7b>] ? acpi_idle_do_entry+0x3f/0x60
[    7.224169]  [<ffffffff81226cf7>] ? acpi_idle_enter_c1+0x5b/0xa4
[    7.242157]  [<ffffffff81226e11>] ? acpi_idle_enter_bm+0xd1/0x285
[    7.260408]  [<ffffffff8138afec>] ? cpuidle_idle_call+0x88/0xc0
[    7.278133]  [<ffffffff8100abc4>] ? cpu_idle+0x52/0x95
[    7.293522]  [<ffffffff815040c9>] ? start_secondary+0x179/0x17d
[    7.311248] Code: 09 49 c7 07 00 00 00 00 eb 51 48 8b b5 18 ff ff ff 
48 89 ca 49 89 c1 48 29 c8 48 8b 8d 10 ff ff ff 48 2b 95 38 ff ff ff 49 
29 d9 <8b> 76 08 8b 49 08 48 0f af d6 49 39 c1 49 0f 46 c1 48 0f af c1
[    7.370696] RIP  [<ffffffff810391e7>] find_busiest_group+0x620/0x6fd
[    7.389776]  RSP <ffffc90000203c50>
[    7.400226] CR2: 0000000000000008
[    7.410160] BUG: unable to handle kernel
[    7.410163] ---[ end trace ceb5be95b4c33d3f ]---
[    7.410166] Kernel panic - not syncing: Fatal exception in interrupt
[    7.410169] Pid: 0, comm: swapper Tainted: G      D    2.6.31-rc6 #15
[    7.410169] Call Trace:
[    7.410170]  <IRQ>  [<ffffffff81507b67>] panic+0x75/0x11c
[    7.410176]  [<ffffffff8150a81c>] oops_end+0xa9/0xb9
[    7.410180]  [<ffffffff8102b250>] no_context+0x1f1/0x200
[    7.410182]  [<ffffffff8102b3f7>] __bad_area_nosemaphore+0x198/0x1be
[    7.410185]  [<ffffffff8105b8a8>] ? sched_clock_cpu+0x18/0x151
[    7.410187]  [<ffffffff8102b42b>] bad_area_nosemaphore+0xe/0x10
[    7.410189]  [<ffffffff8150bbdb>] do_page_fault+0x135/0x273
[    7.410191]  [<ffffffff81509d7f>] page_fault+0x1f/0x30
[    7.410193]  [<ffffffff810391e7>] ? find_busiest_group+0x620/0x6fd
[    7.410196]  [<ffffffff8103c66b>] rebalance_domains+0x173/0x513
[    7.410198]  [<ffffffff8105e1fa>] ? clocksource_read+0xa/0xc
[    7.410200]  [<ffffffff8103ca8d>] run_rebalance_domains+0x82/0xcc
[    7.410202]  [<ffffffff8101ea54>] ? apic_write+0x11/0x13
[    7.410204]  [<ffffffff81047c55>] __do_softirq+0xd2/0x19c
[    7.410206]  [<ffffffff8100cb9c>] call_softirq+0x1c/0x28
[    7.410208]  [<ffffffff8100df68>] do_softirq+0x34/0x72
[    7.410209]  [<ffffffff8104799a>] irq_exit+0x3f/0x81
[    7.410211]  [<ffffffff8101f28b>] smp_apic_timer_interrupt+0x81/0x8f
[    7.410213]  [<ffffffff8100c573>] apic_timer_interrupt+0x13/0x20
[    7.410214]  <EOI>  [<ffffffff81226c7b>] ? acpi_idle_do_entry+0x3f/0x60
[    7.410218]  [<ffffffff81226cf7>] ? acpi_idle_enter_c1+0x5b/0xa4
[    7.410220]  [<ffffffff81226e11>] ? acpi_idle_enter_bm+0xd1/0x285
[    7.410222]  [<ffffffff8138afec>] ? cpuidle_idle_call+0x88/0xc0
[    7.410224]  [<ffffffff8100abc4>] ? cpu_idle+0x52/0x95
[    7.410226]  [<ffffffff815040c9>] ? start_secondary+0x179/0x17d
[    7.908334] NULL pointer dereference at 0000000000000008
[    7.924293] IP: [<ffffffff810391e7>] find_busiest_group+0x620/0x6fd
[    7.943115] PGD 0
[    7.949197] Oops: 0000 [#2] SMP
[    7.958945] last sysfs file: /sys/devices/virtual/vc/vcsa1/dev
[    7.976414] CPU 6
[    7.982496] Modules linked in: [last unloaded: scsi_wait_scan]
[    8.000043] Pid: 0, comm: swapper Tainted: G      D    2.6.31-rc6 #15 
AltixXE270
[    8.022189] RIP: 0010:[<ffffffff810391e7>]  [<ffffffff810391e7>] 
find_busiest_group+0x620/0x6fd
[    8.048288] RSP: 0018:ffffc90000c03c50  EFLAGS: 00010216
[    8.064195] RAX: 00000000000007fc RBX: 00000000000007fb RCX: 
ffffc9000080e160
[    8.085563] RDX: 00000000000007fb RSI: 0000000000000000 RDI: 
0000000000000040
[    8.106930] RBP: ffffc90000c03dc0 R08: 0000000000000000 R09: 
00000000000007fc
[    8.128296] R10: ffffc90001613980 R11: 0000000000000000 R12: 
ffffc9000080e160
[    8.149664] R13: 0000000000013980 R14: 0000000000000001 R15: 
ffffc90000c03e58
[    8.171031] FS:  0000000000000000(0000) GS:ffffc90000c00000(0000) 
knlGS:0000000000000000
[    8.195259] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    8.212466] CR2: 0000000000000008 CR3: 0000000001001000 CR4: 
00000000000006e0
[    8.233833] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[    8.255200] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[    8.276568] Process swapper (pid: 0, threadinfo ffff88063e46e000, 
task ffff88033e49a5d0)
[    8.300794] Stack:
[    8.306824]  0000000000013980 ffffc90000000000 ffffffff00000000 
ffffc9000000e160
[    8.328608] <0> 0000000000013990 0000000000013980 ffffc90000c0de18 
ffffc90000c03e58
[    8.351794] <0> 0000000000000010 0000000000013980 0000000000013980 
ffffc90000c03e64
[    8.375553] Call Trace:
[    8.382883]  <IRQ>
[    8.389229]  [<ffffffff8103c66b>] rebalance_domains+0x173/0x513
[    8.406957]  [<ffffffff8105e1fa>] ? clocksource_read+0xa/0xc
[    8.423904]  [<ffffffff8103ca4f>] run_rebalance_domains+0x44/0xcc
[    8.442152]  [<ffffffff8101ea54>] ? apic_write+0x11/0x13
[    8.458063]  [<ffffffff81047c55>] __do_softirq+0xd2/0x19c
[    8.474229]  [<ffffffff8100cb9c>] call_softirq+0x1c/0x28
[    8.490137]  [<ffffffff8100df68>] do_softirq+0x34/0x72
[    8.505525]  [<ffffffff8104799a>] irq_exit+0x3f/0x81
[    8.520394]  [<ffffffff8101f28b>] smp_apic_timer_interrupt+0x81/0x8f
[    8.539422]  [<ffffffff8100c573>] apic_timer_interrupt+0x13/0x20
[    8.557409]  <EOI>
[    8.563755]  [<ffffffff81226f9a>] ? acpi_idle_enter_bm+0x25a/0x285
[    8.582259]  [<ffffffff81226f90>] ? acpi_idle_enter_bm+0x250/0x285
[    8.600770]  [<ffffffff8138afec>] ? cpuidle_idle_call+0x88/0xc0
[    8.618495]  [<ffffffff8100abc4>] ? cpu_idle+0x52/0x95
[    8.633885]  [<ffffffff815040c9>] ? start_secondary+0x179/0x17d
[    8.651612] Code: 09 49 c7 07 00 00 00 00 eb 51 48 8b b5 18 ff ff ff 
48 89 ca 49 89 c1 48 29 c8 48 8b 8d 10 ff ff ff 48 2b 95 38 ff ff ff 49 
29 d9 <8b> 76 08 8b 49 08 48 0f af d6 49 39 c1 49 0f 46 c1 48 0f af c1
[    8.711320] RIP  [<ffffffff810391e7>] find_busiest_group+0x620/0x6fd
[    8.730400]  RSP <ffffc90000c03c50>
[    8.740848] CR2: 0000000000000008
[    8.750780] BUG: unable to handle kernel NULL pointer dereference at 
0000000000000008
[    8.774301] IP: [<ffffffff810391e7>] find_busiest_group+0x620/0x6fd
[    8.793096] PGD 0
 



* Re: Latest Linus tree oopses on Nehalem box
  2009-08-21 10:53 Latest Linus tree oopses on Nehalem box Jes Sorensen
@ 2009-08-21 11:46 ` Ingo Molnar
  2009-08-21 11:58   ` Peter Zijlstra
  2009-08-21 13:04   ` Latest Linus tree oopses on Nehalem box Jes Sorensen
  0 siblings, 2 replies; 14+ messages in thread
From: Ingo Molnar @ 2009-08-21 11:46 UTC
  To: Jes Sorensen, Jens Axboe, Peter Zijlstra, Thomas Gleixner,
	H. Peter Anvin, Yinghai Lu
  Cc: linux-kernel, Ingo Molnar, Linus Torvalds


* Jes Sorensen <jes@sgi.com> wrote:

> Hi,
>
> I am seeing this one with the latest Linus' git tree as of this 
> morning on a Nehalem box. Using the defconfig + megaraid driver.
>
> Not sure if this is already fixed, or if someone already knows 
> what's wrong? Smells like yet another BIOS bug - yes, the BIOS on 
> this thing is rubbish.

my Nehalem (16 logical cpus) boots fine:

 aldebaran:~> uname -a
 Linux aldebaran 2.6.31-rc6-tip-01272-g9919e28-dirty #1518 SMP Fri 
 Aug 21 11:13:12 CEST 2009 x86_64 x86_64 x86_64 GNU/Linux

> [    6.664800] RIP: 0010:[<ffffffff810391e7>]  [<ffffffff810391e7>]  
> find_busiest_group+0x620/0x6fd 

Nothing similar is open at the moment.

There's only one open .31 scheduler regression bug at the moment: a 
rare division by zero bug that sometimes crashes boxes - the bigger 
the box the likelier the crash.
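
For reference, the computation at issue in that bug is the per-cpu
group shares update - schematically (a simplified sketch, not the
verbatim kernel code):

	/*
	 * sd_rq_weight is the summed runqueue weight of the sched
	 * domain; if a race lets the sum be zero while this cpu's own
	 * rq_weight still reads nonzero, the division below traps.
	 */
	shares = (sd_shares * rq_weight) / sd_rq_weight;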

Your crash looks to be one of:

 1) a genuine scheduler bug tickled on your new hardware. Needs to 
    be bisected/debugged/fixed.

 2) a BIOS bug passing crappy ACPI tables which cause us to create a
    buggy sched-domains tree or so. We do treat ACPI data as 
    external untrusted data and try to use it in sane ways only, but 
    such bugs have happened in the past and could happen again.

The scheduler has a sanity check for the sched-domains arch setup: if 
you enable CONFIG_SCHED_DEBUG=y then sched_domain_debug() will 
become noisy in your syslog if there's something wrong (but it won't 
stop the bootup, so you have to actively check your syslog).

Might be useful to see your full crashlog, if you are allowed to 
post that; your kernel .config would be useful to know too. 
It would also be useful to know whether this is a regression relative 
to .30 or an as-yet-unfixed bug triggering on your class of hardware.

Thanks,

	Ingo


* Re: Latest Linus tree oopses on Nehalem box
  2009-08-21 11:46 ` Ingo Molnar
@ 2009-08-21 11:58   ` Peter Zijlstra
  2009-08-21 14:42     ` [tip:sched/core] sched: Avoid division by zero tip-bot for Peter Zijlstra
  2009-08-21 13:04   ` Latest Linus tree oopses on Nehalem box Jes Sorensen
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2009-08-21 11:58 UTC
  To: Ingo Molnar
  Cc: Jes Sorensen, Jens Axboe, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, linux-kernel, Ingo Molnar, Linus Torvalds

On Fri, 2009-08-21 at 13:46 +0200, Ingo Molnar wrote:
> * Jes Sorensen <jes@sgi.com> wrote:
> 
> > Hi,
> >
> > I am seeing this one with the latest Linus' git tree as of this 
> > morning on a Nehalem box. Using the defconfig + megaraid driver.
> >
> > Not sure if this is already fixed, or if someone already knows 
> > what's wrong? Smells like yet another BIOS bug - yes, the BIOS on 
> > this thing is rubbish.
> 
> my Nehalem (16 logical cpus) boots fine:
> 
>  aldebaran:~> uname -a
>  Linux aldebaran 2.6.31-rc6-tip-01272-g9919e28-dirty #1518 SMP Fri 
>  Aug 21 11:13:12 CEST 2009 x86_64 x86_64 x86_64 GNU/Linux
> 
> > [    6.664800] RIP: 0010:[<ffffffff810391e7>]  [<ffffffff810391e7>]  
> > find_busiest_group+0x620/0x6fd 
> 
> Nothing similar is open at the moment.
> 
> There's only one open .31 scheduler regression bug at the moment: a 
> rare division by zero bug that sometimes crashes boxes - the bigger 
> the box the likelier the crash.

That's actually a -tip only regression caused by
a5004278f0525dcb9aa43703ef77bf371ea837cd.

I thought I had found the race that caused the /0 (the patch below),
but testing has proven me wrong. Still looking at that.
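
Schematically, the suspected race (an illustrative sketch with
made-up helper names, not the actual kernel code):

	unsigned long sum = 0;

	for_each_cpu(i, span)			/* first loop over the domain */
		sum += cfs_rq_weight(i);	/* every runqueue reads empty: sum == 0 */

	/* ... load balancing on another cpu now populates cpu k ... */

	for_each_cpu(i, span) {
		/*
		 * Second loop: cpu k re-reads a nonzero weight, so the
		 * NICE_0_LOAD fallback for all-empty domains is skipped
		 * and we divide by the stale zero sum from above.
		 */
		shares[i] = (tg_shares * cfs_rq_weight(i)) / sum;
	}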

---
Subject: sched: Avoid division by zero
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri Aug 07 21:53:17 CEST 2009

Patch a5004278f0525dcb9aa43703ef77bf371ea837cd (sched: Fix cgroup smp
fairness) introduced the possibility of a divide-by-zero because
load-balancing is not synchronized between sched_domains.

This can cause the state of cpus to change between the first and
second loop over the sched domain in tg_shares_up().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1522,7 +1522,8 @@ static void __set_se_shares(struct sched
  */
 static void
 update_group_shares_cpu(struct task_group *tg, int cpu,
-			unsigned long sd_shares, unsigned long sd_rq_weight)
+			unsigned long sd_shares, unsigned long sd_rq_weight,
+			unsigned long sd_eff_weight)
 {
 	unsigned long rq_weight;
 	unsigned long shares;
@@ -1535,13 +1536,15 @@ update_group_shares_cpu(struct task_grou
 	if (!rq_weight) {
 		boost = 1;
 		rq_weight = NICE_0_LOAD;
+		if (sd_rq_weight == sd_eff_weight)
+			sd_eff_weight += NICE_0_LOAD;
+		sd_rq_weight = sd_eff_weight;
 	}
 
 	/*
-	 *           \Sum shares * rq_weight
-	 * shares =  -----------------------
-	 *               \Sum rq_weight
-	 *
+	 *             \Sum_j shares_j * rq_weight_i
+	 * shares_i =  -----------------------------
+	 *                  \Sum_j rq_weight_j
 	 */
 	shares = (sd_shares * rq_weight) / sd_rq_weight;
 	shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
@@ -1593,14 +1596,8 @@ static int tg_shares_up(struct task_grou
 	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
 		shares = tg->shares;
 
-	for_each_cpu(i, sched_domain_span(sd)) {
-		unsigned long sd_rq_weight = rq_weight;
-
-		if (!tg->cfs_rq[i]->rq_weight)
-			sd_rq_weight = eff_weight;
-
-		update_group_shares_cpu(tg, i, shares, sd_rq_weight);
-	}
+	for_each_cpu(i, sched_domain_span(sd))
+		update_group_shares_cpu(tg, i, shares, rq_weight, eff_weight);
 
 	return 0;
 }




* Re: Latest Linus tree oopses on Nehalem box
  2009-08-21 11:46 ` Ingo Molnar
  2009-08-21 11:58   ` Peter Zijlstra
@ 2009-08-21 13:04   ` Jes Sorensen
  2009-08-21 13:26     ` Ingo Molnar
  1 sibling, 1 reply; 14+ messages in thread
From: Jes Sorensen @ 2009-08-21 13:04 UTC
  To: Ingo Molnar
  Cc: Jens Axboe, Peter Zijlstra, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, linux-kernel, Ingo Molnar, Linus Torvalds

On 08/21/2009 01:46 PM, Ingo Molnar wrote:
> Might be useful to see your full crashlog, if you are allowed to
> post that; your kernel .config would be useful to know too.
> It would also be useful to know whether this is a regression relative
> to .30 or an as-yet-unfixed bug triggering on your class of hardware.

Hi again,

It looks like this is either timing related or a false alarm :(

I saved the .config, reverted to an older commit, and then tried to go
back, and now the thing will suddenly boot.

I'll try to see if I can reproduce it; if I manage, I shall be happy to
post the .config and the full boot log. It's a Supermicro motherboard of
some sort:

Supermicro X8DTN+
AMIBIOS Core Ver:08.00.15

There is nothing special about it, but I know the BIOS is utter junk.

Sorry for the noise.

Cheers,
Jes


* Re: Latest Linus tree oopses on Nehalem box
  2009-08-21 13:04   ` Latest Linus tree oopses on Nehalem box Jes Sorensen
@ 2009-08-21 13:26     ` Ingo Molnar
  2009-08-21 13:35       ` Jes Sorensen
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2009-08-21 13:26 UTC
  To: Jes Sorensen
  Cc: Jens Axboe, Peter Zijlstra, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, linux-kernel, Ingo Molnar, Linus Torvalds


* Jes Sorensen <jes@sgi.com> wrote:

> On 08/21/2009 01:46 PM, Ingo Molnar wrote:
>> Might be useful to see your full crashlog, if you are allowed to
>> post that; your kernel .config would be useful to know too.
>> It would also be useful to know whether this is a regression relative
>> to .30 or an as-yet-unfixed bug triggering on your class of hardware.
>
> Hi again,
>
> It looks like this is either timing related or a false alarm :(
>
> I saved the .config, reverted to an older commit, and then tried to 
> go back, and now the thing will suddenly boot.
>
> I'll try to see if I can reproduce it; if I manage, I shall be 
> happy to post the .config and the full boot log. It's a Supermicro 
> motherboard of some sort:
>
> Supermicro X8DTN+
> AMIBIOS Core Ver:08.00.15
>
> There is nothing special about it, but I know the BIOS is utter junk.
>
> Sorry for the noise.

I'd say it's timing related and still unfixed - crashes that deep in 
the scheduler we'd sure know about, had we fixed such in any recent 
kernels. Do you still have the last 10 lines of the bootup leading 
up to the crash?

	Ingo


* Re: Latest Linus tree oopses on Nehalem box
  2009-08-21 13:26     ` Ingo Molnar
@ 2009-08-21 13:35       ` Jes Sorensen
  0 siblings, 0 replies; 14+ messages in thread
From: Jes Sorensen @ 2009-08-21 13:35 UTC
  To: Ingo Molnar
  Cc: Jens Axboe, Peter Zijlstra, Thomas Gleixner, H. Peter Anvin,
	Yinghai Lu, linux-kernel, Ingo Molnar, Linus Torvalds

On 08/21/2009 03:26 PM, Ingo Molnar wrote:
> I'd say it's timing related and still unfixed - crashes that deep in
> the scheduler we'd sure know about, had we fixed such in any recent
> kernels. Do you still have the last 10 lines of the bootup leading
> up to the crash?

Hi Ingo,

Sorry, no, I don't have that part anymore :-( Should have saved it.

Cheers,
Jes



* [tip:sched/core] sched: Avoid division by zero
  2009-08-21 11:58   ` Peter Zijlstra
@ 2009-08-21 14:42     ` tip-bot for Peter Zijlstra
  2009-08-25 19:11       ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: tip-bot for Peter Zijlstra @ 2009-08-21 14:42 UTC
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, torvalds, a.p.zijlstra,
	jens.axboe, jes, tglx, mingo

Commit-ID:  a8af7246c114bfd939e539f9566b872c06f6225c
Gitweb:     http://git.kernel.org/tip/a8af7246c114bfd939e539f9566b872c06f6225c
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Fri, 21 Aug 2009 13:58:54 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 21 Aug 2009 14:15:10 +0200

sched: Avoid division by zero

Patch a5004278f0525dcb9aa43703ef77bf371ea837cd (sched: Fix
cgroup smp fairness) introduced the possibility of a
divide-by-zero because load-balancing is not synchronized
between sched_domains.

This can cause the state of cpus to change between the first
and second loop over the sched domain in tg_shares_up().

Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1250855934.7538.30.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 kernel/sched.c |   23 ++++++++++-------------
 1 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1b529ef..8f8a98e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1522,7 +1522,8 @@ static void __set_se_shares(struct sched_entity *se, unsigned long shares);
  */
 static void
 update_group_shares_cpu(struct task_group *tg, int cpu,
-			unsigned long sd_shares, unsigned long sd_rq_weight)
+			unsigned long sd_shares, unsigned long sd_rq_weight,
+			unsigned long sd_eff_weight)
 {
 	unsigned long rq_weight;
 	unsigned long shares;
@@ -1535,13 +1536,15 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
 	if (!rq_weight) {
 		boost = 1;
 		rq_weight = NICE_0_LOAD;
+		if (sd_rq_weight == sd_eff_weight)
+			sd_eff_weight += NICE_0_LOAD;
+		sd_rq_weight = sd_eff_weight;
 	}
 
 	/*
-	 *           \Sum shares * rq_weight
-	 * shares =  -----------------------
-	 *               \Sum rq_weight
-	 *
+	 *             \Sum_j shares_j * rq_weight_i
+	 * shares_i =  -----------------------------
+	 *                  \Sum_j rq_weight_j
 	 */
 	shares = (sd_shares * rq_weight) / sd_rq_weight;
 	shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
@@ -1593,14 +1596,8 @@ static int tg_shares_up(struct task_group *tg, void *data)
 	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
 		shares = tg->shares;
 
-	for_each_cpu(i, sched_domain_span(sd)) {
-		unsigned long sd_rq_weight = rq_weight;
-
-		if (!tg->cfs_rq[i]->rq_weight)
-			sd_rq_weight = eff_weight;
-
-		update_group_shares_cpu(tg, i, shares, sd_rq_weight);
-	}
+	for_each_cpu(i, sched_domain_span(sd))
+		update_group_shares_cpu(tg, i, shares, rq_weight, eff_weight);
 
 	return 0;
 }


* Re: [tip:sched/core] sched: Avoid division by zero
  2009-08-21 14:42     ` [tip:sched/core] sched: Avoid division by zero tip-bot for Peter Zijlstra
@ 2009-08-25 19:11       ` Peter Zijlstra
  2009-08-26  9:16         ` Yinghai Lu
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2009-08-25 19:11 UTC
  To: mingo, hpa, linux-kernel, yinghai, torvalds, jes, jens.axboe,
	tglx, mingo, Balbir Singh, Arjan van de Ven
  Cc: linux-tip-commits


Yinghai, Balbir, Arjan,

Could you try the below to see if that fully does away with the /0 in
the group scheduler thing?

---
 kernel/sched.c |   53 +++++++++++++++++++++++++++++++++--------------------
 1 files changed, 33 insertions(+), 20 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0e76b17..45cebe0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1515,30 +1515,33 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
+struct update_shares_data {
+	spinlock_t lock;
+	unsigned long sum_weight;
+	unsigned long shares;
+	unsigned long rq_weight[NR_CPUS];
+};
+
+static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);
+
 static void __set_se_shares(struct sched_entity *se, unsigned long shares);
 
 /*
  * Calculate and set the cpu's group shares.
  */
-static void
-update_group_shares_cpu(struct task_group *tg, int cpu,
-			unsigned long sd_shares, unsigned long sd_rq_weight,
-			unsigned long sd_eff_weight)
+static void update_group_shares_cpu(struct task_group *tg,
+				    struct update_shares_data *usd, int cpu)
 {
-	unsigned long rq_weight;
-	unsigned long shares;
+	unsigned long shares, rq_weight;
 	int boost = 0;
 
 	if (!tg->se[cpu])
 		return;
 
-	rq_weight = tg->cfs_rq[cpu]->rq_weight;
+	rq_weight = usd->rq_weight[cpu];
 	if (!rq_weight) {
 		boost = 1;
 		rq_weight = NICE_0_LOAD;
-		if (sd_rq_weight == sd_eff_weight)
-			sd_eff_weight += NICE_0_LOAD;
-		sd_rq_weight = sd_eff_weight;
 	}
 
 	/*
@@ -1546,7 +1549,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
 	 * shares_i =  -----------------------------
 	 *                  \Sum_j rq_weight_j
 	 */
-	shares = (sd_shares * rq_weight) / sd_rq_weight;
+	shares = (usd->shares * rq_weight) / usd->sum_weight;
 	shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
 
 	if (abs(shares - tg->se[cpu]->load.weight) >
@@ -1555,6 +1558,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
 		unsigned long flags;
 
 		spin_lock_irqsave(&rq->lock, flags);
+		tg->cfs_rq[cpu]->rq_weight = boost ? 0 : rq_weight;
 		tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
 		__set_se_shares(tg->se[cpu], shares);
 		spin_unlock_irqrestore(&rq->lock, flags);
@@ -1568,36 +1572,44 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
  */
 static int tg_shares_up(struct task_group *tg, void *data)
 {
-	unsigned long weight, rq_weight = 0, eff_weight = 0;
-	unsigned long shares = 0;
+	struct update_shares_data *usd = &get_cpu_var(update_shares_data);
+	unsigned long weight, sum_weight = 0, shares = 0;
 	struct sched_domain *sd = data;
+	unsigned long flags;
 	int i;
 
+	spin_lock_irqsave(&usd->lock, flags);
+
 	for_each_cpu(i, sched_domain_span(sd)) {
+		weight = tg->cfs_rq[i]->load.weight;
+		usd->rq_weight[i] = weight;
+
 		/*
 		 * If there are currently no tasks on the cpu pretend there
 		 * is one of average load so that when a new task gets to
 		 * run here it will not get delayed by group starvation.
 		 */
-		weight = tg->cfs_rq[i]->load.weight;
-		tg->cfs_rq[i]->rq_weight = weight;
-		rq_weight += weight;
-
 		if (!weight)
 			weight = NICE_0_LOAD;
 
-		eff_weight += weight;
+		sum_weight += weight;
 		shares += tg->cfs_rq[i]->shares;
 	}
 
-	if ((!shares && rq_weight) || shares > tg->shares)
+	if ((!shares && sum_weight) || shares > tg->shares)
 		shares = tg->shares;
 
 	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
 		shares = tg->shares;
 
+	usd->sum_weight = sum_weight;
+	usd->shares = shares;
+
 	for_each_cpu(i, sched_domain_span(sd))
-		update_group_shares_cpu(tg, i, shares, rq_weight, eff_weight);
+		update_group_shares_cpu(tg, usd, i);
+
+	spin_unlock_irqrestore(&usd->lock, flags);
+	put_cpu_var(update_shares_data);
 
 	return 0;
 }
@@ -9449,6 +9461,7 @@ void __init sched_init(void)
 		init_cfs_rq(&rq->cfs, rq);
 		init_rt_rq(&rq->rt, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
+		spin_lock_init(&per_cpu(update_shares_data, i).lock);
 		init_task_group.shares = init_task_group_load;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
 #ifdef CONFIG_CGROUP_SCHED




* Re: [tip:sched/core] sched: Avoid division by zero
  2009-08-25 19:11       ` Peter Zijlstra
@ 2009-08-26  9:16         ` Yinghai Lu
  2009-08-26  9:25           ` Peter Zijlstra
  2009-08-27 11:08           ` [PATCH] sched: Avoid division by zero - really Peter Zijlstra
  0 siblings, 2 replies; 14+ messages in thread
From: Yinghai Lu @ 2009-08-26  9:16 UTC
  To: Peter Zijlstra
  Cc: mingo, hpa, linux-kernel, torvalds, jes, jens.axboe, tglx, mingo,
	Balbir Singh, Arjan van de Ven, linux-tip-commits

Peter Zijlstra wrote:
> Yinghai, Balbir, Arjan,
> 
> Could you try the below to see if that fully does away with the /0 in
> the group scheduler thing?

Yes, this one fixes the problem.

YH


* Re: [tip:sched/core] sched: Avoid division by zero
  2009-08-26  9:16         ` Yinghai Lu
@ 2009-08-26  9:25           ` Peter Zijlstra
  2009-08-27 11:08           ` [PATCH] sched: Avoid division by zero - really Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2009-08-26  9:25 UTC
  To: Yinghai Lu
  Cc: mingo, hpa, linux-kernel, torvalds, jes, jens.axboe, tglx, mingo,
	Balbir Singh, Arjan van de Ven, linux-tip-commits

On Wed, 2009-08-26 at 02:16 -0700, Yinghai Lu wrote:
> Peter Zijlstra wrote:
> > Yinghai, Balbir, Arjan,
> > 
> > Could you try the below to see if that fully does away with the /0 in
> > the group scheduler thing?
> 
> Yes, this one fixes the problem.

Awesome, I'll polish her up a bit and send it to Ingo.

Thanks!



* [PATCH] sched: Avoid division by zero - really
  2009-08-26  9:16         ` Yinghai Lu
  2009-08-26  9:25           ` Peter Zijlstra
@ 2009-08-27 11:08           ` Peter Zijlstra
  2009-08-27 12:19             ` Eric Dumazet
  2009-08-28  6:30             ` [tip:sched/core] sched: Fix " tip-bot for Peter Zijlstra
  1 sibling, 2 replies; 14+ messages in thread
From: Peter Zijlstra @ 2009-08-27 11:08 UTC
  To: Yinghai Lu
  Cc: mingo, hpa, linux-kernel, torvalds, jes, jens.axboe, tglx, mingo,
	Balbir Singh, Arjan van de Ven, linux-tip-commits

When re-computing the shares for each task group's cpu representation we
need the ratio of weight on each cpu vs the total weight of the sched
domain.

Since load-balancing is loosely (read not) synchronized, the weight of
individual cpus can change between doing the sum and calculating the
ratio.

The previous patch dealt with only one of the race scenarios, this patch
side steps them all by saving a snapshot of all the individual cpu
weights, thereby always working on a consistent set.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   50 +++++++++++++++++++++++++++++---------------------
 1 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0e76b17..4591054 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1515,30 +1515,29 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
+struct update_shares_data {
+	unsigned long rq_weight[NR_CPUS];
+};
+
+static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);
+
 static void __set_se_shares(struct sched_entity *se, unsigned long shares);
 
 /*
  * Calculate and set the cpu's group shares.
  */
-static void
-update_group_shares_cpu(struct task_group *tg, int cpu,
-			unsigned long sd_shares, unsigned long sd_rq_weight,
-			unsigned long sd_eff_weight)
+static void update_group_shares_cpu(struct task_group *tg, int cpu,
+				    unsigned long sd_shares,
+				    unsigned long sd_rq_weight,
+				    struct update_shares_data *usd)
 {
-	unsigned long rq_weight;
-	unsigned long shares;
+	unsigned long shares, rq_weight;
 	int boost = 0;
 
-	if (!tg->se[cpu])
-		return;
-
-	rq_weight = tg->cfs_rq[cpu]->rq_weight;
+	rq_weight = usd->rq_weight[cpu];
 	if (!rq_weight) {
 		boost = 1;
 		rq_weight = NICE_0_LOAD;
-		if (sd_rq_weight == sd_eff_weight)
-			sd_eff_weight += NICE_0_LOAD;
-		sd_rq_weight = sd_eff_weight;
 	}
 
 	/*
@@ -1555,6 +1554,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
 		unsigned long flags;
 
 		spin_lock_irqsave(&rq->lock, flags);
+		tg->cfs_rq[cpu]->rq_weight = boost ? 0 : rq_weight;
 		tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
 		__set_se_shares(tg->se[cpu], shares);
 		spin_unlock_irqrestore(&rq->lock, flags);
@@ -1568,25 +1568,31 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
  */
 static int tg_shares_up(struct task_group *tg, void *data)
 {
-	unsigned long weight, rq_weight = 0, eff_weight = 0;
-	unsigned long shares = 0;
+	unsigned long weight, rq_weight = 0, shares = 0;
+	struct update_shares_data *usd;
 	struct sched_domain *sd = data;
+	unsigned long flags;
 	int i;
 
+	if (!tg->se[0])
+		return 0;
+
+	local_irq_save(flags);
+	usd = &__get_cpu_var(update_shares_data);
+
 	for_each_cpu(i, sched_domain_span(sd)) {
+		weight = tg->cfs_rq[i]->load.weight;
+		usd->rq_weight[i] = weight;
+
 		/*
 		 * If there are currently no tasks on the cpu pretend there
 		 * is one of average load so that when a new task gets to
 		 * run here it will not get delayed by group starvation.
 		 */
-		weight = tg->cfs_rq[i]->load.weight;
-		tg->cfs_rq[i]->rq_weight = weight;
-		rq_weight += weight;
-
 		if (!weight)
 			weight = NICE_0_LOAD;
 
-		eff_weight += weight;
+		rq_weight += weight;
 		shares += tg->cfs_rq[i]->shares;
 	}
 
@@ -1597,7 +1603,9 @@ static int tg_shares_up(struct task_group *tg, void *data)
 		shares = tg->shares;
 
 	for_each_cpu(i, sched_domain_span(sd))
-		update_group_shares_cpu(tg, i, shares, rq_weight, eff_weight);
+		update_group_shares_cpu(tg, i, shares, rq_weight, usd);
+
+	local_irq_restore(flags);
 
 	return 0;
 }




* Re: [PATCH] sched: Avoid division by zero - really
  2009-08-27 11:08           ` [PATCH] sched: Avoid division by zero - really Peter Zijlstra
@ 2009-08-27 12:19             ` Eric Dumazet
  2009-08-27 12:32               ` Peter Zijlstra
  2009-08-28  6:30             ` [tip:sched/core] sched: Fix " tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2009-08-27 12:19 UTC
  To: Peter Zijlstra
  Cc: Yinghai Lu, mingo, hpa, linux-kernel, torvalds, jes, jens.axboe,
	tglx, mingo, Balbir Singh, Arjan van de Ven, linux-tip-commits

Peter Zijlstra wrote:
> When re-computing the shares for each task group's cpu representation we
> need the ratio of weight on each cpu vs the total weight of the sched
> domain.
> 
> Since load-balancing is loosely (read not) synchronized, the weight of
> individual cpus can change between doing the sum and calculating the
> ratio.
> 
> The previous patch dealt with only one of the race scenarios, this patch
> side steps them all by saving a snapshot of all the individual cpu
> weights, thereby always working on a consistent set.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c |   50 +++++++++++++++++++++++++++++---------------------
>  1 files changed, 29 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 0e76b17..4591054 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1515,30 +1515,29 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  
> +struct update_shares_data {
> +	unsigned long rq_weight[NR_CPUS];
> +};
> +
> +static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);

Ouch... that's quite large IMHO, up to 4096*8 = 32768 bytes per cpu...

Now that we have nice dynamic per-cpu allocations, we could use one here,
and use nr_cpus instead of NR_CPUS as the array size?
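
Something along these lines, perhaps - an untested sketch (it assumes
the dynamic per-cpu allocator, __alloc_percpu(), plus nr_cpu_ids, and
the helper name is made up):

	/* one nr_cpu_ids-sized snapshot array per cpu, sized at boot
	 * instead of a static [NR_CPUS] array in percpu data */
	static unsigned long *update_shares_data;

	static void __init alloc_update_shares_data(void)
	{
		update_shares_data =
			__alloc_percpu(nr_cpu_ids * sizeof(unsigned long),
				       __alignof__(unsigned long));
	}

	/* tg_shares_up() would then snapshot into, e.g.:
	 *   unsigned long *usd_rq_weight =
	 *	per_cpu_ptr(update_shares_data, smp_processor_id());
	 */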



* Re: [PATCH] sched: Avoid division by zero - really
  2009-08-27 12:19             ` Eric Dumazet
@ 2009-08-27 12:32               ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2009-08-27 12:32 UTC
  To: Eric Dumazet
  Cc: Yinghai Lu, mingo, hpa, linux-kernel, torvalds, jes, jens.axboe,
	tglx, mingo, Balbir Singh, Arjan van de Ven, linux-tip-commits

On Thu, 2009-08-27 at 14:19 +0200, Eric Dumazet wrote:
> Peter Zijlstra wrote:
> > When re-computing the shares for each task group's cpu representation we
> > need the ratio of weight on each cpu vs the total weight of the sched
> > domain.
> > 
> > Since load-balancing is loosely (read not) synchronized, the weight of
> > individual cpus can change between doing the sum and calculating the
> > ratio.
> > 
> > The previous patch dealt with only one of the race scenarios, this patch
> > side steps them all by saving a snapshot of all the individual cpu
> > weights, thereby always working on a consistent set.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  kernel/sched.c |   50 +++++++++++++++++++++++++++++---------------------
> >  1 files changed, 29 insertions(+), 21 deletions(-)
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 0e76b17..4591054 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -1515,30 +1515,29 @@ static unsigned long cpu_avg_load_per_task(int cpu)
> >  
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >  
> > +struct update_shares_data {
> > +	unsigned long rq_weight[NR_CPUS];
> > +};
> > +
> > +static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);
> 
> Ouch... that's quite large IMHO, up to 4096*8 = 32768 bytes per cpu...
> 
> Now that we have nice dynamic per-cpu allocations, we could use one here,
> and use nr_cpus instead of NR_CPUS as the array size?

Possibly, but I guess that should include stuff like
static_sched_{domain,group} too, since they seem to have the same
problem.
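
(Roughly what those look like, for reference - an approximation, not
verbatim kernel source:

	struct static_sched_group {
		struct sched_group sg;
		DECLARE_BITMAP(cpus, CONFIG_NR_CPUS);
	};

	struct static_sched_domain {
		struct sched_domain sd;
		DECLARE_BITMAP(span, CONFIG_NR_CPUS);
	};

	static DEFINE_PER_CPU(struct static_sched_domain, cpu_domains);

i.e. more per-cpu objects whose size scales with the compile-time
NR_CPUS rather than with the number of cpus actually present.)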



* [tip:sched/core] sched: Fix division by zero - really
  2009-08-27 11:08           ` [PATCH] sched: Avoid division by zero - really Peter Zijlstra
  2009-08-27 12:19             ` Eric Dumazet
@ 2009-08-28  6:30             ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 14+ messages in thread
From: tip-bot for Peter Zijlstra @ 2009-08-28  6:30 UTC
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, yinghai, a.p.zijlstra, balbir, arjan,
	tglx, mingo

Commit-ID:  34d76c41554a05425613d16efebb3069c4c545f0
Gitweb:     http://git.kernel.org/tip/34d76c41554a05425613d16efebb3069c4c545f0
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Thu, 27 Aug 2009 13:08:56 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 28 Aug 2009 08:26:49 +0200

sched: Fix division by zero - really

When re-computing the shares for each task group's cpu
representation we need the ratio of weight on each cpu vs the
total weight of the sched domain.

Since load-balancing is loosely (read not) synchronized, the
weight of individual cpus can change between doing the sum and
calculating the ratio.

The previous patch dealt with only one of the race scenarios,
this patch side steps them all by saving a snapshot of all the
individual cpu weights, thereby always working on a consistent
set.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: torvalds@linux-foundation.org
Cc: jes@sgi.com
Cc: jens.axboe@oracle.com
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1251371336.18584.77.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 kernel/sched.c |   50 +++++++++++++++++++++++++++++---------------------
 1 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8f8a98e..523e20a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1515,30 +1515,29 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
+struct update_shares_data {
+	unsigned long rq_weight[NR_CPUS];
+};
+
+static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);
+
 static void __set_se_shares(struct sched_entity *se, unsigned long shares);
 
 /*
  * Calculate and set the cpu's group shares.
  */
-static void
-update_group_shares_cpu(struct task_group *tg, int cpu,
-			unsigned long sd_shares, unsigned long sd_rq_weight,
-			unsigned long sd_eff_weight)
+static void update_group_shares_cpu(struct task_group *tg, int cpu,
+				    unsigned long sd_shares,
+				    unsigned long sd_rq_weight,
+				    struct update_shares_data *usd)
 {
-	unsigned long rq_weight;
-	unsigned long shares;
+	unsigned long shares, rq_weight;
 	int boost = 0;
 
-	if (!tg->se[cpu])
-		return;
-
-	rq_weight = tg->cfs_rq[cpu]->rq_weight;
+	rq_weight = usd->rq_weight[cpu];
 	if (!rq_weight) {
 		boost = 1;
 		rq_weight = NICE_0_LOAD;
-		if (sd_rq_weight == sd_eff_weight)
-			sd_eff_weight += NICE_0_LOAD;
-		sd_rq_weight = sd_eff_weight;
 	}
 
 	/*
@@ -1555,6 +1554,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
 		unsigned long flags;
 
 		spin_lock_irqsave(&rq->lock, flags);
+		tg->cfs_rq[cpu]->rq_weight = boost ? 0 : rq_weight;
 		tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
 		__set_se_shares(tg->se[cpu], shares);
 		spin_unlock_irqrestore(&rq->lock, flags);
@@ -1568,25 +1568,31 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
  */
 static int tg_shares_up(struct task_group *tg, void *data)
 {
-	unsigned long weight, rq_weight = 0, eff_weight = 0;
-	unsigned long shares = 0;
+	unsigned long weight, rq_weight = 0, shares = 0;
+	struct update_shares_data *usd;
 	struct sched_domain *sd = data;
+	unsigned long flags;
 	int i;
 
+	if (!tg->se[0])
+		return 0;
+
+	local_irq_save(flags);
+	usd = &__get_cpu_var(update_shares_data);
+
 	for_each_cpu(i, sched_domain_span(sd)) {
+		weight = tg->cfs_rq[i]->load.weight;
+		usd->rq_weight[i] = weight;
+
 		/*
 		 * If there are currently no tasks on the cpu pretend there
 		 * is one of average load so that when a new task gets to
 		 * run here it will not get delayed by group starvation.
 		 */
-		weight = tg->cfs_rq[i]->load.weight;
-		tg->cfs_rq[i]->rq_weight = weight;
-		rq_weight += weight;
-
 		if (!weight)
 			weight = NICE_0_LOAD;
 
-		eff_weight += weight;
+		rq_weight += weight;
 		shares += tg->cfs_rq[i]->shares;
 	}
 
@@ -1597,7 +1603,9 @@ static int tg_shares_up(struct task_group *tg, void *data)
 		shares = tg->shares;
 
 	for_each_cpu(i, sched_domain_span(sd))
-		update_group_shares_cpu(tg, i, shares, rq_weight, eff_weight);
+		update_group_shares_cpu(tg, i, shares, rq_weight, usd);
+
+	local_irq_restore(flags);
 
 	return 0;
 }


end of thread

Thread overview: 14+ messages
2009-08-21 10:53 Latest Linus tree oopses on Nehalem box Jes Sorensen
2009-08-21 11:46 ` Ingo Molnar
2009-08-21 11:58   ` Peter Zijlstra
2009-08-21 14:42     ` [tip:sched/core] sched: Avoid division by zero tip-bot for Peter Zijlstra
2009-08-25 19:11       ` Peter Zijlstra
2009-08-26  9:16         ` Yinghai Lu
2009-08-26  9:25           ` Peter Zijlstra
2009-08-27 11:08           ` [PATCH] sched: Avoid division by zero - really Peter Zijlstra
2009-08-27 12:19             ` Eric Dumazet
2009-08-27 12:32               ` Peter Zijlstra
2009-08-28  6:30             ` [tip:sched/core] sched: Fix " tip-bot for Peter Zijlstra
2009-08-21 13:04   ` Latest Linus tree oopses on Nehalem box Jes Sorensen
2009-08-21 13:26     ` Ingo Molnar
2009-08-21 13:35       ` Jes Sorensen
