All of lore.kernel.org
 help / color / mirror / Atom feed
* kerneloops.org report for the week of June 14 2009
@ 2009-06-15  0:33 Arjan van de Ven
  2009-06-15  5:53   ` Takashi Iwai
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Arjan van de Ven @ 2009-06-15  0:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: torvalds, akpm, tglx, alan, rjw, netdev, mingo, dwmw2,
	alsa-devel, dri-devel

Hi,

First, a word of thanks to the kernel.org team for helping to host the kerneloops.org database
on a real machine rather than a small virtualized host. Performance and continuity will be a lot
better in the new setup! 
(as a result of the transition and hosting issues on the original server, there was a downtime of 
two days, meaning that the total number of reports is lower than typical)


New sightings this report include the XFI driver, the i915 gem code and the DMAR code.


This week, a total of 4026 oopses and warnings have been reported
for kernels 2.6.29 and later. Reports for kernels earlier than 2.6.29
have been omitted from this report.

Per file statistics
830	drivers/gpu/drm/i915/i915_gem_tiling.c
470	drivers/net/wireless/iwlwifi/iwl-sta.c
303	kernel/time.c
276	drivers/parport/procfs.c
269	include/linux/slob_def.h
187	include/linux/hrtimer.h
138	arch/x86/kernel/cpu/mtrr/generic.c
120	drivers/pci/dmar.c
65	net/mac80211/rc80211_minstrel.c
62	drivers/net/r8169.c
37	net/ipv4/tcp.c
34	fs/cifs/cifssmb.c


Rank 1: i915_gem_set_tiling (warning)
	Reported 830 times (3700 total reports)
	[gem] Failure in the tiling code
	This warning was last seen in version 2.6.30, and first seen in 2.6.29-rc2.
	More info: http://www.kerneloops.org/searchweek.php?search=i915_gem_set_tiling

Rank 2: iwl_set_dynamic_key (warning)
	Reported 470 times (13682 total reports)
	"no space for new kew" message.
	This warning was last seen in version 2.6.30, and first seen in 2.6.27.
	More info: http://www.kerneloops.org/searchweek.php?search=iwl_set_dynamic_key

Rank 3: getnstimeofday (warning)
	Reported 309 times (2446 total reports)
	[suspend resume] getnstimeofday() is called before timekeeping is resumed
	This warning was last seen in version 2.6.29.4, and first seen in 2.6.24.
	More info: http://www.kerneloops.org/searchweek.php?search=getnstimeofday

Rank 4: ct_vm_map (warning) - sleeping in atomic context
	Reported 285 times (960 total reports)
	Bug in the creative XFI driver (backported to Fedora)
	This warning was last seen in version 2.6.29.4, and first seen in 2.6.29-rc4-git1.
	More info: http://www.kerneloops.org/searchweek.php?search=ct_vm_map

Rank 5: parport_device_proc_unregister (warning)
	Reported 278 times (1810 total reports)
        Alan has a fix for this on LKML
	This warning was last seen in version 2.6.29.4, and first seen in 2.6.27.
	More info: http://www.kerneloops.org/searchweek.php?search=parport_device_proc_unregister

Rank 6: hres_timers_resume (warning)
	Reported 188 times (1024 total reports)
	[suspend resume] hres_timers_resume() is incorrectly called with interrupts on
	This warning was last seen in version 2.6.30-rc6, and first seen in 2.6.24.7.
	More info: http://www.kerneloops.org/searchweek.php?search=hres_timers_resume

Rank 7: generic_get_mtrr (warning)
	Reported 139 times (1009 total reports)
	BIOS bug where the MTRRs are not set up correctly
	This warning was last seen in version 2.6.30, and first seen in 2.6.25.3.
	More info: http://www.kerneloops.org/searchweek.php?search=generic_get_mtrr

Rank 8: dmar_table_init (warning)
	Reported 121 times (507 total reports)
	BIOS bug exposed via a WARN_ON
	This warning was last seen in version 2.6.30-rc8, and first seen in 2.6.29.
	More info: http://www.kerneloops.org/searchweek.php?search=dmar_table_init

Rank 9: minstrel_get_rate (warning)
	Reported 64 times (384 total reports)
	Issue with the ath5k driver
	This warning was last seen in version 2.6.30-rc3, and first seen in 2.6.27.12.
	More info: http://www.kerneloops.org/searchweek.php?search=minstrel_get_rate

Rank 10: dev_watchdog(r8169) (warning)
	Reported 62 times (9116 total reports)
	This warning was last seen in version 2.6.30-rc7, and first seen in 2.6.26.6.
	More info: http://www.kerneloops.org/searchweek.php?search=dev_watchdog(r8169)

Rank 11: nv_tx_done_optimized (warning)
	Reported 61 times (2974 total reports)
	the NV ethernet driver is getting caught with the new DMA API checks
	This warning was last seen in version 2.6.30, and first seen in 2.6.29-rc3-git7.
	More info: http://www.kerneloops.org/searchweek.php?search=nv_tx_done_optimized

Rank 12: tcp_recvmsg (warning)
	Reported 37 times (3797 total reports)
	fixed in 2.6.30 (commit 775273131810caa41dfc7f9e552ea5d8508caf40)
	This warning was last seen in version 2.6.30, and first seen in 2.6.25.
	More info: http://www.kerneloops.org/searchweek.php?search=tcp_recvmsg


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [alsa-devel] kerneloops.org report for the week of June 14 2009
  2009-06-15  0:33 kerneloops.org report for the week of June 14 2009 Arjan van de Ven
@ 2009-06-15  5:53   ` Takashi Iwai
  2009-06-15  7:06 ` Stefan Schmidt
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Takashi Iwai @ 2009-06-15  5:53 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, alsa-devel, dri-devel, tglx, netdev, rjw, akpm,
	dwmw2, torvalds, mingo, alan

At Sun, 14 Jun 2009 17:33:31 -0700,
Arjan van de Ven wrote:
> 
> Rank 4: ct_vm_map (warning) - sleeping in atomic context
> 	Reported 285 times (960 total reports)
> 	Bug in the creative XFI driver (backported to Fedora)
> 	This warning was last seen in version 2.6.29.4, and first seen in 2.6.29-rc4-git1.
> 	More info: http://www.kerneloops.org/searchweek.php?search=ct_vm_map

This should have been fixed in the latest code, at least, the one
in Linus tree should be OK.


thanks,

Takashi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
@ 2009-06-15  5:53   ` Takashi Iwai
  0 siblings, 0 replies; 15+ messages in thread
From: Takashi Iwai @ 2009-06-15  5:53 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: alsa-devel, tglx, netdev, linux-kernel, rjw, torvalds, akpm,
	dri-devel, dwmw2, mingo, alan

At Sun, 14 Jun 2009 17:33:31 -0700,
Arjan van de Ven wrote:
> 
> Rank 4: ct_vm_map (warning) - sleeping in atomic context
> 	Reported 285 times (960 total reports)
> 	Bug in the creative XFI driver (backported to Fedora)
> 	This warning was last seen in version 2.6.29.4, and first seen in 2.6.29-rc4-git1.
> 	More info: http://www.kerneloops.org/searchweek.php?search=ct_vm_map

This should have been fixed in the latest code, at least, the one
in Linus tree should be OK.


thanks,

Takashi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
  2009-06-15  0:33 kerneloops.org report for the week of June 14 2009 Arjan van de Ven
  2009-06-15  5:53   ` Takashi Iwai
@ 2009-06-15  7:06 ` Stefan Schmidt
  2009-06-16  7:11     ` Stefan Schmidt
  2009-06-18 17:08 ` kerneloops.org report for the week of June 14 2009 Bob Copeland
  2009-06-23 10:05 ` Thomas Gleixner
  3 siblings, 1 reply; 15+ messages in thread
From: Stefan Schmidt @ 2009-06-15  7:06 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, torvalds, akpm, tglx, alan, rjw, netdev, mingo,
	dwmw2, alsa-devel, dri-devel

Hello.

On Sun, 2009-06-14 at 17:33, Arjan van de Ven wrote:
> 
> Rank 1: i915_gem_set_tiling (warning)
> 	Reported 830 times (3700 total reports)
> 	[gem] Failure in the tiling code
> 	This warning was last seen in version 2.6.30, and first seen in 2.6.29-rc2.
> 	More info: http://www.kerneloops.org/searchweek.php?search=i915_gem_set_tiling

There seem to be more then one issue. I'm running into one of them here on a
Lenovo Thinkpad X200s. It's 2.6.30 and I enabled KMS the first time here. So I
can't tell if it is a regression.

http://www.kerneloops.org/oops.php?number=207353

The result is that the EDID data could not be read and the wrong screen size
set:
[   45.524589] i2c-adapter i2c-3: unable to read EDID block.
[   45.588200] i915 0000:00:02.0: DVI-D-2: no EDID data
[   45.777251] [drm] TMDS-14: set mode 1360x768 13
[   45.777261] [drm] TMDS-16: set mode 1360x768 13 <= Should be 1280x900 here

Somehow the driver tries to use DVI1 and DVI2 while the display is connected to
LVDS1.

stefan@excalibur:~$ xrandr
Screen 0: minimum 320 x 200, current 1360 x 800, maximum 8192 x 8192
VGA1 disconnected (normal left inverted right x axis y axis)
LVDS1 connected 1280x800+0+0 (normal left inverted right x axis y axis) 261mm x
163mm
   1280x800       60.0*+   50.0  
DVI1 connected 1360x768+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
   1360x768       59.8* 
   1152x864       60.0  
   1024x768       60.0  
   800x600        60.3  
   640x480        59.9  
DVI2 connected 1360x768+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
   1360x768       59.8* 
   1152x864       60.0  
   1024x768       60.0  
   800x600        60.3  
   640x480        59.9

regards
Stefan Schmidt

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: WARN_ON's in i915_gem_tiling.c
  2009-06-15  7:06 ` Stefan Schmidt
@ 2009-06-16  7:11     ` Stefan Schmidt
  0 siblings, 0 replies; 15+ messages in thread
From: Stefan Schmidt @ 2009-06-16  7:11 UTC (permalink / raw)
  To: Stefan Schmidt
  Cc: Arjan van de Ven, linux-kernel, torvalds, akpm, tglx, alan, rjw,
	netdev, mingo, dwmw2, alsa-devel, dri-devel, eric

Hello.

Bringing Eric into the loop here as he is listed as author of the
i915_gem_tiling.c

On Mon, 2009-06-15 at 09:06, Stefan Schmidt wrote:
> 
> On Sun, 2009-06-14 at 17:33, Arjan van de Ven wrote:
> > 
> > Rank 1: i915_gem_set_tiling (warning)
> > 	Reported 830 times (3700 total reports)
> > 	[gem] Failure in the tiling code
> > 	This warning was last seen in version 2.6.30, and first seen in 2.6.29-rc2.
> > 	More info: http://www.kerneloops.org/searchweek.php?search=i915_gem_set_tiling
> 
> There seem to be more then one issue. I'm running into one of them here on a
> Lenovo Thinkpad X200s. It's 2.6.30 and I enabled KMS the first time here. So I
> can't tell if it is a regression.
> 
> http://www.kerneloops.org/oops.php?number=207353

http://www.kerneloops.org/raw.php?rawid=453362&msgid=

and

http://www.kerneloops.org/oops.php?number=453361

Are looking similar to #207353 to me. That's the one I can easily reproduce by
booting into an KMS enabled kernel. Without modsetting eveything is fine. Let me
know if there are patches I should give a try.

> The result is that the EDID data could not be read and the wrong screen size
> set:
> [   45.524589] i2c-adapter i2c-3: unable to read EDID block.
> [   45.588200] i915 0000:00:02.0: DVI-D-2: no EDID data
> [   45.777251] [drm] TMDS-14: set mode 1360x768 13
> [   45.777261] [drm] TMDS-16: set mode 1360x768 13 <= Should be 1280x900 here
> 
> Somehow the driver tries to use DVI1 and DVI2 while the display is connected to
> LVDS1.
> 
> stefan@excalibur:~$ xrandr
> Screen 0: minimum 320 x 200, current 1360 x 800, maximum 8192 x 8192
> VGA1 disconnected (normal left inverted right x axis y axis)
> LVDS1 connected 1280x800+0+0 (normal left inverted right x axis y axis) 261mm x
> 163mm
>    1280x800       60.0*+   50.0  
> DVI1 connected 1360x768+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
>    1360x768       59.8* 
>    1152x864       60.0  
>    1024x768       60.0  
>    800x600        60.3  
>    640x480        59.9  
> DVI2 connected 1360x768+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
>    1360x768       59.8* 
>    1152x864       60.0  
>    1024x768       60.0  
>    800x600        60.3  
>    640x480        59.9

regards
Stefan Schmidt

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: WARN_ON's in i915_gem_tiling.c
@ 2009-06-16  7:11     ` Stefan Schmidt
  0 siblings, 0 replies; 15+ messages in thread
From: Stefan Schmidt @ 2009-06-16  7:11 UTC (permalink / raw)
  To: Stefan Schmidt
  Cc: alsa-devel, dri-devel, tglx, netdev, linux-kernel, rjw, eric,
	akpm, dwmw2, torvalds, mingo, alan, Arjan van de Ven

Hello.

Bringing Eric into the loop here as he is listed as author of the
i915_gem_tiling.c

On Mon, 2009-06-15 at 09:06, Stefan Schmidt wrote:
> 
> On Sun, 2009-06-14 at 17:33, Arjan van de Ven wrote:
> > 
> > Rank 1: i915_gem_set_tiling (warning)
> > 	Reported 830 times (3700 total reports)
> > 	[gem] Failure in the tiling code
> > 	This warning was last seen in version 2.6.30, and first seen in 2.6.29-rc2.
> > 	More info: http://www.kerneloops.org/searchweek.php?search=i915_gem_set_tiling
> 
> There seem to be more then one issue. I'm running into one of them here on a
> Lenovo Thinkpad X200s. It's 2.6.30 and I enabled KMS the first time here. So I
> can't tell if it is a regression.
> 
> http://www.kerneloops.org/oops.php?number=207353

http://www.kerneloops.org/raw.php?rawid=453362&msgid=

and

http://www.kerneloops.org/oops.php?number=453361

Are looking similar to #207353 to me. That's the one I can easily reproduce by
booting into an KMS enabled kernel. Without modsetting eveything is fine. Let me
know if there are patches I should give a try.

> The result is that the EDID data could not be read and the wrong screen size
> set:
> [   45.524589] i2c-adapter i2c-3: unable to read EDID block.
> [   45.588200] i915 0000:00:02.0: DVI-D-2: no EDID data
> [   45.777251] [drm] TMDS-14: set mode 1360x768 13
> [   45.777261] [drm] TMDS-16: set mode 1360x768 13 <= Should be 1280x900 here
> 
> Somehow the driver tries to use DVI1 and DVI2 while the display is connected to
> LVDS1.
> 
> stefan@excalibur:~$ xrandr
> Screen 0: minimum 320 x 200, current 1360 x 800, maximum 8192 x 8192
> VGA1 disconnected (normal left inverted right x axis y axis)
> LVDS1 connected 1280x800+0+0 (normal left inverted right x axis y axis) 261mm x
> 163mm
>    1280x800       60.0*+   50.0  
> DVI1 connected 1360x768+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
>    1360x768       59.8* 
>    1152x864       60.0  
>    1024x768       60.0  
>    800x600        60.3  
>    640x480        59.9  
> DVI2 connected 1360x768+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
>    1360x768       59.8* 
>    1152x864       60.0  
>    1024x768       60.0  
>    800x600        60.3  
>    640x480        59.9

regards
Stefan Schmidt

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
  2009-06-15  0:33 kerneloops.org report for the week of June 14 2009 Arjan van de Ven
  2009-06-15  5:53   ` Takashi Iwai
  2009-06-15  7:06 ` Stefan Schmidt
@ 2009-06-18 17:08 ` Bob Copeland
  2009-06-19  5:32   ` Arjan van de Ven
  2009-06-23 10:05 ` Thomas Gleixner
  3 siblings, 1 reply; 15+ messages in thread
From: Bob Copeland @ 2009-06-18 17:08 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, torvalds, akpm, tglx, alan, rjw, netdev, mingo,
	dwmw2, alsa-devel, dri-devel

On Sun, Jun 14, 2009 at 8:33 PM, Arjan van de Ven<arjan@infradead.org> wrote:
> Rank 9: minstrel_get_rate (warning)
>        Reported 64 times (384 total reports)
>        Issue with the ath5k driver
>        This warning was last seen in version 2.6.30-rc3, and first seen in 2.6.27.12.
>        More info: http://www.kerneloops.org/searchweek.php?search=minstrel_get_rate

I believe this is fixed by http://lkml.org/lkml/2009/6/5/269, now
in pre-2.6.31.  Hmm, odd though, we should be seeing reports from
other wireless drivers since it's a general memory corruption
unrelated to ath5k (but not all use the mac80211 rate controllers).

(2.6.27.12 must be a backport, minstrel didn't exist until .29)

-- 
Bob Copeland %% www.bobcopeland.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
  2009-06-18 17:08 ` kerneloops.org report for the week of June 14 2009 Bob Copeland
@ 2009-06-19  5:32   ` Arjan van de Ven
  2009-06-28  9:11     ` Bob Copeland
  0 siblings, 1 reply; 15+ messages in thread
From: Arjan van de Ven @ 2009-06-19  5:32 UTC (permalink / raw)
  To: Bob Copeland
  Cc: linux-kernel, torvalds, akpm, tglx, alan, rjw, netdev, mingo,
	dwmw2, alsa-devel, dri-devel

On Thu, 18 Jun 2009 13:08:39 -0400
Bob Copeland <me@bobcopeland.com> wrote:

> On Sun, Jun 14, 2009 at 8:33 PM, Arjan van de
> Ven<arjan@infradead.org> wrote:
> > Rank 9: minstrel_get_rate (warning)
> >        Reported 64 times (384 total reports)
> >        Issue with the ath5k driver
> >        This warning was last seen in version 2.6.30-rc3, and first
> > seen in 2.6.27.12. More info:
> > http://www.kerneloops.org/searchweek.php?search=minstrel_get_rate
> 
> I believe this is fixed by http://lkml.org/lkml/2009/6/5/269, now
> in pre-2.6.31.  Hmm, odd though, we should be seeing reports from
> other wireless drivers since it's a general memory corruption
> unrelated to ath5k (but not all use the mac80211 rate controllers).
> 

well.. out of the 534 reports or so that I have by now... 525 have
ath5k loaded.....

it's not 100% but it'd pretty close.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
  2009-06-15  0:33 kerneloops.org report for the week of June 14 2009 Arjan van de Ven
                   ` (2 preceding siblings ...)
  2009-06-18 17:08 ` kerneloops.org report for the week of June 14 2009 Bob Copeland
@ 2009-06-23 10:05 ` Thomas Gleixner
  2009-06-23 11:55   ` Ingo Molnar
  2009-06-25 13:03   ` Pádraig Brady
  3 siblings, 2 replies; 15+ messages in thread
From: Thomas Gleixner @ 2009-06-23 10:05 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: LKML, Linus Torvalds, Andrew Morton, Rafael J. Wysocki, Ingo Molnar

On Sun, 14 Jun 2009, Arjan van de Ven wrote:
> Rank 3: getnstimeofday (warning)
> 	Reported 309 times (2446 total reports)
> 	[suspend resume] getnstimeofday() is called before timekeeping is
        resumed

> Rank 6: hres_timers_resume (warning)
> 	Reported 188 times (1024 total reports)
> 	[suspend resume] hres_timers_resume() is incorrectly called with
>       interrupts on

Both have the same root cause. Something enables interrupts in the
early resume path. IIRC, there was a culprit identified
recently. Rafael ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
  2009-06-23 10:05 ` Thomas Gleixner
@ 2009-06-23 11:55   ` Ingo Molnar
  2009-06-23 14:50     ` Arjan van de Ven
  2009-06-23 16:56     ` Rafael J. Wysocki
  2009-06-25 13:03   ` Pádraig Brady
  1 sibling, 2 replies; 15+ messages in thread
From: Ingo Molnar @ 2009-06-23 11:55 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Arjan van de Ven, LKML, Linus Torvalds, Andrew Morton, Rafael J. Wysocki


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Sun, 14 Jun 2009, Arjan van de Ven wrote:
> > Rank 3: getnstimeofday (warning)
> > 	Reported 309 times (2446 total reports)
> > 	[suspend resume] getnstimeofday() is called before timekeeping is
>         resumed
> 
> > Rank 6: hres_timers_resume (warning)
> > 	Reported 188 times (1024 total reports)
> > 	[suspend resume] hres_timers_resume() is incorrectly called with
> >       interrupts on
> 
> Both have the same root cause. Something enables interrupts in the 
> early resume path. IIRC, there was a culprit identified recently. 
> Rafael ?

This can be debugged automatically today, using lockdep, by using a 
'helper lock':

  static DEFINE_PER_CPU(struct lockdep_map, helper_lock);

Then mark the lock irq-safe by doing something like:

static void mark_lock_irqsafe(void)
{
	unsigned long flags;
	int cpu;

	local_irq_save(flags);
	irq_enter(0);

	for_each_online_cpu(cpu) {
		lock_acquire(&per_cpu(helper_lock, cpu), 0, 0, 0, 0, NULL, 0);
		lock_release(&per_cpu(helper_lock, cpu), 0, 0, 0, 0, NULL, 0);
	}

	irq_exit(0);
	local_irq_restore(flags);
}

Then, the resume path, when it disables irqs, you can disallow 
irq-enable via:

	local_irq_disable();
	lock_acquire(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL, 0);
	...
	<extensive suspend or resume codepaths, callbacks> 
	...
	lock_release(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL, 0);
	local_irq_enable();

And lockdep will warn if any function inbetween enables IRQs, by 
emitting a splat about incorrectly enabled hardirqs. It will warn 
about the specific place and will emit a relevant backtrace, - not 
just the handler in general.

This should work just fine with current lockdep facilities.

Rafael?

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
  2009-06-23 11:55   ` Ingo Molnar
@ 2009-06-23 14:50     ` Arjan van de Ven
  2009-06-23 16:56     ` Rafael J. Wysocki
  1 sibling, 0 replies; 15+ messages in thread
From: Arjan van de Ven @ 2009-06-23 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Linus Torvalds,
	Andrew Morton, Rafael J. Wysocki

On Tue, 23 Jun 2009 13:55:10 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Sun, 14 Jun 2009, Arjan van de Ven wrote:
> > > Rank 3: getnstimeofday (warning)
> > > 	Reported 309 times (2446 total reports)
> > > 	[suspend resume] getnstimeofday() is called before
> > > timekeeping is
> >         resumed
> > 
> > > Rank 6: hres_timers_resume (warning)
> > > 	Reported 188 times (1024 total reports)
> > > 	[suspend resume] hres_timers_resume() is incorrectly
> > > called with interrupts on
> > 
> > Both have the same root cause. Something enables interrupts in the 
> > early resume path. IIRC, there was a culprit identified recently. 
> > Rafael ?
> 
> This can be debugged automatically today, using lockdep, by using a 
> 'helper lock':
> 
>   static DEFINE_PER_CPU(struct lockdep_map, helper_lock);
> 
> Then mark the lock irq-safe by doing something like:
> 
> static void mark_lock_irqsafe(void)
> {
> 	unsigned long flags;
> 	int cpu;
> 
> 	local_irq_save(flags);
> 	irq_enter(0);
> 
> 	for_each_online_cpu(cpu) {
> 		lock_acquire(&per_cpu(helper_lock, cpu), 0, 0, 0, 0,
> NULL, 0); lock_release(&per_cpu(helper_lock, cpu), 0, 0, 0, 0, NULL,
> 0); }
> 
> 	irq_exit(0);
> 	local_irq_restore(flags);
> }
> 
> Then, the resume path, when it disables irqs, you can disallow 
> irq-enable via:
> 
> 	local_irq_disable();
> 	lock_acquire(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL,
> 0); ...
> 	<extensive suspend or resume codepaths, callbacks> 
> 	...
> 	lock_release(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL,
> 0); local_irq_enable();
> 
> And lockdep will warn if any function inbetween enables IRQs, by 
> emitting a splat about incorrectly enabled hardirqs. It will warn 
> about the specific place and will emit a relevant backtrace, - not 
> just the handler in general.

looks like an interesting approach; it'll pinpoint the bad guy exactly
where he's enabling interrupts.. (assuming he's using kernel facilities
of course)



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
  2009-06-23 11:55   ` Ingo Molnar
  2009-06-23 14:50     ` Arjan van de Ven
@ 2009-06-23 16:56     ` Rafael J. Wysocki
  2009-06-23 16:58       ` Thomas Gleixner
  1 sibling, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2009-06-23 16:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Peter Zijlstra, Arjan van de Ven, LKML,
	Linus Torvalds, Andrew Morton, Venki Pallipadi, Len Brown,
	Benjamin Herrenschmidt, ACPI Devel Maling List

On Tuesday 23 June 2009, Ingo Molnar wrote:
> 
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Sun, 14 Jun 2009, Arjan van de Ven wrote:
> > > Rank 3: getnstimeofday (warning)
> > > 	Reported 309 times (2446 total reports)
> > > 	[suspend resume] getnstimeofday() is called before timekeeping is
> >         resumed
> > 
> > > Rank 6: hres_timers_resume (warning)
> > > 	Reported 188 times (1024 total reports)
> > > 	[suspend resume] hres_timers_resume() is incorrectly called with
> > >       interrupts on
> > 
> > Both have the same root cause. Something enables interrupts in the 
> > early resume path. IIRC, there was a culprit identified recently. 
> > Rafael ?

Apparently, we have smp_call_function_single() called from cpufreq_suspend
via acpi_cpufreq somehow, but I'm still to figure out how this happens.

> This can be debugged automatically today, using lockdep, by using a 
> 'helper lock':
> 
>   static DEFINE_PER_CPU(struct lockdep_map, helper_lock);
> 
> Then mark the lock irq-safe by doing something like:
> 
> static void mark_lock_irqsafe(void)
> {
> 	unsigned long flags;
> 	int cpu;
> 
> 	local_irq_save(flags);
> 	irq_enter(0);
> 
> 	for_each_online_cpu(cpu) {
> 		lock_acquire(&per_cpu(helper_lock, cpu), 0, 0, 0, 0, NULL, 0);
> 		lock_release(&per_cpu(helper_lock, cpu), 0, 0, 0, 0, NULL, 0);
> 	}
> 
> 	irq_exit(0);
> 	local_irq_restore(flags);
> }
> 
> Then, the resume path, when it disables irqs, you can disallow 
> irq-enable via:
> 
> 	local_irq_disable();
> 	lock_acquire(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL, 0);
> 	...
> 	<extensive suspend or resume codepaths, callbacks> 
> 	...
> 	lock_release(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL, 0);
> 	local_irq_enable();
> 
> And lockdep will warn if any function inbetween enables IRQs, by 
> emitting a splat about incorrectly enabled hardirqs. It will warn 
> about the specific place and will emit a relevant backtrace, - not 
> just the handler in general.
> 
> This should work just fine with current lockdep facilities.
> 
> Rafael?

We have some debug code for checking interrupts disabled in sysdev_suspend
and sysdev_resume already and these reports are from 2.6.29 where that code
was not present.

The long term solution for the issue at hand is to clean up the suspend-resume
support in cpufreq so that it doesn't do stupid things like calling
smp_call_function_single() with interrupts disabled, but that requires someone
(I can do it, but I need to dig through the cpufreq code for this purpose) to
figure out how to fix it.

I'm not quite sure if there's an acceptable short term solution, though.

In principle we can do

local_irq_save()
...
local_irq_restore()

around each sysdevs ->susend() and ->resume() in addition to checking the
status of interrupts.  Would that work?

Rafael

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
  2009-06-23 16:56     ` Rafael J. Wysocki
@ 2009-06-23 16:58       ` Thomas Gleixner
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2009-06-23 16:58 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Peter Zijlstra, Arjan van de Ven, LKML,
	Linus Torvalds, Andrew Morton, Venki Pallipadi, Len Brown,
	Benjamin Herrenschmidt, ACPI Devel Maling List

On Tue, 23 Jun 2009, Rafael J. Wysocki wrote:
> The long term solution for the issue at hand is to clean up the suspend-resume
> support in cpufreq so that it doesn't do stupid things like calling
> smp_call_function_single() with interrupts disabled, but that requires someone
> (I can do it, but I need to dig through the cpufreq code for this purpose) to
> figure out how to fix it.
> 
> I'm not quite sure if there's an acceptable short term solution, though.
> 
> In principle we can do
> 
> local_irq_save()
> ...
> local_irq_restore()
> 
> around each sysdevs ->susend() and ->resume() in addition to checking the
> status of interrupts.  Would that work?

Well not really, if the function enables interrupts you run into the
same issue (interrupt service routine calls ktime_get()) again.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
  2009-06-23 10:05 ` Thomas Gleixner
  2009-06-23 11:55   ` Ingo Molnar
@ 2009-06-25 13:03   ` Pádraig Brady
  1 sibling, 0 replies; 15+ messages in thread
From: Pádraig Brady @ 2009-06-25 13:03 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Arjan van de Ven, LKML, Rafael J. Wysocki, Ingo Molnar

Thomas Gleixner wrote:
> On Sun, 14 Jun 2009, Arjan van de Ven wrote:
>> Rank 3: getnstimeofday (warning)
>> 	Reported 309 times (2446 total reports)
>> 	[suspend resume] getnstimeofday() is called before timekeeping is
>         resumed
> 
>> Rank 6: hres_timers_resume (warning)
>> 	Reported 188 times (1024 total reports)
>> 	[suspend resume] hres_timers_resume() is incorrectly called with
>>       interrupts on
> 
> Both have the same root cause. Something enables interrupts in the
> early resume path. IIRC, there was a culprit identified
> recently. Rafael ?

Lots of reports I guess from Fedora 11 users trying to suspend/resume:
https://bugzilla.redhat.com/show_bug.cgi?id=499651

cheers,
Pádraig.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: kerneloops.org report for the week of June 14 2009
  2009-06-19  5:32   ` Arjan van de Ven
@ 2009-06-28  9:11     ` Bob Copeland
  0 siblings, 0 replies; 15+ messages in thread
From: Bob Copeland @ 2009-06-28  9:11 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, torvalds, akpm, tglx, alan, rjw, netdev, mingo,
	dwmw2, alsa-devel, dri-devel

On Fri, Jun 19, 2009 at 1:32 AM, Arjan van de Ven<arjan@infradead.org> wrote:
> On Thu, 18 Jun 2009 13:08:39 -0400
> Bob Copeland <me@bobcopeland.com> wrote:
>> > Rank 9: minstrel_get_rate (warning)
>>
>> I believe this is fixed by http://lkml.org/lkml/2009/6/5/269, now
>> in pre-2.6.31.  Hmm, odd though, we should be seeing reports from
>> other wireless drivers since it's a general memory corruption
>> unrelated to ath5k (but not all use the mac80211 rate controllers).
>
> well.. out of the 534 reports or so that I have by now... 525 have
> ath5k loaded.....
>
> it's not 100% but it'd pretty close.

Indeed, I can replicate this now, looks like some kind of race condition
since it's looking for a rate in the 5 ghz band while I'm communicating
with a 2 ghz AP (during a scan perhaps?).  The warning is harmless but
there is some underlying issue to be fixed.

-- 
Bob Copeland %% www.bobcopeland.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2009-06-28  9:12 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-15  0:33 kerneloops.org report for the week of June 14 2009 Arjan van de Ven
2009-06-15  5:53 ` [alsa-devel] " Takashi Iwai
2009-06-15  5:53   ` Takashi Iwai
2009-06-15  7:06 ` Stefan Schmidt
2009-06-16  7:11   ` WARN_ON's in i915_gem_tiling.c Stefan Schmidt
2009-06-16  7:11     ` Stefan Schmidt
2009-06-18 17:08 ` kerneloops.org report for the week of June 14 2009 Bob Copeland
2009-06-19  5:32   ` Arjan van de Ven
2009-06-28  9:11     ` Bob Copeland
2009-06-23 10:05 ` Thomas Gleixner
2009-06-23 11:55   ` Ingo Molnar
2009-06-23 14:50     ` Arjan van de Ven
2009-06-23 16:56     ` Rafael J. Wysocki
2009-06-23 16:58       ` Thomas Gleixner
2009-06-25 13:03   ` Pádraig Brady

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.