linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Possibly SATA related freeze killed networking and RAID
@ 2007-11-20 19:09 noah
  2007-11-20 22:05 ` Alan Cox
  0 siblings, 1 reply; 20+ messages in thread
From: noah @ 2007-11-20 19:09 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: linux-ide

I just had a strange freeze that killed networking and made software
RAID fail two of my harddisks.

There are a bunch of messages from the kernel which I extracted from
the system log after reboot at the end of this mail. I hit power off
in pure paranoia after the box froze, and then started to do disk I/O
again just right after I noticed the messages about two of my RAID
disks had failed on the console.
The network didn't recover when the harddrive suddenly started working again.
I managed to connect an USB keyboard and wake up the monitor from
sleep so I could see some of the messages printed on the console.

I looked through some other threads and found a mention of
smartmontools which I too use (5.37-5ubuntu2).

Kernel 2.6.22-14-generic (Ubuntu Gutsy Gibbon 7.10)
Motherboard: Asus M2N32 WS Professional nForce 590 SLI MCP (MCP55)
CPU: Athlon64 X2 Dual-Core 5600+
RAM: 4GB (passed memtest86 just a few minutes ago)

The harddrives are four Samsung HD501LJ 500GB drives.
sda and sdb have firmware CR100-10 and sdc and sdd have firmware CR100-11.
The drives are just a couple of months old, well cooled and so far
there's nothing interesting reported by S.M.A.R.T.

Software raid is configured like this:
sda1,sdc1 -> md0 (raid 1)
sdb1,sdd1 -> md1 (raid 1)
Both md0 and md1 are then encrypted with dm-crypt and the dm-devices
are then used to form md2 (stripe).

  -- noah


# lspci
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:08.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a1)
00:09.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a2)
00:09.1 SMBus: nVidia Corporation MCP55 SMBus (rev a2)
00:09.2 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:0a.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:0a.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:0c.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:0d.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0d.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0d.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:0e.1 Audio device: nVidia Corporation MCP55 High Definition Audio (rev a2)
00:10.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:11.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:12.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:14.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:15.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:16.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:17.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8
[Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation GeForce 8400 GS (rev a1)
02:06.0 Communication controller: Tiger Jet Network Inc. Tiger3XX
Modem/ISDN interface
03:00.0 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X
Bridge (rev 06)
03:00.1 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X
Bridge (rev 06)
08:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE6145
SATA II PCI-E controller (rev a1)



 kernel: [734344.717844] irq 21: nobody cared (try booting with the
"irqpoll" option)
 kernel: [734344.717866]
 kernel: [734344.717866] Call Trace:
 kernel: [734344.717868]  <IRQ>  [__report_bad_irq+30/128]
__report_bad_irq+0x1e/0x80
 mdadm: Fail event detected on md device /dev/md1, component device /dev/sdd1
 kernel: [734344.717888]  [note_interrupt+643/704] note_interrupt+0x283/0x2c0
 kernel: [734344.717895]  [handle_fasteoi_irq+221/272]
handle_fasteoi_irq+0xdd/0x110
 mdadm: Fail event detected on md device /dev/md0, component device /dev/sdc1
 kernel: [734344.717901]  [do_IRQ+123/256] do_IRQ+0x7b/0x100
 kernel: [734344.717904]  [default_idle+0/64] default_idle+0x0/0x40
 kernel: [734344.717907]  [ret_from_intr+0/10] ret_from_intr+0x0/0xa
 kernel: [734344.717909]  <EOI>  [tcp_poll+0/368] tcp_poll+0x0/0x170
 kernel: [734344.717918]  [default_idle+41/64] default_idle+0x29/0x40
 kernel: [734344.717923]  [cpu_idle+112/192] cpu_idle+0x70/0xc0
 kernel: [734344.717936]
 kernel: [734344.717937] handlers:
 kernel: [734344.717950] [_end+131265960/2130332920]
(nv_generic_interrupt+0x0/0xe0 [sata_nv])
 kernel: [734344.717973] [_end+131152728/2130332920]
(nv_nic_irq_optimized+0x0/0x2b0 [forcedeth])
 kernel: [734344.718003] Disabling IRQ #21
 kernel: [734356.827155] ata5.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x2 frozen
 kernel: [734356.827169] ata5.00: cmd
25/00:90:ff:3c:82/00:01:02:00:00/e0 tag 0 cdb 0x0 data 204800 in
 kernel: [734356.827170]          res
40/00:b4:cc:88:7f/40:00:02:00:00/e0 Emask 0x4 (timeout)
 kernel: [734355.620185] ata6.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x2 frozen
 kernel: [734355.620199] ata6.00: cmd
c8/00:08:ef:52:a9/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
 kernel: [734355.620200]          res
40/00:00:02:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
 kernel: [734355.731595] ata6: soft resetting port
 kernel: [734355.787308] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 kernel: [734358.738407] ata5: port is slow to respond, please be
patient (Status 0xd8)
 kernel: [734360.418260] ata5: device not ready (errno=-16), forcing hardreset
 kernel: [734360.418262] ata5: hard resetting port
 kernel: [734360.643958] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 kernel: [734366.509226] ata6.00: qc timeout (cmd 0x27)
 kernel: [734366.509231] ata6.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
 kernel: [734366.509241] ata6.00: failed to set xfermode (err_mask=0x40)
 kernel: [734366.509250] ata6: failed to recover some devices,
retrying in 5 secs
 kernel: [734368.296234] ata6: hard resetting port
 kernel: [734368.521935] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 kernel: [734371.365916] ata5.00: qc timeout (cmd 0x27)
 kernel: [734371.365921] ata5.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
 kernel: [734371.365928] ata5.00: failed to set xfermode (err_mask=0x40)
 kernel: [734371.365936] ata5: failed to recover some devices,
retrying in 5 secs
 kernel: [734373.152942] ata5: hard resetting port
 kernel: [734373.378644] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 kernel: [734379.244112] ata6.00: qc timeout (cmd 0x27)
 kernel: [734379.244118] ata6.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
 kernel: [734379.244126] ata6.00: failed to set xfermode (err_mask=0x40)
 kernel: [734379.244135] ata6: limiting SATA link speed to 1.5 Gbps
 kernel: [734379.244138] ata6.00: limiting speed to UDMA/133:PIO3
 kernel: [734379.244140] ata6: failed to recover some devices,
retrying in 5 secs
 kernel: [734381.031138] ata6: hard resetting port
 kernel: [734381.256840] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
 kernel: [734384.095108] ata5.00: qc timeout (cmd 0x27)
 kernel: [734384.095113] ata5.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
 kernel: [734384.095120] ata5.00: failed to set xfermode (err_mask=0x40)
 kernel: [734384.095129] ata5: limiting SATA link speed to 1.5 Gbps
 kernel: [734384.095131] ata5.00: limiting speed to UDMA/133:PIO3
 kernel: [734384.095133] ata5: failed to recover some devices,
retrying in 5 secs
 kernel: [734385.882133] ata5: hard resetting port
 kernel: [734386.107836] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
 kernel: [734391.979018] ata6.00: qc timeout (cmd 0x27)
 kernel: [734391.979024] ata6.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
 kernel: [734391.979032] ata6.00: failed to set xfermode (err_mask=0x40)
 kernel: [734391.979041] ata6.00: disabled
 kernel: [734392.159007] ata6: EH complete
 kernel: [734392.159018] sd 5:0:0:0: [sdd] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
 kernel: [734392.159022] end_request: I/O error, dev sdd, sector 11096815
 kernel: [734392.159026] raid1: sdd1: rescheduling sector 11096752
 kernel: [734394.574692] sd 5:0:0:0: [sdd] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
 kernel: [734394.574697] end_request: I/O error, dev sdd, sector 976767935
 kernel: [734394.574703] md: super_written gets error=-5, uptodate=0
 kernel: [734394.574706] raid1: Disk failure on sdd1, disabling device.
 kernel: [734394.574707] ^IOperation continuing on 1 devices
 kernel: [734394.574960] sd 5:0:0:0: [sdd] READ CAPACITY failed
 kernel: [734394.574961] sd 5:0:0:0: [sdd] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
 kernel: [734394.574964] sd 5:0:0:0: [sdd] Sense not available.
 kernel: [734394.575233] sd 5:0:0:0: [sdd] Write Protect is off
 kernel: [734394.575235] sd 5:0:0:0: [sdd] Mode Sense: 00 00 00 00
 kernel: [734394.575275] sd 5:0:0:0: [sdd] Asking for cache data failed
 kernel: [734394.575282] sd 5:0:0:0: [sdd] Assuming drive cache: write through
 kernel: [734394.577622] RAID1 conf printout:
 kernel: [734394.577624]  --- wd:1 rd:2
 kernel: [734394.577626]  disk 0, wo:1, o:0, dev:sdd1
 kernel: [734394.577627]  disk 1, wo:0, o:1, dev:sdb1
 kernel: [734394.583008] RAID1 conf printout:
 kernel: [734394.583010]  --- wd:1 rd:2
 kernel: [734394.583011]  disk 1, wo:0, o:1, dev:sdb1
 kernel: [734394.593189] raid1: sdb1: redirecting sector 11096752 to
another mirror
 kernel: [734399.802574] ata5.00: qc timeout (cmd 0x27)
 kernel: [734399.802580] ata5.00: ata_hpa_resize 1: hpa sectors (0) is
smaller than sectors (976773168)
 kernel: [734399.802588] ata5.00: failed to set xfermode (err_mask=0x40)
 kernel: [734399.802606] ata5.00: disabled
 kernel: [734400.306547] ata5: EH complete
 kernel: [734400.306601] sd 4:0:0:0: [sdc] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
 kernel: [734400.306605] end_request: I/O error, dev sdc, sector 42089727
 kernel: [734400.306608] raid1: sdc1: rescheduling sector 42089664
 kernel: [734400.306626] raid1: sdc1: rescheduling sector 42089672
 kernel: [734400.306642] raid1: sdc1: rescheduling sector 42089680
 kernel: [734400.306658] raid1: sdc1: rescheduling sector 42089688
 kernel: [734400.306674] raid1: sdc1: rescheduling sector 42089696
 kernel: [734400.306690] raid1: sdc1: rescheduling sector 42089704
 kernel: [734400.306706] raid1: sdc1: rescheduling sector 42089712
 kernel: [734400.306722] raid1: sdc1: rescheduling sector 42089720
 kernel: [734400.306738] raid1: sdc1: rescheduling sector 42089728
 kernel: [734400.307304] sd 4:0:0:0: [sdc] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
 kernel: [734400.307307] end_request: I/O error, dev sdc, sector 976767935
 kernel: [734400.307313] md: super_written gets error=-5, uptodate=0
 kernel: [734400.307315] raid1: Disk failure on sdc1, disabling device.
 kernel: [734400.307316] ^IOperation continuing on 1 devices
 kernel: [734400.307643] sd 4:0:0:0: [sdc] READ CAPACITY failed
 kernel: [734400.307645] sd 4:0:0:0: [sdc] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
 kernel: [734400.307647] sd 4:0:0:0: [sdc] Sense not available.
 kernel: [734400.307767] sd 4:0:0:0: [sdc] Write Protect is off
 kernel: [734400.307768] sd 4:0:0:0: [sdc] Mode Sense: 00 00 00 00
 kernel: [734400.307947] sd 4:0:0:0: [sdc] Asking for cache data failed
 kernel: [734400.307963] sd 4:0:0:0: [sdc] Assuming drive cache: write through
 kernel: [734400.319399] RAID1 conf printout:
 kernel: [734400.319401]  --- wd:1 rd:2
 kernel: [734400.319402]  disk 0, wo:1, o:0, dev:sdc1
 kernel: [734400.319404]  disk 1, wo:0, o:1, dev:sda1
 kernel: [734400.330537] RAID1 conf printout:
 kernel: [734400.330539]  --- wd:1 rd:2
 kernel: [734400.330540]  disk 1, wo:0, o:1, dev:sda1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-20 19:09 Possibly SATA related freeze killed networking and RAID noah
@ 2007-11-20 22:05 ` Alan Cox
  2007-11-20 22:16   ` noah
  2007-11-26 12:06   ` Pavel Machek
  0 siblings, 2 replies; 20+ messages in thread
From: Alan Cox @ 2007-11-20 22:05 UTC (permalink / raw)
  To: noah; +Cc: Linux Kernel Mailing List, linux-ide

>  kernel: [734344.717844] irq 21: nobody cared (try booting with the
> "irqpoll" option)
>  kernel: [734344.717866]

Your machine decided to emit interrupt 21 without an apparent reason.
Whatever caused that made the kernel shut down IRQ 21 at which point the
disk drives on that IRQ were no longer being serviced. Everything on IRQ
21 would have died - which may be why your networking failed too.

What do you have on IRQ 21 and is this a one off ?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-20 22:05 ` Alan Cox
@ 2007-11-20 22:16   ` noah
  2007-11-21  0:45     ` Alan Cox
  2007-11-26 12:06   ` Pavel Machek
  1 sibling, 1 reply; 20+ messages in thread
From: noah @ 2007-11-20 22:16 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List, linux-ide

2007/11/20, Alan Cox <alan@lxorguk.ukuu.org.uk>:
> >  kernel: [734344.717844] irq 21: nobody cared (try booting with the
> > "irqpoll" option)
> >  kernel: [734344.717866]
>
> Your machine decided to emit interrupt 21 without an apparent reason.
> Whatever caused that made the kernel shut down IRQ 21 at which point the
> disk drives on that IRQ were no longer being serviced. Everything on IRQ
> 21 would have died - which may be why your networking failed too.
>
> What do you have on IRQ 21 and is this a one off ?

I've had other freezes before but this was the first time I was able
to see what was actually going on.
IRQ 21 appears to be shared between sata_nv and ethernet.

Does this mean my hardware/BIOS is broken somehow?
I'm running the latest BIOS available.

# cat /proc/interrupts            CPU0       CPU1
  0:      26497    3603163   IO-APIC-edge      timer
  1:          0          2   IO-APIC-edge      i8042
  8:          0          0   IO-APIC-edge      rtc
 9:          0          0   IO-APIC-fasteoi   acpi
 12:          0          6   IO-APIC-edge      i8042
 16:       4851     669159   IO-APIC-fasteoi   shpchp, libata
 20:          0          0   IO-APIC-fasteoi   sata_nv
 21:      36443    4775430   IO-APIC-fasteoi   sata_nv, eth0
 22:      31261    4531218   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
 23:          4       1649   IO-APIC-fasteoi   HDA Intel, ehci_hcd:usb2
NMI:          0          0
LOC:    3629562    3629543
ERR:          0

  -- noah

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-20 22:16   ` noah
@ 2007-11-21  0:45     ` Alan Cox
  2007-11-21 19:06       ` noah
  0 siblings, 1 reply; 20+ messages in thread
From: Alan Cox @ 2007-11-21  0:45 UTC (permalink / raw)
  To: noah; +Cc: Linux Kernel Mailing List, linux-ide

> I've had other freezes before but this was the first time I was able
> to see what was actually going on.
> IRQ 21 appears to be shared between sata_nv and ethernet.
> 
> Does this mean my hardware/BIOS is broken somehow?

Not neccessarily. It could a bug in one of the drivers using IRQ 21
(sata_nv or the nvidia ethernet), it could be another inactive device, or
it could be a hardware funny.

Nvidia stuff can be quite hard to diagnose as we have no documentation
but we can try. The first question is whether it is network or disk
triggered - seeing if heavy loads to one or the other trigger the problem
might be a first plan.


Alan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-21  0:45     ` Alan Cox
@ 2007-11-21 19:06       ` noah
  2007-12-10 12:33         ` noah
  0 siblings, 1 reply; 20+ messages in thread
From: noah @ 2007-11-21 19:06 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List, linux-ide

2007/11/21, Alan Cox <alan@lxorguk.ukuu.org.uk>:
> > I've had other freezes before but this was the first time I was able
> > to see what was actually going on.
> > IRQ 21 appears to be shared between sata_nv and ethernet.
> >
> > Does this mean my hardware/BIOS is broken somehow?
>
> Not neccessarily. It could a bug in one of the drivers using IRQ 21
> (sata_nv or the nvidia ethernet), it could be another inactive device, or
> it could be a hardware funny.

How can I tell if there's an inactive device?

> Nvidia stuff can be quite hard to diagnose as we have no documentation
> but we can try. The first question is whether it is network or disk
> triggered - seeing if heavy loads to one or the other trigger the problem
> might be a first plan.

I haven't managed to trigger it again yet but at the time the CPU was
heavily loaded and I was re-indexing a database which caused a lot of
disk activity. I'm quite confident the network was pretty much idle at
the time.

  -- noah

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-30 13:13         ` Alan Cox
@ 2007-11-26  7:55           ` Pavel Machek
  2007-11-30 15:00             ` Mark Lord
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2007-11-26  7:55 UTC (permalink / raw)
  To: Alan Cox
  Cc: Phillip Susi, Tejun Heo, noah, Linux Kernel Mailing List, linux-ide

On Fri 2007-11-30 13:13:44, Alan Cox wrote:
> > Why does a single spurious interrupt cause it to be shut down?  I can 
> 
> It doesn't.
> 
> > see if the interrupt is stuck on and keeps interrupting constantly, but 
> > if it's just the occasional spurious interrupt, why not just ignore it 
> > and move on?
> 
> The interrupt is usually level triggered so it continues to create
> interrupts until you silence it. The thresholds are about 10,000
> interrupt events and on newer kernels we also reset the count if we don't
> see any for a while. That works for most stuff except the thinkpad
> bluetooth problem.

Which is confirmed hw problem now, btw.
						Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-20 22:05 ` Alan Cox
  2007-11-20 22:16   ` noah
@ 2007-11-26 12:06   ` Pavel Machek
  2007-11-28  1:55     ` Tejun Heo
  1 sibling, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2007-11-26 12:06 UTC (permalink / raw)
  To: Alan Cox; +Cc: noah, Linux Kernel Mailing List, linux-ide

Hi!

> >  kernel: [734344.717844] irq 21: nobody cared (try booting with the
> > "irqpoll" option)
> >  kernel: [734344.717866]
> 
> Your machine decided to emit interrupt 21 without an apparent reason.
> Whatever caused that made the kernel shut down IRQ 21 at which point the
> disk drives on that IRQ were no longer being serviced. Everything on IRQ
> 21 would have died - which may be why your networking failed too.

Hmm, perhaps that 'nobody cared' message should be worded more
strongly, and printed and KERN_CRIT?
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-26 12:06   ` Pavel Machek
@ 2007-11-28  1:55     ` Tejun Heo
  2007-11-29 22:16       ` Phillip Susi
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2007-11-28  1:55 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Alan Cox, noah, Linux Kernel Mailing List, linux-ide

Pavel Machek wrote:
> Hi!
> 
>>>  kernel: [734344.717844] irq 21: nobody cared (try booting with the
>>> "irqpoll" option)
>>>  kernel: [734344.717866]
>> Your machine decided to emit interrupt 21 without an apparent reason.
>> Whatever caused that made the kernel shut down IRQ 21 at which point the
>> disk drives on that IRQ were no longer being serviced. Everything on IRQ
>> 21 would have died - which may be why your networking failed too.
> 
> Hmm, perhaps that 'nobody cared' message should be worded more
> strongly, and printed and KERN_CRIT?

Agreed.  Nobody cared on ATA controllers is usually very effective at
taking the whole machine down.  Is there any reason why we don't turn on
irqpoll on turned off IRQs automatically?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-28  1:55     ` Tejun Heo
@ 2007-11-29 22:16       ` Phillip Susi
  2007-11-30  0:02         ` Tejun Heo
  2007-11-30 13:13         ` Alan Cox
  0 siblings, 2 replies; 20+ messages in thread
From: Phillip Susi @ 2007-11-29 22:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Machek, Alan Cox, noah, Linux Kernel Mailing List, linux-ide

Tejun Heo wrote:
> Agreed.  Nobody cared on ATA controllers is usually very effective at
> taking the whole machine down.  Is there any reason why we don't turn on
> irqpoll on turned off IRQs automatically?

Why does a single spurious interrupt cause it to be shut down?  I can 
see if the interrupt is stuck on and keeps interrupting constantly, but 
if it's just the occasional spurious interrupt, why not just ignore it 
and move on?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-29 22:16       ` Phillip Susi
@ 2007-11-30  0:02         ` Tejun Heo
  2007-11-30 18:45           ` Phillip Susi
  2007-11-30 13:13         ` Alan Cox
  1 sibling, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2007-11-30  0:02 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Pavel Machek, Alan Cox, noah, Linux Kernel Mailing List, linux-ide

Phillip Susi wrote:
> Tejun Heo wrote:
>> Agreed.  Nobody cared on ATA controllers is usually very effective at
>> taking the whole machine down.  Is there any reason why we don't turn on
>> irqpoll on turned off IRQs automatically?
> 
> Why does a single spurious interrupt cause it to be shut down?  I can
> see if the interrupt is stuck on and keeps interrupting constantly, but
> if it's just the occasional spurious interrupt, why not just ignore it
> and move on?

Because SFF ATA controller don't have IRQ pending bit.  You don't know
whether IRQ is raised or not.  Plus, accessing the status register which
clears pending IRQ can be very slow on PATA machines.  It has to go
through the PCI and ATA bus and come back.  So, unconditionally trying
to clear IRQ by accessing Status can incur noticeable overhead if the
IRQ is shared with devices which raise a lot of IRQs.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-29 22:16       ` Phillip Susi
  2007-11-30  0:02         ` Tejun Heo
@ 2007-11-30 13:13         ` Alan Cox
  2007-11-26  7:55           ` Pavel Machek
  1 sibling, 1 reply; 20+ messages in thread
From: Alan Cox @ 2007-11-30 13:13 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Tejun Heo, Pavel Machek, noah, Linux Kernel Mailing List, linux-ide

> Why does a single spurious interrupt cause it to be shut down?  I can 

It doesn't.

> see if the interrupt is stuck on and keeps interrupting constantly, but 
> if it's just the occasional spurious interrupt, why not just ignore it 
> and move on?

The interrupt is usually level triggered so it continues to create
interrupts until you silence it. The thresholds are about 10,000
interrupt events and on newer kernels we also reset the count if we don't
see any for a while. That works for most stuff except the thinkpad
bluetooth problem.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-26  7:55           ` Pavel Machek
@ 2007-11-30 15:00             ` Mark Lord
  2007-11-30 20:25               ` Pavel Machek
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Lord @ 2007-11-30 15:00 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Phillip Susi, Tejun Heo, noah,
	Linux Kernel Mailing List, linux-ide

Pavel Machek wrote:
> On Fri 2007-11-30 13:13:44, Alan Cox wrote:
>>> Why does a single spurious interrupt cause it to be shut down?  I can 
>> It doesn't.
>>
>>> see if the interrupt is stuck on and keeps interrupting constantly, but 
>>> if it's just the occasional spurious interrupt, why not just ignore it 
>>> and move on?
>> The interrupt is usually level triggered so it continues to create
>> interrupts until you silence it. The thresholds are about 10,000
>> interrupt events and on newer kernels we also reset the count if we don't
>> see any for a while. That works for most stuff except the thinkpad
>> bluetooth problem.
> 
> Which is confirmed hw problem now, btw.
...

What problem is that, exactly?

My Dell has an internal USB BT adapter that briefly appears
and then disappears again on resume (or stays if I have "enabled" it
via the BIOS key).

I wonder if that has anything to do with the (new in) 2.6.23 pauses
that machine has on resume (about every 10th time).

Cheers

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-30  0:02         ` Tejun Heo
@ 2007-11-30 18:45           ` Phillip Susi
  2007-11-30 23:56             ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Phillip Susi @ 2007-11-30 18:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Machek, Alan Cox, noah, Linux Kernel Mailing List, linux-ide

Tejun Heo wrote:
> Because SFF ATA controller don't have IRQ pending bit.  You don't know
> whether IRQ is raised or not.  Plus, accessing the status register which
> clears pending IRQ can be very slow on PATA machines.  It has to go
> through the PCI and ATA bus and come back.  So, unconditionally trying
> to clear IRQ by accessing Status can incur noticeable overhead if the
> IRQ is shared with devices which raise a lot of IRQs.

There HAS to be a way to determine if that device generated the 
interrupt, or the interrupt can not be shared.  Since the kernel said 
nobody cared about the interrupt, that indicates that the sata driver 
checked the status register and realized the sata chip didn't generate 
the interrupt, and returned to the kernel letting it know that the 
interrupt was not for it.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-30 15:00             ` Mark Lord
@ 2007-11-30 20:25               ` Pavel Machek
  0 siblings, 0 replies; 20+ messages in thread
From: Pavel Machek @ 2007-11-30 20:25 UTC (permalink / raw)
  To: Mark Lord
  Cc: Alan Cox, Phillip Susi, Tejun Heo, noah,
	Linux Kernel Mailing List, linux-ide

On Fri 2007-11-30 10:00:55, Mark Lord wrote:
> Pavel Machek wrote:
>> On Fri 2007-11-30 13:13:44, Alan Cox wrote:
>>>> Why does a single spurious interrupt cause it to be shut down?  I can 
>>> It doesn't.
>>>
>>>> see if the interrupt is stuck on and keeps interrupting constantly, but 
>>>> if it's just the occasional spurious interrupt, why not just ignore it 
>>>> and move on?
>>> The interrupt is usually level triggered so it continues to create
>>> interrupts until you silence it. The thresholds are about 10,000
>>> interrupt events and on newer kernels we also reset the count if we don't
>>> see any for a while. That works for most stuff except the thinkpad
>>> bluetooth problem.
>> Which is confirmed hw problem now, btw.
> ...
>
> What problem is that, exactly?

Spurious interrupt, interrupt link is disabled after ~15 minutes. It
seems pretty unique to t61.

> My Dell has an internal USB BT adapter that briefly appears
> and then disappears again on resume (or stays if I have "enabled" it
> via the BIOS key).
>
> I wonder if that has anything to do with the (new in) 2.6.23 pauses
> that machine has on resume (about every 10th time).

No idea, but t61 problem seems different.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-30 18:45           ` Phillip Susi
@ 2007-11-30 23:56             ` Tejun Heo
  2007-12-03 17:15               ` Phillip Susi
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2007-11-30 23:56 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Pavel Machek, Alan Cox, noah, Linux Kernel Mailing List, linux-ide

Phillip Susi wrote:
> Tejun Heo wrote:
>> Because SFF ATA controller don't have IRQ pending bit.  You don't know
>> whether IRQ is raised or not.  Plus, accessing the status register which
>> clears pending IRQ can be very slow on PATA machines.  It has to go
>> through the PCI and ATA bus and come back.  So, unconditionally trying
>> to clear IRQ by accessing Status can incur noticeable overhead if the
>> IRQ is shared with devices which raise a lot of IRQs.
> 
> There HAS to be a way to determine if that device generated the
> interrupt, or the interrupt can not be shared.  Since the kernel said
> nobody cared about the interrupt, that indicates that the sata driver
> checked the status register and realized the sata chip didn't generate
> the interrupt, and returned to the kernel letting it know that the
> interrupt was not for it.

Surprise, surprise.  There's no way to tell whether the controller
raised interrupt or not if command is not in progress.  As I said
before, there's no IRQ pending bit.  While processing commands, you can
tell by looking at other status registers but when there's nothing in
flight and the controller determines it's a good time to raise a
spurious interrupt, there's no way you can tell.  That dang SFF
interface is like 15+ years old.

But we can still make things pretty robust.  We're working on it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-30 23:56             ` Tejun Heo
@ 2007-12-03 17:15               ` Phillip Susi
  2007-12-04  1:32                 ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Phillip Susi @ 2007-12-03 17:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Machek, Alan Cox, noah, Linux Kernel Mailing List, linux-ide

Tejun Heo wrote:
> Surprise, surprise.  There's no way to tell whether the controller
> raised interrupt or not if command is not in progress.  As I said
> before, there's no IRQ pending bit.  While processing commands, you can
> tell by looking at other status registers but when there's nothing in
> flight and the controller determines it's a good time to raise a
> spurious interrupt, there's no way you can tell.  That dang SFF
> interface is like 15+ years old.
> 
> But we can still make things pretty robust.  We're working on it.
> 
> Thanks.
> 

It sounds like you mean that you know the controller did NOT raise the 
interrupt ( intentionally/correctly ) if there was no command in 
progress, as opposed to not being able to tell.  Unless there is some 
condition under which it is valid for the controller to raise an 
interrupt when it had no commands in progress?  And if that's the case 
and there's know way to know WHY, that's a broken design.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-12-03 17:15               ` Phillip Susi
@ 2007-12-04  1:32                 ` Tejun Heo
  0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2007-12-04  1:32 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Pavel Machek, Alan Cox, noah, Linux Kernel Mailing List, linux-ide

Phillip Susi wrote:
> Tejun Heo wrote:
>> Surprise, surprise.  There's no way to tell whether the controller
>> raised interrupt or not if command is not in progress.  As I said
>> before, there's no IRQ pending bit.  While processing commands, you can
>> tell by looking at other status registers but when there's nothing in
>> flight and the controller determines it's a good time to raise a
>> spurious interrupt, there's no way you can tell.  That dang SFF
>> interface is like 15+ years old.
>>
>> But we can still make things pretty robust.  We're working on it.
> 
> It sounds like you mean that you know the controller did NOT raise the
> interrupt ( intentionally/correctly ) if there was no command in
> progress, as opposed to not being able to tell.  Unless there is some
> condition under which it is valid for the controller to raise an
> interrupt when it had no commands in progress?  And if that's the case
> and there's know way to know WHY, that's a broken design.

If everything works correctly, all interrupts can be accounted for.
It's just that there's no margin for erratic behaviors and most ATA
controllers are built really cheap.  So, yeah, it's a 15+ years old
half-broken design.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
  2007-11-21 19:06       ` noah
@ 2007-12-10 12:33         ` noah
  0 siblings, 0 replies; 20+ messages in thread
From: noah @ 2007-12-10 12:33 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List, linux-ide

2007/11/21, noah <noah123@gmail.com>:
> 2007/11/21, Alan Cox <alan@lxorguk.ukuu.org.uk>:
> > > I've had other freezes before but this was the first time I was able
> > > to see what was actually going on.
> > > IRQ 21 appears to be shared between sata_nv and ethernet.
> > >
> > > Does this mean my hardware/BIOS is broken somehow?
> >
> > Not neccessarily. It could a bug in one of the drivers using IRQ 21
> > (sata_nv or the nvidia ethernet), it could be another inactive device, or
> > it could be a hardware funny.
>
> How can I tell if there's an inactive device?
>
> > Nvidia stuff can be quite hard to diagnose as we have no documentation
> > but we can try. The first question is whether it is network or disk
> > triggered - seeing if heavy loads to one or the other trigger the problem
> > might be a first plan.
>
> I haven't managed to trigger it again yet but at the time the CPU was
> heavily loaded and I was re-indexing a database which caused a lot of
> disk activity. I'm quite confident the network was pretty much idle at
> the time.

The same thing has happened twice now, both during the weekly check of
the md0 and md1 RAID1-arrays. That is, networking on the primary
interface is dead. It's interrupt (irq 21) is shared between sata_nv
and forcedeth.

Is there anything I can do to debug this problem?

I don't have access to the logs right now but will have later.

  -- noah

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
@ 2007-12-10 21:25 Thiemo Nagel
  0 siblings, 0 replies; 20+ messages in thread
From: Thiemo Nagel @ 2007-12-10 21:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: noah123

Hello,

I think, I'm experiencing the same problem:

09:16:34 : NETDEV WATCHDOG: eth0: transmit timed out
09:16:34 : eth0: Got tx_timeout. irq: 00000000
09:16:34 : eth0: Ring at 37e50000
09:16:34 : eth0: Dumping tx registers
09:16:34 :   0: 00000000 000000ff 00000003 025003ca 00000000 00000000
00000000 00000000
09:16:34 :  20: 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000

[...]

09:16:54 : ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
09:16:54 : ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
09:16:54 : ata6.00: cmd 25/00:08:1e:97:48/00:00:19:00:00/e0 tag 0 cdb 0x0
data 4096 in
09:16:54 :          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4
(timeout)
09:16:54 : ata5.00: cmd 25/00:70:1e:97:48/00:00:19:00:00/e0 tag 0 cdb 0x0
data 57344 in
09:16:54 :          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4
(timeout)
09:16:54 : ata6: soft resetting port
09:16:54 : ata5: soft resetting port
09:16:54 : ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
09:16:54 : ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
09:16:54 : NETDEV WATCHDOG: eth0: transmit timed out
09:16:54 : eth0: Got tx_timeout. irq: 00000032
09:16:54 : eth0: Ring at 37e50000
09:16:54 : eth0: Dumping tx registers

A more complete log can be found at:
http://www.e18.physik.tu-muenchen.de/~tnagel/misc/kernel-crash.log

The setup is strikingly similar to that of noah (I'm quoting all of this
by heart, if somebody is interested in more detail, just ask.):

Kernel: 2.6.22 (amd64, Debian patches, tainted)
Mainboard: Asus M2N-SLI Deluxe (nForce 570 SLI MCP --> MCP55, same as noah)
CPU: Athlon64 Dual-Core (same as noah)
RAM: 1GB
HD: 22 x Samsung HD501LJ 500GB (same as noah), 1-6 connected to chipset,
7-22 connected to RocketRaid 2340.

I'm using software RAID like noah, (levels 1, 5 and 6), and like with noah
the problem occurred during RAID check, in my case during heavy NFS load
which had been ongoing for ~4 days.  This is the third time, it has
happened, but only this time I could catch the logs via netconsole.  The
two affected drives are connected to the chipset and show no SMART errors.

Unfortunately, the kernel is tainted since I'm using HighPoint's drivers
for the RR2340.  I don't know whether I can change this easily.

Kind regards,

Thiemo Nagel


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Possibly SATA related freeze killed networking and RAID
       [not found]       ` <fa.0muxGr+d7CjnQ/J5GaB+TSqouuU@ifi.uio.no>
@ 2007-11-30  0:03         ` Robert Hancock
  0 siblings, 0 replies; 20+ messages in thread
From: Robert Hancock @ 2007-11-30  0:03 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Tejun Heo, Pavel Machek, Alan Cox, noah,
	Linux Kernel Mailing List, linux-ide

Phillip Susi wrote:
> Tejun Heo wrote:
>> Agreed.  Nobody cared on ATA controllers is usually very effective at
>> taking the whole machine down.  Is there any reason why we don't turn on
>> irqpoll on turned off IRQs automatically?
> 
> Why does a single spurious interrupt cause it to be shut down?  I can 
> see if the interrupt is stuck on and keeps interrupting constantly, but 
> if it's just the occasional spurious interrupt, why not just ignore it 
> and move on?

I'm not certain offhand, but I think there may be such a threshold. 
However, an occasional spurious interrupt isn't likely. For a 
level-triggered interrupt, an unhandled interrupt will keep interrupting 
forever since nobody knows how to clear it (until we decide to disable 
the IRQ entirely).

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2007-12-10 21:52 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-20 19:09 Possibly SATA related freeze killed networking and RAID noah
2007-11-20 22:05 ` Alan Cox
2007-11-20 22:16   ` noah
2007-11-21  0:45     ` Alan Cox
2007-11-21 19:06       ` noah
2007-12-10 12:33         ` noah
2007-11-26 12:06   ` Pavel Machek
2007-11-28  1:55     ` Tejun Heo
2007-11-29 22:16       ` Phillip Susi
2007-11-30  0:02         ` Tejun Heo
2007-11-30 18:45           ` Phillip Susi
2007-11-30 23:56             ` Tejun Heo
2007-12-03 17:15               ` Phillip Susi
2007-12-04  1:32                 ` Tejun Heo
2007-11-30 13:13         ` Alan Cox
2007-11-26  7:55           ` Pavel Machek
2007-11-30 15:00             ` Mark Lord
2007-11-30 20:25               ` Pavel Machek
     [not found] <fa.hz1AWBbuhaxCXiVbvWdM1r83meE@ifi.uio.no>
     [not found] ` <fa.fUEfthqYoWrlov4j7OjtVSgx42g@ifi.uio.no>
     [not found]   ` <fa.qGibmi9xfm4JHJhvn0KM4rdFM+M@ifi.uio.no>
     [not found]     ` <fa.YcQmwdppCKfJhxSMgOb0RJPmHg8@ifi.uio.no>
     [not found]       ` <fa.0muxGr+d7CjnQ/J5GaB+TSqouuU@ifi.uio.no>
2007-11-30  0:03         ` Robert Hancock
2007-12-10 21:25 Thiemo Nagel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).