* Update on e1000 troubles (over-heating!)
@ 2002-10-06 3:38 Ben Greear
2002-10-06 3:47 ` Andre Hedrick
0 siblings, 1 reply; 18+ messages in thread
From: Ben Greear @ 2002-10-06 3:38 UTC (permalink / raw)
To: linux-kernel, 'netdev@oss.sgi.com'

I believe I have figured out why the e1000 crashed my machine
after .5 - 1 hours: The NIC was over-heating. I measured one of
the NICs after the machine crashed with an external (cheap) temp
probe. It registered right at 50 degrees C, and this was about 15-30
seconds after it crashed.

The dual e1000 NIC I have seems to run much cooler, and has been
running at 430Mbps bi-directional on both ports for about 6 hours now
with no obvious problems.

So, I'm going to try to purchase some heat sinks and glue them onto
the e1000 server nics, to see if that fixes the problem.

Hope this proves useful to anyone experiencing similar strange
crashes!

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear
* Re: Update on e1000 troubles (over-heating!)
2002-10-06 3:38 Update on e1000 troubles (over-heating!) Ben Greear
@ 2002-10-06 3:47 ` Andre Hedrick
2002-10-06 22:38 ` jamal
0 siblings, 1 reply; 18+ messages in thread
From: Andre Hedrick @ 2002-10-06 3:47 UTC (permalink / raw)
To: Ben Greear; +Cc: linux-kernel, 'netdev@oss.sgi.com'

I have a pair of Compaq e1000's which have never overheated, and I use
them for heavy duty iSCSI testing and designing of drivers. These are
massive 66/64 cards but still nothing like what you are reporting.

I will look some more at the issue soon.

Cheers,

Andre Hedrick
iSCSI Software Solutions Provider
http://www.PyXTechnologies.com/

On Sat, 5 Oct 2002, Ben Greear wrote:

> I believe I have figured out why the e1000 crashed my machine
> after .5 - 1 hours: The NIC was over-heating. I measured one of
> the NICs after the machine crashed with an external (cheap) temp
> probe. It registered right at 50 degrees C, and this was about 15-30
> seconds after it crashed.
>
> The dual e1000 NIC I have seems to run much cooler, and has been
> running at 430Mbps bi-directional on both ports for about 6 hours now
> with no obvious problems.
>
> So, I'm going to try to purchase some heat sinks and glue them onto
> the e1000 server nics, to see if that fixes the problem.
>
> Hope this proves useful to anyone experiencing similar strange
> crashes!
>
> Thanks,
> Ben
* Re: Update on e1000 troubles (over-heating!)
2002-10-06 3:47 ` Andre Hedrick
@ 2002-10-06 22:38 ` jamal
2002-10-07 0:14 ` Andre Hedrick
2002-10-07 3:46 ` Ben Greear
0 siblings, 2 replies; 18+ messages in thread
From: jamal @ 2002-10-06 22:38 UTC (permalink / raw)
To: Andre Hedrick; +Cc: Ben Greear, linux-kernel, 'netdev@oss.sgi.com'

On Sat, 5 Oct 2002, Andre Hedrick wrote:

> I have a pair of Compaq e1000's which have never overheated, and I use
> them for heavy duty iSCSI testing and designing of drivers. These are
> massive 66/64 cards but still nothing like what you are reporting.
>
> I will look some more at the issue soon.

It seems like the prerequisite to reproduce it is that you beat the NIC
heavily with a lot of packets/sec and then run it at that sustained rate
for at least 30 minutes. iSCSI would tend to use MTU-sized packets, which
will not be that effective.

cheers,
jamal
* Re: Update on e1000 troubles (over-heating!)
2002-10-06 22:38 ` jamal
@ 2002-10-07 0:14 ` Andre Hedrick
2002-10-07 11:56 ` jamal
2002-10-07 3:46 ` Ben Greear
1 sibling, 1 reply; 18+ messages in thread
From: Andre Hedrick @ 2002-10-07 0:14 UTC (permalink / raw)
To: jamal; +Cc: Ben Greear, linux-kernel, 'netdev@oss.sgi.com'

However, doing a data integrity test with a pattern buffer
write-verify-read on multi-LUN, multi-session, and multiple connections
per session, while issuing load-balancing commands (i.e. thread tag) over
each session to roast the bandwidth of the line, should be enough.

Now toss in injected errors to randomly fail data PDUs, and call a
sync-and-steering layer to scan the header and/or data digests to execute
a within-connection recovery, regardless of the reason, and that should be
enough to warm up the beast.

If that is not enough, I can toss in multi-initiators all with the
features above, or invoke the interoperability modes to add the Cisco and
IBM initiators (both limited to error recovery level zero, while PyX's is
capable of error recovery level one and part of two).

Please let me know if I need to throttle it harder.

Cheers,

Andre Hedrick
iSCSI Software Solutions Provider
http://www.PyXTechnologies.com/

On Sun, 6 Oct 2002, jamal wrote:

> On Sat, 5 Oct 2002, Andre Hedrick wrote:
>
> > I have a pair of Compaq e1000's which have never overheated, and I use
> > them for heavy duty iSCSI testing and designing of drivers. These are
> > massive 66/64 cards but still nothing like what you are reporting.
> >
> > I will look some more at the issue soon.
>
> It seems like the prerequisite to reproduce it is you beat the NIC heavily
> with a lot of packets/sec and then run it at that sustained rate for at
> least 30 minutes. iSCSI would tend to use MTU sized packets which will
> not be that effective.
>
> cheers,
> jamal
* Re: Update on e1000 troubles (over-heating!)
2002-10-07 0:14 ` Andre Hedrick
@ 2002-10-07 11:56 ` jamal
0 siblings, 0 replies; 18+ messages in thread
From: jamal @ 2002-10-07 11:56 UTC (permalink / raw)
To: Andre Hedrick; +Cc: Ben Greear, linux-kernel, 'netdev@oss.sgi.com'

It does seem like you need a lot of packets over a period of time to
recreate it. So if what you are trying to do can achieve that, you should
be able to reproduce it. How many connections and sessions can you support?

BTW, does iSCSI call for a zero-copy receive?

cheers,
jamal
* Re: Update on e1000 troubles (over-heating!)
2002-10-06 22:38 ` jamal
2002-10-07 0:14 ` Andre Hedrick
@ 2002-10-07 3:46 ` Ben Greear
2002-10-07 5:26 ` David S. Miller
2002-10-07 11:53 ` jamal
1 sibling, 2 replies; 18+ messages in thread
From: Ben Greear @ 2002-10-07 3:46 UTC (permalink / raw)
To: jamal; +Cc: Andre Hedrick, linux-kernel, 'netdev@oss.sgi.com'

jamal wrote:

> It seems like the prerequisite to reproduce it is you beat the NIC heavily
> with a lot of packets/sec and then run it at that sustained rate for at
> least 30 minutes. iSCSI would tend to use MTU sized packets which will
> not be that effective.

I can reproduce my crash using mtu sized pkts running only 50Mbps
send + receive on 2 nics. It took over-night to do it though. Running
as hard as I can with MTU packets will crash it as well, and much quicker.

Interestingly enough, the tg3 NIC (netgear 302t), registered 57 deg C between
the fins of its heat sink in the 32-bit slots. Makes me wonder if my PCI bus
is running too hot :P

Dave says I'm weird and no one else sees these bizarre problems, btw :)

More trouble-shooting to follow this next week.

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear
* Re: Update on e1000 troubles (over-heating!)
2002-10-07 3:46 ` Ben Greear
@ 2002-10-07 5:26 ` David S. Miller
2002-10-07 11:53 ` jamal
1 sibling, 0 replies; 18+ messages in thread
From: David S. Miller @ 2002-10-07 5:26 UTC (permalink / raw)
To: greearb; +Cc: hadi, andre, linux-kernel, netdev

   From: Ben Greear <greearb@candelatech.com>
   Date: Sun, 06 Oct 2002 20:46:42 -0700

   Dave says I'm weird and no one else sees these bizarre problems, btw :)

The only case where I'm really concerned about the health of your
PCI controller is the most recent case you've reported to me where
pci_find_capability(pdev, PCI_CAP_ID_PM) fails. That is just
completely bizarre.

I hope your boards aren't being permanently harmed by your box,
which is overheating. :(
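For readers unfamiliar with the check David mentions: a lookup such as
pci_find_capability(pdev, PCI_CAP_ID_PM) follows a linked list stored in the
device's PCI config space. The Python sketch below only models that walk; it
is not the kernel's code. The register offsets and capability IDs are the
standard PCI ones, while find_capability() and synthetic_e1000_config() are
hypothetical names, and the config space is synthetic, built to mirror the
capability offsets (dc, e4, f0) that the lspci -vv output in this thread
reports for the e1000.

```python
# Standard PCI config-space offsets and capability IDs.
PCI_STATUS = 0x06            # 16-bit status register
PCI_STATUS_CAP_LIST = 0x10   # status bit 4: a capability list is present
PCI_CAPABILITY_LIST = 0x34   # holds the offset of the first capability
PCI_CAP_ID_PM = 0x01         # Power Management
PCI_CAP_ID_AGP = 0x02        # AGP
PCI_CAP_ID_MSI = 0x05        # Message Signalled Interrupts
PCI_CAP_ID_PCIX = 0x07       # PCI-X


def find_capability(cfg: bytes, cap_id: int) -> int:
    """Return the config-space offset of `cap_id`, or 0 if it is absent."""
    status = cfg[PCI_STATUS] | (cfg[PCI_STATUS + 1] << 8)
    if not status & PCI_STATUS_CAP_LIST:
        return 0
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC   # pointers are dword-aligned
    ttl = 48                                # guard against a looping list
    while pos >= 0x40 and ttl > 0:
        if cfg[pos] == 0xFF:                # all-ones read: device not answering
            break
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC           # follow the "next" pointer
        ttl -= 1
    return 0


def synthetic_e1000_config() -> bytearray:
    """Hypothetical 256-byte config space with a PM -> PCI-X -> MSI chain."""
    cfg = bytearray(256)
    cfg[0x00:0x04] = bytes([0x86, 0x80, 0x0F, 0x10])   # vendor 8086, device 100f
    cfg[0x06:0x08] = bytes([0x30, 0x02])               # status 0x0230 (Cap+)
    cfg[PCI_CAPABILITY_LIST] = 0xDC
    cfg[0xDC:0xDE] = bytes([PCI_CAP_ID_PM, 0xE4])
    cfg[0xE4:0xE6] = bytes([PCI_CAP_ID_PCIX, 0xF0])
    cfg[0xF0:0xF2] = bytes([PCI_CAP_ID_MSI, 0x00])     # 0x00 ends the list
    return cfg
```

With a healthy config space the PM lookup returns the offset 0xdc. If every
read comes back as 0xff, which is one way unreliable config-space accesses
can show up, the same lookup returns 0 even though the hardware does have the
capability.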
* Re: Update on e1000 troubles (over-heating!)
2002-10-07 3:46 ` Ben Greear
2002-10-07 5:26 ` David S. Miller
@ 2002-10-07 11:53 ` jamal
2002-10-07 11:58 ` David S. Miller
2002-10-07 16:40 ` Ben Greear
1 sibling, 2 replies; 18+ messages in thread
From: jamal @ 2002-10-07 11:53 UTC (permalink / raw)
To: Ben Greear; +Cc: Andre Hedrick, linux-kernel, 'netdev@oss.sgi.com'

On Sun, 6 Oct 2002, Ben Greear wrote:

> I can reproduce my crash using mtu sized pkts running only 50Mbps
> send + receive on 2 nics. It took over-night to do it though. Running
> as hard as I can with MTU packets will crash it as well, and much
> quicker.

So is there a correlation with packet count then?

> Interestingly enough, the tg3 NIC (netgear 302t), registered 57 deg C between
> the fins of its heat sink in the 32-bit slots. Makes me wonder if my PCI bus
> is running too hot :P

Does the problem happen with the tg3?

cheers,
jamal
* Re: Update on e1000 troubles (over-heating!)
2002-10-07 11:53 ` jamal
@ 2002-10-07 11:58 ` David S. Miller
2002-10-07 16:40 ` Ben Greear
1 sibling, 0 replies; 18+ messages in thread
From: David S. Miller @ 2002-10-07 11:58 UTC (permalink / raw)
To: hadi; +Cc: greearb, andre, linux-kernel, netdev

   From: jamal <hadi@cyberus.ca>
   Date: Mon, 7 Oct 2002 07:53:26 -0400 (EDT)

   Does the problem happen with the tg3?

He gets hangs in one box, inoperable PCI config space accesses
for the cards in another box.
* Re: Update on e1000 troubles (over-heating!)
2002-10-07 11:53 ` jamal
2002-10-07 11:58 ` David S. Miller
@ 2002-10-07 16:40 ` Ben Greear
1 sibling, 0 replies; 18+ messages in thread
From: Ben Greear @ 2002-10-07 16:40 UTC (permalink / raw)
To: jamal; +Cc: Andre Hedrick, linux-kernel, 'netdev@oss.sgi.com'

jamal wrote:

> On Sun, 6 Oct 2002, Ben Greear wrote:
>
>>I can reproduce my crash using mtu sized pkts running only 50Mbps
>>send + receive on 2 nics. It took over-night to do it though. Running
>>as hard as I can with MTU packets will crash it as well, and much
>>quicker.
>
> So is there a correlation with packet count then?

No, running at slower speeds (50Mbps), the packet count was well over
4 billion (i.e. it successfully wrapped 32 bits). At higher speeds, it
crashes before the 32-bit wrap, generally. It also does not correlate
to bytes sent/received, or anything else that I could think of to look at.

>>Interestingly enough, the tg3 NIC (netgear 302t), registered 57 deg C between
>>the fins of its heat sink in the 32-bit slots. Makes me wonder if my PCI bus
>>is running too hot :P
>
> Does the problem happen with the tg3?

As Dave mentioned, tg3 locks up almost immediately (like within 30 seconds),
and in the meantime, it's spitting out errors that are 'impossible'. These
are the messages I sent a day or two ago.

I may have cooked my cards, or something like that, because one of the
tg3s does not work in my other machine now. Still trouble-shooting that
one.

Ben

> cheers,
> jamal

--
Ben Greear <greearb@candelatech.com> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear
* RE: Update on e1000 troubles (over-heating!)
@ 2002-10-06 7:33 Feldman, Scott
2002-10-08 18:44 ` Ben Greear
0 siblings, 1 reply; 18+ messages in thread
From: Feldman, Scott @ 2002-10-06 7:33 UTC (permalink / raw)
To: 'Ben Greear'; +Cc: linux-kernel, 'netdev@oss.sgi.com'
> I believe I have figured out why the e1000 crashed my machine
> after .5 - 1 hours: The NIC was over-heating. I measured
> one of the NICs after the machine crashed with an external
> (cheap) temp probe. It registered right at 50 degrees C, and
> this was about 15-30 seconds after it crashed.
Ben, please send lspci -x on the hot nic.
-scott
* Re: Update on e1000 troubles (over-heating!)
2002-10-06 7:33 Feldman, Scott
@ 2002-10-08 18:44 ` Ben Greear
0 siblings, 0 replies; 18+ messages in thread
From: Ben Greear @ 2002-10-08 18:44 UTC (permalink / raw)
To: Feldman, Scott; +Cc: linux-kernel, 'netdev@oss.sgi.com'

[-- Attachment #1: Type: text/plain, Size: 932 bytes --]

Feldman, Scott wrote:

>>I believe I have figured out why the e1000 crashed my machine
>>after .5 - 1 hours: The NIC was over-heating. I measured
>>one of the NICs after the machine crashed with an external
>>(cheap) temp probe. It registered right at 50 degrees C, and
>>this was about 15-30 seconds after it crashed.
>
> Ben, please send lspci -x on the hot nic.

Here is the lspci information, both -x and -vv. This is with
two of the e1000 single-port NICS side-by-side. I have also
strapped a P-IV CPU fan on top of the two cards to blow some
air over them....running tests now to see if that actually
helps anything. If it does, I'll be sure to send you a picture :)

Thanks,
Ben

> -scott

--
Ben Greear <greearb@candelatech.com> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

[-- Attachment #2: lspci.txt --]
[-- Type: text/plain, Size: 10779 bytes --]

00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11)
00: 22 10 0c 70 06 00 30 22 11 00 00 06 00 40 00 00
10: 08 00 00 f8 08 00 20 f6 91 10 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 a0 00 00 00 00 00 00 00 00 00 00 00

00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge
00: 22 10 0d 70 07 00 20 02 00 00 04 06 00 40 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 44 f1 01 20 22
20: f0 ff 00 00 f0 ff 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 04 00

00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
00: 22 10 40 74 0f 00 20 02 05 00 01 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04)
00: 22 10 41 74 05 00 00 02 04 8a 01 01 00 40 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 f0 00 00 00 00 00 00 00 00 00 00 22 10 41 74
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
00: 22 10 43 74 00 00 80 02 03 00 80 06 00 40 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 22 10 43 74
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:08.0 Ethernet controller: Intel Corp.: Unknown device 100f (rev 01)
00: 86 80 0f 10 17 00 30 02 01 00 00 02 10 40 00 00
10: 04 00 00 f4 00 00 00 00 00 00 00 00 00 00 00 00
20: 01 10 00 00 00 00 00 00 00 00 00 00 86 80 01 10
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0a 01 ff 00

00:09.0 Ethernet controller: Intel Corp.: Unknown device 100f (rev 01)
00: 86 80 0f 10 17 00 30 02 01 00 00 02 10 40 00 00
10: 04 00 02 f4 00 00 00 00 00 00 00 00 00 00 00 00
20: 41 10 00 00 00 00 00 00 00 00 00 00 86 80 01 10
30: 00 00 00 00 dc 00 00 00 00 00 00 00 09 01 ff 00

00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05)
00: 22 10 48 74 17 00 20 22 05 00 04 06 00 63 01 00
10: 00 00 00 00 00 00 00 00 00 02 02 a8 20 20 00 22
20: 10 f4 f0 f5 f0 ff 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 0c 00

02:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-768 [Opus] USB (rev 07)
00: 22 10 49 74 17 00 80 82 07 10 03 0c 00 40 00 00
10: 00 00 10 f4 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 22 10 49 74
30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 04 00 50

02:07.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00: 02 10 52 47 87 00 90 02 27 00 00 03 10 42 00 00
10: 00 00 00 f5 01 20 00 00 00 10 10 f4 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 02 10 08 80
30: 00 00 00 00 5c 00 00 00 00 00 00 00 ff 00 08 00

02:08.0 Ethernet controller: 3Com Corporation 3c980-TX 10/100baseTX NIC [Python-T] (rev 78)
00: b7 10 05 98 17 00 10 02 78 00 00 02 10 50 00 00
10: 01 24 00 00 00 20 10 f4 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 f1 10 62 24
30: 00 00 00 00 dc 00 00 00 00 00 00 00 0b 01 0a 0a

02:09.0 Ethernet controller: 3Com Corporation 3c980-TX 10/100baseTX NIC [Python-T] (rev 78)
00: b7 10 05 98 17 00 10 02 78 00 00 02 10 50 00 00
10: 81 24 00 00 00 24 10 f4 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 f1 10 62 24
30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a

00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
        Latency: 64
        Region 0: Memory at f8000000 (32-bit, prefetchable) [size=64M]
        Region 1: Memory at f6200000 (32-bit, prefetchable) [size=4K]
        Region 2: I/O ports at 1090 [disabled] [size=4]
        Capabilities: [a0] AGP version 2.0
                Status: RQ=15 SBA+ 64bit- FW- Rate=x1,x2
                Command: RQ=0 SBA+ AGP+ 64bit- FW- Rate=<none>

00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=68
        BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B-

00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
        Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0

00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04) (prog-if 8a [Master SecP PriP])
        Subsystem: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64
        Region 4: I/O ports at f000 [size=16]

00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
        Subsystem: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-

00:08.0 Ethernet controller: Intel Corp.: Unknown device 100f (rev 01)
        Subsystem: Intel Corp.: Unknown device 1001
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (63750ns min), cache line size 10
        Interrupt: pin A routed to IRQ 10
        Region 0: Memory at f4000000 (64-bit, non-prefetchable) [size=128K]
        Region 4: I/O ports at 1000 [size=64]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [e4] PCI-X non-bridge device.
                Command: DPERE- ERO+ RBC=0 OST=0
                Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
        Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000 Data: 0000

00:09.0 Ethernet controller: Intel Corp.: Unknown device 100f (rev 01)
        Subsystem: Intel Corp.: Unknown device 1001
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (63750ns min), cache line size 10
        Interrupt: pin A routed to IRQ 9
        Region 0: Memory at f4020000 (64-bit, non-prefetchable) [size=128K]
        Region 4: I/O ports at 1040 [size=64]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [e4] PCI-X non-bridge device.
                Command: DPERE- ERO+ RBC=0 OST=0
                Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
        Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000 Data: 0000

00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
        Latency: 99
        Bus: primary=00, secondary=02, subordinate=02, sec-latency=168
        I/O behind bridge: 00002000-00002fff
        Memory behind bridge: f4100000-f5ffffff
        BridgeCtl: Parity- SERR- NoISA+ VGA+ MAbort- >Reset- FastB2B-

02:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-768 [Opus] USB (rev 07) (prog-if 10 [OHCI])
        Subsystem: Advanced Micro Devices [AMD] AMD-768 [Opus] USB
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+
        Latency: 64 (20000ns max)
        Interrupt: pin D routed to IRQ 10
        Region 0: Memory at f4100000 (32-bit, non-prefetchable) [size=4K]

02:07.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27) (prog-if 00 [VGA])
        Subsystem: ATI Technologies Inc: Unknown device 8008
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 66 (2000ns min), cache line size 10
        Region 0: Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: I/O ports at 2000 [size=256]
        Region 2: Memory at f4101000 (32-bit, non-prefetchable) [size=4K]
        Expansion ROM at <unassigned> [disabled] [size=128K]
        Capabilities: [5c] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-

02:08.0 Ethernet controller: 3Com Corporation 3c980-TX 10/100baseTX NIC [Python-T] (rev 78)
        Subsystem: Tyan Computer: Unknown device 2462
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 80 (2500ns min, 2500ns max), cache line size 10
        Interrupt: pin A routed to IRQ 11
        Region 0: I/O ports at 2400 [size=128]
        Region 1: Memory at f4102000 (32-bit, non-prefetchable) [size=128]
        Expansion ROM at <unassigned> [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=2 PME-

02:09.0 Ethernet controller: 3Com Corporation 3c980-TX 10/100baseTX NIC [Python-T] (rev 78)
        Subsystem: Tyan Computer: Unknown device 2462
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 80 (2500ns min, 2500ns max), cache line size 10
        Interrupt: pin A routed to IRQ 5
        Region 0: I/O ports at 2480 [size=128]
        Region 1: Memory at f4102400 (32-bit, non-prefetchable) [size=128]
        Expansion ROM at <unassigned> [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=2 PME-
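As a reading aid for the hex rows above, the following Python sketch decodes
a few standard PCI type-0 header fields from `lspci -x` output. It is only an
illustration under stated assumptions: parse_lspci_x(), le16() and
decode_header() are hypothetical helper names, and the input rows are copied
from the 00:08.0 (e1000) entry in the attachment. The field offsets are the
standard PCI ones.

```python
def parse_lspci_x(rows):
    """Turn 'NN: b0 b1 ...' rows from `lspci -x` into one bytes object."""
    cfg = bytearray()
    for row in rows:
        # Drop the 'NN:' offset prefix and keep only the hex bytes.
        _, _, hexbytes = row.partition(":")
        cfg += bytes.fromhex(hexbytes)
    return bytes(cfg)


def le16(cfg, off):
    """Read a little-endian 16-bit register at offset `off`."""
    return cfg[off] | (cfg[off + 1] << 8)


def decode_header(cfg):
    """Pick out a few fields of a type-0 PCI configuration header."""
    status = le16(cfg, 0x06)
    return {
        "vendor": le16(cfg, 0x00),
        "device": le16(cfg, 0x02),
        "status": status,
        "has_cap_list": bool(status & 0x10),   # shown as Cap+ by lspci -vv
        "subsys_vendor": le16(cfg, 0x2C),
        "subsys_device": le16(cfg, 0x2E),
        "cap_ptr": cfg[0x34],                  # offset of first capability
        "irq_line": cfg[0x3C],
    }


# Rows for device 00:08.0, copied from the lspci -x output in this message.
E1000_ROWS = [
    "00: 86 80 0f 10 17 00 30 02 01 00 00 02 10 40 00 00",
    "10: 04 00 00 f4 00 00 00 00 00 00 00 00 00 00 00 00",
    "20: 01 10 00 00 00 00 00 00 00 00 00 00 86 80 01 10",
    "30: 00 00 00 00 dc 00 00 00 00 00 00 00 0a 01 ff 00",
]

hdr = decode_header(parse_lspci_x(E1000_ROWS))
for name, value in hdr.items():
    print(name, value if isinstance(value, bool) else hex(value))
```

The decoded values line up with the -vv listing for the same device: vendor
8086 and device 100f, subsystem device 1001, a capability list starting at
dc, and interrupt line 0x0a (IRQ 10).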
* RE: Update on e1000 troubles (over-heating!)
@ 2002-10-15 2:20 Feldman, Scott
2002-10-15 2:37 ` Andi Kleen
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Feldman, Scott @ 2002-10-15 2:20 UTC (permalink / raw)
To: 'Ben Greear'; +Cc: linux-kernel, 'netdev@oss.sgi.com'
> Here is the lspci information, both -x and -vv. This is with
> two of the e1000 single-port NICS side-by-side. I have also
> strapped a P-IV CPU fan on top of the two cards to blow some
> air over them....running tests now to see if that actually
> helps anything. If it does, I'll be sure to send you a picture :)
Ben, I checked the datasheet for the part shown in the lspci dump, and it
shows an operating temperature of 0-55 degrees C. You said you measured 50
degrees C, so you're within the safe range. Did the fans help?
Here's the datasheet:
http://www.intel.com/network/connectivity/resources/doc_library/data_sheets/pro1000mt_sa.pdf
-scott
* Re: Update on e1000 troubles (over-heating!)
2002-10-15 2:20 Feldman, Scott
@ 2002-10-15 2:37 ` Andi Kleen
2002-10-15 2:54 ` Jonathan Lundell
2002-10-15 5:42 ` Dave Hansen
2002-10-15 7:01 ` Ben Greear
2 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2002-10-15 2:37 UTC (permalink / raw)
To: Feldman, Scott
Cc: 'Ben Greear', linux-kernel, 'netdev@oss.sgi.com'

On Mon, Oct 14, 2002 at 07:20:04PM -0700, Feldman, Scott wrote:
> > Here is the lspci information, both -x and -vv. This is with
> > two of the e1000 single-port NICS side-by-side. I have also
> > strapped a P-IV CPU fan on top of the two cards to blow some
> > air over them....running tests now to see if that actually
> > helps anything. If it does, I'll be sure to send you a picture :)
>
> Ben, I checked the datasheet for the part shown in the lspci dump, and it
> shows an operating temperature of 0-55 degrees C. You said you measured 50
> degrees C, so you're within the safe range. Did the fans help?

The thermometer he used likely showed a much lower temperature than what was
actually on the die. 5-10 C more is not unlikely. It's hard to measure chip
temperatures accurately without an on-die thermal diode or special kit.
So I would expect that when an external normal thermometer showed 50C
it was already operating out of spec.

-Andi
* Re: Update on e1000 troubles (over-heating!)
2002-10-15 2:37 ` Andi Kleen
@ 2002-10-15 2:54 ` Jonathan Lundell
0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Lundell @ 2002-10-15 2:54 UTC (permalink / raw)
To: Andi Kleen, Feldman, Scott
Cc: 'Ben Greear', linux-kernel, 'netdev@oss.sgi.com'

At 4:37am +0200 10/15/02, Andi Kleen wrote:
>> Ben, I checked the datasheet for the part shown in the lspci dump, and it
>> shows an operating temperature of 0-55 degrees C. You said you measured 50
>> degrees C, so you're within the safe range. Did the fans help?
>
>The thermometer he used likely showed a much lower temperature than what was
>actually on the die. 5-10 C more is not unlikely. It's hard to measure chip
>temperatures accurately without an on-die thermal diode or special kit.
>So I would expect that when an external normal thermometer showed 50C
>it was already operating out of spec.

The datasheet is for the card, so the operating temperature is surely
ambient, not die temperature. "Ambient measured how?" would be a
reasonable question, though.

--
/Jonathan Lundell.
* Re: Update on e1000 troubles (over-heating!)
2002-10-15 2:20 Feldman, Scott
2002-10-15 2:37 ` Andi Kleen
@ 2002-10-15 5:42 ` Dave Hansen
2002-10-15 7:07 ` Ben Greear
2002-10-15 7:01 ` Ben Greear
2 siblings, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2002-10-15 5:42 UTC (permalink / raw)
To: Feldman, Scott
Cc: 'Ben Greear', linux-kernel, 'netdev@oss.sgi.com'

Feldman, Scott wrote:
>>Here is the lspci information, both -x and -vv. This is with
>>two of the e1000 single-port NICS side-by-side. I have also
>>strapped a P-IV CPU fan on top of the two cards to blow some
>>air over them....running tests now to see if that actually
>>helps anything. If it does, I'll be sure to send you a picture :)
>
> Ben, I checked the datasheet for the part shown in the lspci dump, and it
> shows an operating temperature of 0-55 degrees C. You said you measured 50
> degrees C, so you're within the safe range. Did the fans help?
>
> Here's the datasheet:
> http://www.intel.com/network/connectivity/resources/doc_library/data_sheets/pro1000mt_sa.pdf

I get some strange e1000 failures too. It usually involves the watchdog
kicking them back into order, but sometimes they'll stay offline for a
while. Heat would explain it, though, because it only happens when I'm
actually using the cards for a benchmark. I figured that it was either
my cables, or a shoddy switch.

The new dual-port e1000 that I have doesn't seem to have this problem,
even though I'm running 4 times more traffic than the singles that I had.

--
Dave Hansen
haveblue@us.ibm.com
* Re: Update on e1000 troubles (over-heating!)
2002-10-15 5:42 ` Dave Hansen
@ 2002-10-15 7:07 ` Ben Greear
0 siblings, 0 replies; 18+ messages in thread
From: Ben Greear @ 2002-10-15 7:07 UTC (permalink / raw)
To: Dave Hansen; +Cc: Feldman, Scott, linux-kernel, 'netdev@oss.sgi.com'

Dave Hansen wrote:

> I get some strange e1000 failures too. It usually involves the watchdog
> kicking them back into order, but sometimes they'll stay offline for a
> while. Heat would explain it, though, because it only happens when I'm
> actually using the cards for a benchmark. I figured that it was either
> my cables, or a shoddy switch.
>
> The new dual-port e1000 that I have doesn't seem to have this problem,
> even though I'm running 4 times more traffic than the singles that I had.

That was exactly the behaviour I noticed. I believe it's because when you
run two side-by-side, they cook each other (I'm assuming you didn't run
2 2-ports side-by-side).

Try strapping a fan on them somehow and I bet all your troubles go away
(and maybe your .ibm email will shame Intel into putting heat-sinks and/or
small fans on their NICs... ;)

(I ran two Netgear 302t NICs (tigon-3) side-by-side for 4 days at max speed,
and they didn't drop a single packet, even though their heat-sinks were too
hot to touch!)

Ben

--
Ben Greear <greearb@candelatech.com> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear
* Re: Update on e1000 troubles (over-heating!)
2002-10-15 2:20 Feldman, Scott
2002-10-15 2:37 ` Andi Kleen
2002-10-15 5:42 ` Dave Hansen
@ 2002-10-15 7:01 ` Ben Greear
2 siblings, 0 replies; 18+ messages in thread
From: Ben Greear @ 2002-10-15 7:01 UTC (permalink / raw)
To: Feldman, Scott; +Cc: linux-kernel, 'netdev@oss.sgi.com'

Feldman, Scott wrote:
>>Here is the lspci information, both -x and -vv. This is with
>>two of the e1000 single-port NICS side-by-side. I have also
>>strapped a P-IV CPU fan on top of the two cards to blow some
>>air over them....running tests now to see if that actually
>>helps anything. If it does, I'll be sure to send you a picture :)
>
> Ben, I checked the datasheet for the part shown in the lspci dump, and it
> shows an operating temperature of 0-55 degrees C. You said you measured 50
> degrees C, so you're within the safe range. Did the fans help?

The fan did help, and Andi is right, the chip was much hotter than
what my probe read (I was gently pushing it against the top of the chip,
because it was too hot to really press my finger against it to get good
contact :))

With the fan blowing on the chips, it has been perfect. This implies
to me that if you are going to run the e1000, you need significant
air-flow over the chipset, and the generic 2U chassis that I have
is definitely inadequate, partially because the MB is so big that
the fans are too far away from the PCI slots...

This is all doubly true if you are running two NICs side-by-side, which
is what I was doing.

I am also considering gluing heat-sinks onto the main chip, which may
make it work in more marginal environments.

Ben

--
Ben Greear <greearb@candelatech.com> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear