* e1000e interface hang on 82574L @ 2011-12-27 22:01 Chris Boot 2011-12-27 22:33 ` Dave Taht 2011-12-31 9:31 ` Chris Boot 0 siblings, 2 replies; 32+ messages in thread From: Chris Boot @ 2011-12-27 22:01 UTC (permalink / raw) To: netdev, lkml Hi folks, Another networking issue I've run into, this time with e1000e (Intel Corporation 82574L Gigabit). My new VM cluster appears to drop a NIC - the port stops responding within Linux and shows the link as being down with ethtool. My ISP says 'Ports running Half Duplex or reduced speed' on the port. When the port stops working I see this in dmesg: [35481.659629] ------------[ cut here ]------------ [35481.667837] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0xe9/0x148() [35481.676370] Hardware name: X9SCL/X9SCM [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out [35481.684795] Modules linked in: hmac sha256_generic dlm configfs ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode xt_NOTRACK ip_set_hash_net act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_connlimit xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp ip6_queue nf_nat_amanda xt_set ip_set nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda xt_TPROXY xt_NFLOG nfnetlink_log nf_tproxy_core xt_time xt_TCPMSS xt_tcpmss xt_sctp xt_policy xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange xt_helper 
xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT ip6t_LOG ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_raw ipt_LOG xt_tcpudp ip6table_mangle xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables ip6table_filter ip6_tables x_tables bridge stp bonding w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache cn loop kvm_intel kvm snd_pcm snd_timer snd iTCO_wdt soundcore psmouse snd_page_alloc i2c_i801 i2c_core cdc_acm iTCO_vendor_support joydev evdev serio_raw processor button pcspkr thermal_sys ext4 mbcache jbd2 crc16 dm_mod raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid hid ahci libahci libata igb ehci_hcd scsi_mod usbcore e1000e dca usb_common [last unloaded: scsi_wait_scan] [35481.685740] Pid: 0, comm: swapper/4 Not tainted 3.2.0-rc6+ #4 [35481.685744] Call Trace: [35481.685746] <IRQ> [<ffffffff810467ed>] ? warn_slowpath_common+0x78/0x8c [35481.685849] [<ffffffff81046899>] ? warn_slowpath_fmt+0x45/0x4a [35481.685875] [<ffffffff810aeaa0>] ? perf_event_task_tick+0x166/0x1ab [35481.686018] [<ffffffff81294219>] ? netif_tx_lock+0x40/0x72 [35481.686090] [<ffffffff8129437a>] ? dev_watchdog+0xe9/0x148 [35481.686136] [<ffffffff81051e58>] ? run_timer_softirq+0x19a/0x261 [35481.686176] [<ffffffff81294291>] ? netif_tx_unlock+0x46/0x46 [35481.686215] [<ffffffff810659bb>] ? timekeeping_get_ns+0xd/0x2a [35481.686286] [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x177 [35481.686365] [<ffffffff81341d6c>] ? call_softirq+0x1c/0x30 [35481.686530] [<ffffffff8100f841>] ? do_softirq+0x3c/0x7b [35481.686580] [<ffffffff8104c03c>] ? irq_exit+0x3c/0x9a [35481.686742] [<ffffffff81023e58>] ? smp_apic_timer_interrupt+0x74/0x82 [35481.686820] [<ffffffff813405de>] ? 
apic_timer_interrupt+0x6e/0x80 [35481.686826] <EOI> [<ffffffff811ddf49>] ? intel_idle+0xea/0x119 [35481.686991] [<ffffffff811ddf28>] ? intel_idle+0xc9/0x119 [35481.687051] [<ffffffff8125dce3>] ? cpuidle_idle_call+0xec/0x179 [35481.687089] [<ffffffff8100d255>] ? cpu_idle+0xa1/0xe8 [35481.687143] [<ffffffff810706ee>] ? arch_local_irq_restore+0x2/0x8 [35481.687189] [<ffffffff8132d191>] ? start_secondary+0x1d5/0x1db [35481.687234] ---[ end trace 01e9907674757948 ]--- [35481.687817] e1000e 0000:05:00.0: eth2: Reset adapter To try to regain connectivity I bring down the bond and the interface (eth2), then unload e1000e. Upon loading the module again: [36021.888962] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k [36021.900258] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. [36021.911446] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20 [36021.923204] e1000e 0000:00:19.0: setting latency timer to 64 [36021.923372] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X [36022.202737] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:75 [36022.214480] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network Connection [36022.227506] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF [36022.239789] e1000e 0000:05:00.0: Disabling ASPM L0s [36022.239805] e1000e 0000:05:00.0: enabling device (0000 -> 0002) [36022.239829] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 [36022.239921] e1000e 0000:05:00.0: setting latency timer to 64 [36022.240963] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X [36022.240995] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X [36022.241028] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X [36022.241596] e1000e 0000:05:00.0: PCI INT A disabled [36022.241606] e1000e: probe of 0000:05:00.0 failed with error -2 [36022.304706] udevd[3634]: renamed network interface eth2 to eth3 I then don't get an eth2 interface. Only a reboot brings the interface back. 
This has happened twice so far on this server in the past week, both times using v3.2-rc7-3-g4962516.

lspci -vnn shows:

05:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
        Subsystem: Super Micro Computer Inc Device [15d9:0000]
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at fbd00000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at e000 [size=32]
        Memory at fbd20000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [e0] Express Endpoint, MSI 00
        Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 00-25-90-ff-ff-56-ac-74
        Kernel driver in use: e1000e

Thanks,
Chris

--
Chris Boot
bootc@bootc.net
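For anyone wanting to spot this failure in their own logs, here is a minimal sketch. The helper name `watchdog_ifaces` is hypothetical and not part of any tool mentioned in this thread; it only assumes the NETDEV WATCHDOG message format quoted in the report above, and prints the interface, driver, and queue number for each timeout it finds.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print
# "<interface> <driver> <queue>" for every TX watchdog timeout.
# Assumes the message format seen in the report above.
watchdog_ifaces() {
    sed -n 's/.*NETDEV WATCHDOG: \([^ ]*\) (\([^)]*\)): transmit queue \([0-9]*\) timed out.*/\1 \2 \3/p'
}

# Example: feed it one line copied from the report.
echo '[35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out' | watchdog_ifaces
```

On a live system the same function could be fed from `dmesg | watchdog_ifaces`; lines that do not match are silently dropped.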
* Re: e1000e interface hang on 82574L 2011-12-27 22:01 e1000e interface hang on 82574L Chris Boot @ 2011-12-27 22:33 ` Dave Taht 2011-12-31 9:31 ` Chris Boot 1 sibling, 0 replies; 32+ messages in thread From: Dave Taht @ 2011-12-27 22:33 UTC (permalink / raw) To: Chris Boot; +Cc: netdev, lkml On Tue, Dec 27, 2011 at 11:01 PM, Chris Boot <bootc@bootc.net> wrote: > Hi folks, > > Another networking issue I've run into, this time with e1000e (Intel > Corporation 82574L Gigabit). My new VM cluster appears to drop a NIC - the > port stops responding within Linux and shows the link as being down with > ethtool. My ISP says 'Ports running Half Duplex or reduced speed' on the > port. > > When the port stops working I see this in dmesg: > > [35481.659629] ------------[ cut here ]------------ > [35481.667837] WARNING: at net/sched/sch_generic.c:255 > dev_watchdog+0xe9/0x148() > [35481.676370] Hardware name: X9SCL/X9SCM > [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out > [35481.684795] Modules linked in: hmac sha256_generic dlm configfs > ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats cpufreq_conservative > cpufreq_userspace cpufreq_powersave microcode xt_NOTRACK ip_set_hash_net > act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb > sch_hfsc sch_ingress sch_sfq xt_connlimit xt_realm xt_addrtype > ip_set_hash_ip iptable_raw xt_comment xt_recent ipt_ULOG ipt_REJECT > ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah > nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp > nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp ip6_queue nf_nat_amanda > xt_set ip_set nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane > nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp > nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns > nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp > ts_kmp nf_conntrack_amanda xt_TPROXY xt_NFLOG 
nfnetlink_log nf_tproxy_core > xt_time xt_TCPMSS xt_tcpmss xt_sctp xt_policy xt_pkttype xt_physdev xt_owner > xt_NFQUEUE xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange > xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_connmark xt_CLASSIFY > xt_AUDIT ip6t_LOG ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack > ip6table_raw ipt_LOG xt_tcpudp ip6table_mangle xt_state iptable_nat nf_nat > nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink > iptable_filter ip_tables ip6table_filter ip6_tables x_tables bridge stp > bonding w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel > aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf ipmi_si > ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache cn loop > kvm_intel kvm snd_pcm snd_timer snd iTCO_wdt soundcore psmouse > snd_page_alloc i2c_i801 i2c_core cdc_acm iTCO_vendor_support joydev evdev > serio_raw processor button pcspkr thermal_sys ext4 mbcache jbd2 crc16 dm_mod > raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid hid ahci libahci > libata igb ehci_hcd scsi_mod usbcore e1000e dca usb_common [last unloaded: > scsi_wait_scan] > [35481.685740] Pid: 0, comm: swapper/4 Not tainted 3.2.0-rc6+ #4 > [35481.685744] Call Trace: > [35481.685746] <IRQ> [<ffffffff810467ed>] ? warn_slowpath_common+0x78/0x8c > [35481.685849] [<ffffffff81046899>] ? warn_slowpath_fmt+0x45/0x4a > [35481.685875] [<ffffffff810aeaa0>] ? perf_event_task_tick+0x166/0x1ab > [35481.686018] [<ffffffff81294219>] ? netif_tx_lock+0x40/0x72 > [35481.686090] [<ffffffff8129437a>] ? dev_watchdog+0xe9/0x148 > [35481.686136] [<ffffffff81051e58>] ? run_timer_softirq+0x19a/0x261 > [35481.686176] [<ffffffff81294291>] ? netif_tx_unlock+0x46/0x46 > [35481.686215] [<ffffffff810659bb>] ? timekeeping_get_ns+0xd/0x2a > [35481.686286] [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x177 > [35481.686365] [<ffffffff81341d6c>] ? call_softirq+0x1c/0x30 > [35481.686530] [<ffffffff8100f841>] ? 
do_softirq+0x3c/0x7b > [35481.686580] [<ffffffff8104c03c>] ? irq_exit+0x3c/0x9a > [35481.686742] [<ffffffff81023e58>] ? smp_apic_timer_interrupt+0x74/0x82 > [35481.686820] [<ffffffff813405de>] ? apic_timer_interrupt+0x6e/0x80 > [35481.686826] <EOI> [<ffffffff811ddf49>] ? intel_idle+0xea/0x119 > [35481.686991] [<ffffffff811ddf28>] ? intel_idle+0xc9/0x119 > [35481.687051] [<ffffffff8125dce3>] ? cpuidle_idle_call+0xec/0x179 > [35481.687089] [<ffffffff8100d255>] ? cpu_idle+0xa1/0xe8 > [35481.687143] [<ffffffff810706ee>] ? arch_local_irq_restore+0x2/0x8 > [35481.687189] [<ffffffff8132d191>] ? start_secondary+0x1d5/0x1db > [35481.687234] ---[ end trace 01e9907674757948 ]--- > [35481.687817] e1000e 0000:05:00.0: eth2: Reset adapter > > To try to regain connectivity I bring down the bond and the interface > (eth2), then unload e1000e. Upon loading the module again: > > [36021.888962] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k > [36021.900258] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. 
> [36021.911446] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ > 20 > [36021.923204] e1000e 0000:00:19.0: setting latency timer to 64 > [36021.923372] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X > [36022.202737] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) > 00:25:90:56:ac:75 > [36022.214480] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network > Connection > [36022.227506] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: > FFFFFF-0FF > [36022.239789] e1000e 0000:05:00.0: Disabling ASPM L0s > [36022.239805] e1000e 0000:05:00.0: enabling device (0000 -> 0002) > [36022.239829] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ > 16 > [36022.239921] e1000e 0000:05:00.0: setting latency timer to 64 > [36022.240963] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X > [36022.240995] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X > [36022.241028] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X > [36022.241596] e1000e 0000:05:00.0: PCI INT A disabled > [36022.241606] e1000e: probe of 0000:05:00.0 failed with error -2 > [36022.304706] udevd[3634]: renamed network interface eth2 to eth3 > > I then don't get an eth2 interface. Only a reboot brings the interface back. > This has happened twice so far on this server in the past week, both times > using v3.2-rc7-3-g4962516. 
>
> lspci -vnn shows:
>
> 05:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
>         Subsystem: Super Micro Computer Inc Device [15d9:0000]
>         Flags: bus master, fast devsel, latency 0, IRQ 16
>         Memory at fbd00000 (32-bit, non-prefetchable) [size=128K]
>         I/O ports at e000 [size=32]
>         Memory at fbd20000 (32-bit, non-prefetchable) [size=16K]
>         Capabilities: [c8] Power Management version 2
>         Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
>         Capabilities: [e0] Express Endpoint, MSI 00
>         Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
>         Capabilities: [100] Advanced Error Reporting
>         Capabilities: [140] Device Serial Number 00-25-90-ff-ff-56-ac-74
>         Kernel driver in use: e1000e
>
> Thanks,
> Chris
>
> --
> Chris Boot
> bootc@bootc.net

I too am experiencing problems with the e1000e.

It takes hours to happen, sometimes days, under a sustained, heavy load (10 iperfs, 1 netperf RR, ping):

    while :
    do
        for i in `seq 1 10`
        do
            iperf -w254k -t 60 -c MY_SERVER &
        done
        netperf -H MY_SERVER -t TCP_RR &
        wait
        sleep 2
    done

but eventually ifconfig will show the e1000e receiving packets while none are transmitted.

I kill off the qdisc (tc qdisc del dev eth0 root) and sometimes it comes back (so I was assuming it was a problem with qfq) - but this morning I managed to get a full-on kernel panic from it and scribble it down.

This is with net-next as of c5e1fd8ccae09f574d6f978c90c2b968ee29030c - but I have been experiencing lockups since I started fiddling with BQL last month.

That said, I wouldn't consider my environment terribly normal, as I'm running with no tso, no gso, tx rings of 64, at 100Mbit, BQL's limit at 4500 bytes, and the QFQ qdisc, and I was willing to write it off to being too early to jump on net-next until now.

The super duper new fair QFQ based shaping script I've been testing is at:

https://github.com/dtaht/deBloat/blob/master/src/staqfq.lua

and my scribbled down morning's panic was:

m@cruithne:~$ more trace.txt
__schedule_bug
_shedule
atomic_notifier_call_chain
__cond_resched
_cond_resched
__kmalloc
[drm_ks_helper] [drm_kms_help]
drm_crtc_helper_set_config
drm_fb_helper_restore_fb_mode
drm_fb_helper_force_kernel_mode
drm_fb_helper_panic
notifier_call_chain
atomic_notifier_call_chain
panic
oops_end
no_context
__bad_area_nosemeaphore
_do_page_Fault
? T something
qfq_deactivate_class
qfq_deactivate_class
qfq_reset_qdisc
qdisc_reset
dev_deactivate_queue
dev_deativate_many
qdic_graft
tc_get_qdisc
zone_statistics
rtnetlink_rcv_msg
rtnetlink_rcv
netlink_rcu_sb
rtnetlink_rcv
netlink_unicast
netlink_sendmsg
sock_sendmsg
unlock_page
__do_fault
move_addr_to_kernel
verify_iovec
__sys_sendmsg
handle_mm_fault
do_page_fault
sys_sendmsg
system_call_fastpath

--
Dave Täht
SKYPE: davetaht
http://www.bufferbloat.net
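The symptom described above (receive counters still climbing while nothing is transmitted) can be checked mechanically. The following is only a sketch under stated assumptions: the function names `is_tx_stalled` and `check_tx_stall` are hypothetical, and the counters are read from the standard `/sys/class/net/<dev>/statistics/` files rather than from ifconfig.

```shell
#!/bin/sh
# Hypothetical helper: pure comparison of two counter samples.
# Args: rx_before tx_before rx_after tx_after
# Succeeds when RX grew but TX did not move.
is_tx_stalled() {
    [ "$3" -gt "$1" ] && [ "$4" -eq "$2" ]
}

# Read one packet counter for an interface from sysfs.
read_counter() {
    cat "/sys/class/net/$1/statistics/$2"
}

# Hypothetical wrapper: sample an interface twice and report a stall.
# $1 = interface name, $2 = seconds between samples (default 5)
check_tx_stall() {
    dev=$1
    interval=${2:-5}
    rx1=$(read_counter "$dev" rx_packets)
    tx1=$(read_counter "$dev" tx_packets)
    sleep "$interval"
    rx2=$(read_counter "$dev" rx_packets)
    tx2=$(read_counter "$dev" tx_packets)
    if is_tx_stalled "$rx1" "$tx1" "$rx2" "$tx2"; then
        echo "$dev: RX advancing, TX stalled"
    else
        echo "$dev: no TX stall detected"
    fi
}
```

A call such as `check_tx_stall eth0 10` only means something while the load generator is still running, since an interface with nothing to send would look the same.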
* Re: e1000e interface hang on 82574L 2011-12-27 22:01 e1000e interface hang on 82574L Chris Boot 2011-12-27 22:33 ` Dave Taht @ 2011-12-31 9:31 ` Chris Boot 2012-01-03 0:02 ` Wyborny, Carolyn 1 sibling, 1 reply; 32+ messages in thread From: Chris Boot @ 2011-12-31 9:31 UTC (permalink / raw) To: netdev, lkml, e1000-devel On 27 Dec 2011, at 22:01, Chris Boot wrote: > Hi folks, > > Another networking issue I've run into, this time with e1000e (Intel Corporation 82574L Gigabit). My new VM cluster appears to drop a NIC - the port stops responding within Linux and shows the link as being down with ethtool. My ISP says 'Ports running Half Duplex or reduced speed' on the port. > > When the port stops working I see this in dmesg: > > [35481.659629] ------------[ cut here ]------------ > [35481.667837] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0xe9/0x148() > [35481.676370] Hardware name: X9SCL/X9SCM > [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out > [35481.684795] Modules linked in: hmac sha256_generic dlm configfs ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode xt_NOTRACK ip_set_hash_net act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_connlimit xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp ip6_queue nf_nat_amanda xt_set ip_set nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda xt_TPROXY xt_NFLOG nfnetlink_log nf_tproxy_core xt_time 
xt_TCPMSS xt_tcpmss xt_sctp xt_policy xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT ip6t_LOG ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_raw ipt_LOG xt_tcpudp ip6table_mangle xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables ip6table_filter ip6_tables x_tables bridge stp bonding w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache cn loop kvm_intel kvm snd_pcm snd_timer snd iTCO_wdt soundcore psmouse snd_page_alloc i2c_i801 i2c_core cdc_acm iTCO_vendor_support joydev evdev serio_raw processor button pcspkr thermal_sys ext4 mbcache jbd2 crc16 dm_mod raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid hid ahci libahci libata igb ehci_hcd scsi_mod usbcore e1000e dca usb_common [last unloaded: scsi_wait_scan] > [35481.685740] Pid: 0, comm: swapper/4 Not tainted 3.2.0-rc6+ #4 > [35481.685744] Call Trace: > [35481.685746] <IRQ> [<ffffffff810467ed>] ? warn_slowpath_common+0x78/0x8c > [35481.685849] [<ffffffff81046899>] ? warn_slowpath_fmt+0x45/0x4a > [35481.685875] [<ffffffff810aeaa0>] ? perf_event_task_tick+0x166/0x1ab > [35481.686018] [<ffffffff81294219>] ? netif_tx_lock+0x40/0x72 > [35481.686090] [<ffffffff8129437a>] ? dev_watchdog+0xe9/0x148 > [35481.686136] [<ffffffff81051e58>] ? run_timer_softirq+0x19a/0x261 > [35481.686176] [<ffffffff81294291>] ? netif_tx_unlock+0x46/0x46 > [35481.686215] [<ffffffff810659bb>] ? timekeeping_get_ns+0xd/0x2a > [35481.686286] [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x177 > [35481.686365] [<ffffffff81341d6c>] ? call_softirq+0x1c/0x30 > [35481.686530] [<ffffffff8100f841>] ? do_softirq+0x3c/0x7b > [35481.686580] [<ffffffff8104c03c>] ? 
irq_exit+0x3c/0x9a > [35481.686742] [<ffffffff81023e58>] ? smp_apic_timer_interrupt+0x74/0x82 > [35481.686820] [<ffffffff813405de>] ? apic_timer_interrupt+0x6e/0x80 > [35481.686826] <EOI> [<ffffffff811ddf49>] ? intel_idle+0xea/0x119 > [35481.686991] [<ffffffff811ddf28>] ? intel_idle+0xc9/0x119 > [35481.687051] [<ffffffff8125dce3>] ? cpuidle_idle_call+0xec/0x179 > [35481.687089] [<ffffffff8100d255>] ? cpu_idle+0xa1/0xe8 > [35481.687143] [<ffffffff810706ee>] ? arch_local_irq_restore+0x2/0x8 > [35481.687189] [<ffffffff8132d191>] ? start_secondary+0x1d5/0x1db > [35481.687234] ---[ end trace 01e9907674757948 ]--- > [35481.687817] e1000e 0000:05:00.0: eth2: Reset adapter > > To try to regain connectivity I bring down the bond and the interface (eth2), then unload e1000e. Upon loading the module again: > > [36021.888962] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k > [36021.900258] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. > [36021.911446] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20 > [36021.923204] e1000e 0000:00:19.0: setting latency timer to 64 > [36021.923372] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X > [36022.202737] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:75 > [36022.214480] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network Connection > [36022.227506] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF > [36022.239789] e1000e 0000:05:00.0: Disabling ASPM L0s > [36022.239805] e1000e 0000:05:00.0: enabling device (0000 -> 0002) > [36022.239829] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 > [36022.239921] e1000e 0000:05:00.0: setting latency timer to 64 > [36022.240963] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X > [36022.240995] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X > [36022.241028] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X > [36022.241596] e1000e 0000:05:00.0: PCI INT A disabled > [36022.241606] e1000e: probe of 0000:05:00.0 failed with error -2 > [36022.304706] 
udevd[3634]: renamed network interface eth2 to eth3
>
> I then don't get an eth2 interface. Only a reboot brings the interface back. This has happened twice so far on this server in the past week, both times using v3.2-rc7-3-g4962516.
>
> lspci -vnn shows:
>
> 05:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
>         Subsystem: Super Micro Computer Inc Device [15d9:0000]
>         Flags: bus master, fast devsel, latency 0, IRQ 16
>         Memory at fbd00000 (32-bit, non-prefetchable) [size=128K]
>         I/O ports at e000 [size=32]
>         Memory at fbd20000 (32-bit, non-prefetchable) [size=16K]
>         Capabilities: [c8] Power Management version 2
>         Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
>         Capabilities: [e0] Express Endpoint, MSI 00
>         Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
>         Capabilities: [100] Advanced Error Reporting
>         Capabilities: [140] Device Serial Number 00-25-90-ff-ff-56-ac-74
>         Kernel driver in use: e1000e

I've just had this happen on my other (identical) server with a nearly identical trace. Is there anything I can do to avoid this at all, or at least to help narrow down the problem?

Cheers,
Chris

--
Chris Boot
bootc@bootc.net
* RE: e1000e interface hang on 82574L 2011-12-31 9:31 ` Chris Boot @ 2012-01-03 0:02 ` Wyborny, Carolyn 2012-01-04 17:12 ` Chris Boot 0 siblings, 1 reply; 32+ messages in thread From: Wyborny, Carolyn @ 2012-01-03 0:02 UTC (permalink / raw) To: Chris Boot, netdev, lkml, e1000-devel >-----Original Message----- >From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] >On Behalf Of Chris Boot >Sent: Saturday, December 31, 2011 1:32 AM >To: netdev; lkml; e1000-devel@lists.sourceforge.net >Subject: Re: e1000e interface hang on 82574L > >On 27 Dec 2011, at 22:01, Chris Boot wrote: > >> Hi folks, >> >> Another networking issue I've run into, this time with e1000e (Intel >Corporation 82574L Gigabit). My new VM cluster appears to drop a NIC - >the port stops responding within Linux and shows the link as being down >with ethtool. My ISP says 'Ports running Half Duplex or reduced speed' >on the port. >> >> When the port stops working I see this in dmesg: >> >> [35481.659629] ------------[ cut here ]------------ >> [35481.667837] WARNING: at net/sched/sch_generic.c:255 >dev_watchdog+0xe9/0x148() >> [35481.676370] Hardware name: X9SCL/X9SCM >> [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed >out >> [35481.684795] Modules linked in: hmac sha256_generic dlm configfs >ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats >cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode >xt_NOTRACK ip_set_hash_net act_police cls_basic cls_flow cls_fw cls_u32 >sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_connlimit >xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent >ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN >ipt_ecn ipt_CLUSTERIP ipt_ah nf_nat_tftp nf_nat_snmp_basic >nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc >nf_nat_h323 nf_nat_ftp ip6_queue nf_nat_amanda xt_set ip_set >nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane >nf_conntrack_proto_udplite 
nf_conntrack_proto_sctp nf_conntrack_pptp >nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns >nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 >nf_conntrack_ftp ts_kmp nf_conntrack_amanda xt_TPROXY xt_NFLOG >nfnetlink_log nf_tproxy_core xt_time xt_TCPMSS xt_tcpmss xt_sctp >xt_policy xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_multiport xt_mark >xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP >xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT ip6t_LOG ip6t_REJECT >nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_raw ipt_LOG >xt_tcpudp ip6table_mangle xt_state iptable_nat nf_nat nf_conntrack_ipv4 >nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter >ip_tables ip6table_filter ip6_tables x_tables bridge stp bonding >w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel >aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf >ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache cn >loop kvm_intel kvm snd_pcm snd_timer snd iTCO_wdt soundcore psmouse >snd_page_alloc i2c_i801 i2c_core cdc_acm iTCO_vendor_support joydev >evdev serio_raw processor button pcspkr thermal_sys ext4 mbcache jbd2 >crc16 dm_mod raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid hid >ahci libahci libata igb ehci_hcd scsi_mod usbcore e1000e dca usb_common >[last unloaded: scsi_wait_scan] >> [35481.685740] Pid: 0, comm: swapper/4 Not tainted 3.2.0-rc6+ #4 >> [35481.685744] Call Trace: >> [35481.685746] <IRQ> [<ffffffff810467ed>] ? >warn_slowpath_common+0x78/0x8c >> [35481.685849] [<ffffffff81046899>] ? warn_slowpath_fmt+0x45/0x4a >> [35481.685875] [<ffffffff810aeaa0>] ? >perf_event_task_tick+0x166/0x1ab >> [35481.686018] [<ffffffff81294219>] ? netif_tx_lock+0x40/0x72 >> [35481.686090] [<ffffffff8129437a>] ? dev_watchdog+0xe9/0x148 >> [35481.686136] [<ffffffff81051e58>] ? run_timer_softirq+0x19a/0x261 >> [35481.686176] [<ffffffff81294291>] ? 
netif_tx_unlock+0x46/0x46 >> [35481.686215] [<ffffffff810659bb>] ? timekeeping_get_ns+0xd/0x2a >> [35481.686286] [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x177 >> [35481.686365] [<ffffffff81341d6c>] ? call_softirq+0x1c/0x30 >> [35481.686530] [<ffffffff8100f841>] ? do_softirq+0x3c/0x7b >> [35481.686580] [<ffffffff8104c03c>] ? irq_exit+0x3c/0x9a >> [35481.686742] [<ffffffff81023e58>] ? >smp_apic_timer_interrupt+0x74/0x82 >> [35481.686820] [<ffffffff813405de>] ? apic_timer_interrupt+0x6e/0x80 >> [35481.686826] <EOI> [<ffffffff811ddf49>] ? intel_idle+0xea/0x119 >> [35481.686991] [<ffffffff811ddf28>] ? intel_idle+0xc9/0x119 >> [35481.687051] [<ffffffff8125dce3>] ? cpuidle_idle_call+0xec/0x179 >> [35481.687089] [<ffffffff8100d255>] ? cpu_idle+0xa1/0xe8 >> [35481.687143] [<ffffffff810706ee>] ? arch_local_irq_restore+0x2/0x8 >> [35481.687189] [<ffffffff8132d191>] ? start_secondary+0x1d5/0x1db >> [35481.687234] ---[ end trace 01e9907674757948 ]--- >> [35481.687817] e1000e 0000:05:00.0: eth2: Reset adapter >> >> To try to regain connectivity I bring down the bond and the interface >(eth2), then unload e1000e. Upon loading the module again: >> >> [36021.888962] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k >> [36021.900258] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. 
>> [36021.911446] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) - >> IRQ 20 >> [36021.923204] e1000e 0000:00:19.0: setting latency timer to 64 >> [36021.923372] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X >> [36022.202737] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width >x1) 00:25:90:56:ac:75 >> [36022.214480] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network >Connection >> [36022.227506] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: >FFFFFF-0FF >> [36022.239789] e1000e 0000:05:00.0: Disabling ASPM L0s >> [36022.239805] e1000e 0000:05:00.0: enabling device (0000 -> 0002) >> [36022.239829] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) - >> IRQ 16 >> [36022.239921] e1000e 0000:05:00.0: setting latency timer to 64 >> [36022.240963] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X >> [36022.240995] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X >> [36022.241028] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X >> [36022.241596] e1000e 0000:05:00.0: PCI INT A disabled >> [36022.241606] e1000e: probe of 0000:05:00.0 failed with error -2 >> [36022.304706] udevd[3634]: renamed network interface eth2 to eth3 >> >> I then don't get an eth2 interface. Only a reboot brings the interface >back. This has happened twice so far on this server in the past week, >both times using v3.2-rc7-3-g4962516. 
>>
>> lspci -vnn shows:
>>
>> 05:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
>>         Subsystem: Super Micro Computer Inc Device [15d9:0000]
>>         Flags: bus master, fast devsel, latency 0, IRQ 16
>>         Memory at fbd00000 (32-bit, non-prefetchable) [size=128K]
>>         I/O ports at e000 [size=32]
>>         Memory at fbd20000 (32-bit, non-prefetchable) [size=16K]
>>         Capabilities: [c8] Power Management version 2
>>         Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
>>         Capabilities: [e0] Express Endpoint, MSI 00
>>         Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
>>         Capabilities: [100] Advanced Error Reporting
>>         Capabilities: [140] Device Serial Number 00-25-90-ff-ff-56-ac-74
>>         Kernel driver in use: e1000e
>
>I've just had this happen on my other (identical) server with a nearly
>identical trace. Is there anything I can do do avoid this at all or at
>least help narrow down the problem?
>
>Cheers,
>Chris
>
>--
>Chris Boot
>bootc@bootc.net

Hello,

Sorry for the delay in responding. We have seen some hang issues using MSI-X on 82574 parts. Can you try reloading the driver with the IntMode module parameter set to IntMode=1? You'll need a setting for each device in the system, so two adapters would be IntMode=1,1. See if that changes the symptom you are seeing with this part. That setting will make sure the adapter uses MSI interrupts instead of MSI-X.

Thanks,

Carolyn

Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation
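The suggestion above can be sketched as a pair of reload commands. This is an illustration only: the helper `intmode_arg` is hypothetical, the port count of two is an assumption taken from the two e1000e devices (0000:00:19.0 and 0000:05:00.0) visible in the probe log, and the script prints the commands instead of running them so that nothing is unloaded by accident.

```shell
#!/bin/sh
# Hypothetical helper: build the IntMode argument for N e1000e ports.
# $1 = number of ports, $2 = mode (1 = MSI, per the message above)
intmode_arg() {
    count=$1
    mode=$2
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Append the mode, adding a comma only between entries.
        list="${list:+$list,}$mode"
        i=$((i + 1))
    done
    echo "IntMode=$list"
}

# Print (rather than run) the reload commands for a two-port system;
# run them as root once the bond and interfaces are down.
echo "modprobe -r e1000e"
echo "modprobe e1000e $(intmode_arg 2 1)"
```

After reloading, the `MSI: Enable` and `MSI-X: Enable` flags in the `lspci -vnn` capabilities lines, like those quoted in the report, should indicate which interrupt mode the adapter ended up in.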
* Re: e1000e interface hang on 82574L 2012-01-03 0:02 ` Wyborny, Carolyn @ 2012-01-04 17:12 ` Chris Boot 2012-01-15 11:10 ` Chris Boot 0 siblings, 1 reply; 32+ messages in thread From: Chris Boot @ 2012-01-04 17:12 UTC (permalink / raw) To: Wyborny, Carolyn; +Cc: netdev, lkml, e1000-devel On 03/01/2012 00:02, Wyborny, Carolyn wrote: > > >> -----Original Message----- >> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] >> On Behalf Of Chris Boot >> Sent: Saturday, December 31, 2011 1:32 AM >> To: netdev; lkml; e1000-devel@lists.sourceforge.net >> Subject: Re: e1000e interface hang on 82574L >> >> On 27 Dec 2011, at 22:01, Chris Boot wrote: >> >>> Hi folks, >>> >>> Another networking issue I've run into, this time with e1000e (Intel >> Corporation 82574L Gigabit). My new VM cluster appears to drop a NIC - >> the port stops responding within Linux and shows the link as being down >> with ethtool. My ISP says 'Ports running Half Duplex or reduced speed' >> on the port. 
>>> >>> When the port stops working I see this in dmesg: >>> >>> [35481.659629] ------------[ cut here ]------------ >>> [35481.667837] WARNING: at net/sched/sch_generic.c:255 >> dev_watchdog+0xe9/0x148() >>> [35481.676370] Hardware name: X9SCL/X9SCM >>> [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed >> out >>> [35481.684795] Modules linked in: hmac sha256_generic dlm configfs >> ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats >> cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode >> xt_NOTRACK ip_set_hash_net act_police cls_basic cls_flow cls_fw cls_u32 >> sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_connlimit >> xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent >> ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN >> ipt_ecn ipt_CLUSTERIP ipt_ah nf_nat_tftp nf_nat_snmp_basic >> nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc >> nf_nat_h323 nf_nat_ftp ip6_queue nf_nat_amanda xt_set ip_set >> nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane >> nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp >> nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns >> nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 >> nf_conntrack_ftp ts_kmp nf_conntrack_amanda xt_TPROXY xt_NFLOG >> nfnetlink_log nf_tproxy_core xt_time xt_TCPMSS xt_tcpmss xt_sctp >> xt_policy xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_multiport xt_mark >> xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP >> xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT ip6t_LOG ip6t_REJECT >> nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_raw ipt_LOG >> xt_tcpudp ip6table_mangle xt_state iptable_nat nf_nat nf_conntrack_ipv4 >> nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter >> ip_tables ip6table_filter ip6_tables x_tables bridge stp bonding >> w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel >> aesni_intel 
cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf >> ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache cn >> loop kvm_intel kvm snd_pcm snd_timer snd iTCO_wdt soundcore psmouse >> snd_page_alloc i2c_i801 i2c_core cdc_acm iTCO_vendor_support joydev >> evdev serio_raw processor button pcspkr thermal_sys ext4 mbcache jbd2 >> crc16 dm_mod raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid hid >> ahci libahci libata igb ehci_hcd scsi_mod usbcore e1000e dca usb_common >> [last unloaded: scsi_wait_scan] >>> [35481.685740] Pid: 0, comm: swapper/4 Not tainted 3.2.0-rc6+ #4 >>> [35481.685744] Call Trace: >>> [35481.685746]<IRQ> [<ffffffff810467ed>] ? >> warn_slowpath_common+0x78/0x8c >>> [35481.685849] [<ffffffff81046899>] ? warn_slowpath_fmt+0x45/0x4a >>> [35481.685875] [<ffffffff810aeaa0>] ? >> perf_event_task_tick+0x166/0x1ab >>> [35481.686018] [<ffffffff81294219>] ? netif_tx_lock+0x40/0x72 >>> [35481.686090] [<ffffffff8129437a>] ? dev_watchdog+0xe9/0x148 >>> [35481.686136] [<ffffffff81051e58>] ? run_timer_softirq+0x19a/0x261 >>> [35481.686176] [<ffffffff81294291>] ? netif_tx_unlock+0x46/0x46 >>> [35481.686215] [<ffffffff810659bb>] ? timekeeping_get_ns+0xd/0x2a >>> [35481.686286] [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x177 >>> [35481.686365] [<ffffffff81341d6c>] ? call_softirq+0x1c/0x30 >>> [35481.686530] [<ffffffff8100f841>] ? do_softirq+0x3c/0x7b >>> [35481.686580] [<ffffffff8104c03c>] ? irq_exit+0x3c/0x9a >>> [35481.686742] [<ffffffff81023e58>] ? >> smp_apic_timer_interrupt+0x74/0x82 >>> [35481.686820] [<ffffffff813405de>] ? apic_timer_interrupt+0x6e/0x80 >>> [35481.686826]<EOI> [<ffffffff811ddf49>] ? intel_idle+0xea/0x119 >>> [35481.686991] [<ffffffff811ddf28>] ? intel_idle+0xc9/0x119 >>> [35481.687051] [<ffffffff8125dce3>] ? cpuidle_idle_call+0xec/0x179 >>> [35481.687089] [<ffffffff8100d255>] ? cpu_idle+0xa1/0xe8 >>> [35481.687143] [<ffffffff810706ee>] ? arch_local_irq_restore+0x2/0x8 >>> [35481.687189] [<ffffffff8132d191>] ? 
start_secondary+0x1d5/0x1db >>> [35481.687234] ---[ end trace 01e9907674757948 ]--- >>> [35481.687817] e1000e 0000:05:00.0: eth2: Reset adapter >>> >>> To try to regain connectivity I bring down the bond and the interface >> (eth2), then unload e1000e. Upon loading the module again: >>> >>> [36021.888962] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k >>> [36021.900258] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. >>> [36021.911446] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) - >>> IRQ 20 >>> [36021.923204] e1000e 0000:00:19.0: setting latency timer to 64 >>> [36021.923372] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X >>> [36022.202737] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width >> x1) 00:25:90:56:ac:75 >>> [36022.214480] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network >> Connection >>> [36022.227506] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: >> FFFFFF-0FF >>> [36022.239789] e1000e 0000:05:00.0: Disabling ASPM L0s >>> [36022.239805] e1000e 0000:05:00.0: enabling device (0000 -> 0002) >>> [36022.239829] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) - >>> IRQ 16 >>> [36022.239921] e1000e 0000:05:00.0: setting latency timer to 64 >>> [36022.240963] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X >>> [36022.240995] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X >>> [36022.241028] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X >>> [36022.241596] e1000e 0000:05:00.0: PCI INT A disabled >>> [36022.241606] e1000e: probe of 0000:05:00.0 failed with error -2 >>> [36022.304706] udevd[3634]: renamed network interface eth2 to eth3 >>> >>> I then don't get an eth2 interface. Only a reboot brings the interface >> back. This has happened twice so far on this server in the past week, >> both times using v3.2-rc7-3-g4962516. 
>>> >>> lspci -vnn shows: >>> >>> 05:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit >> Network Connection [8086:10d3] >>> Subsystem: Super Micro Computer Inc Device [15d9:0000] >>> Flags: bus master, fast devsel, latency 0, IRQ 16 >>> Memory at fbd00000 (32-bit, non-prefetchable) [size=128K] >>> I/O ports at e000 [size=32] >>> Memory at fbd20000 (32-bit, non-prefetchable) [size=16K] >>> Capabilities: [c8] Power Management version 2 >>> Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ >>> Capabilities: [e0] Express Endpoint, MSI 00 >>> Capabilities: [a0] MSI-X: Enable+ Count=5 Masked- >>> Capabilities: [100] Advanced Error Reporting >>> Capabilities: [140] Device Serial Number 00-25-90-ff-ff-56-ac- >> 74 >>> Kernel driver in use: e1000e >> >> I've just had this happen on my other (identical) server with a nearly >> identical trace. Is there anything I can do do avoid this at all or at >> least help narrow down the problem? >> >> Cheers, >> Chris >> >> -- >> Chris Boot >> bootc@bootc.net >> >> -- >> To unsubscribe from this list: send the line "unsubscribe netdev" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > Hello, > > Sorry for the delay in responding. We have seen some hang issues using MSI-X on 82574 parts. Can you try reloading the driver the IntMode module parameter. IntMode=1 (you'll need a setting for each device in the system so two adapters would be IntMode=1,1) See if that changes the symptom you are seeing with this part. That setting will make sure the adapter uses MSI interrupts instead of MSI-X. Carolyn, I'll give this a go next time I reproduce it. I built a new kernel with more debugging and so far it hasn't yet triggered again... Chris -- Chris Boot bootc@bootc.net ^ permalink raw reply [flat|nested] 32+ messages in thread
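While waiting for the hang to reproduce, a small script can watch the kernel log for the watchdog message quoted in this thread. The following is a sketch under the assumption that the log text has the same "[timestamp] NETDEV WATCHDOG: ..." layout as the dmesg output quoted above; the function name is hypothetical and how the log text is obtained (dmesg, a log file) is left open.

```python
import re

# Matches lines such as:
# [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: (?P<iface>\S+) "
    r"\((?P<driver>[^)]+)\): transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(dmesg_text: str) -> list:
    """Return (timestamp, interface, driver, queue) for each TX timeout found."""
    events = []
    for m in WATCHDOG_RE.finditer(dmesg_text):
        events.append(
            (float(m.group("ts")), m.group("iface"), m.group("driver"), int(m.group("queue")))
        )
    return events


SAMPLE = (
    "[35481.676370] Hardware name: X9SCL/X9SCM\n"
    "[35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out\n"
)

if __name__ == "__main__":
    for event in find_tx_timeouts(SAMPLE):
        print(event)
```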
* Re: e1000e interface hang on 82574L 2012-01-04 17:12 ` Chris Boot @ 2012-01-15 11:10 ` Chris Boot 2012-01-16 15:56 ` Wyborny, Carolyn 0 siblings, 1 reply; 32+ messages in thread From: Chris Boot @ 2012-01-15 11:10 UTC (permalink / raw) To: Wyborny, Carolyn; +Cc: netdev, lkml, e1000-devel On 04/01/2012 17:12, Chris Boot wrote: > On 03/01/2012 00:02, Wyborny, Carolyn wrote: >> >> >>> -----Original Message----- >>> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] >>> On Behalf Of Chris Boot >>> Sent: Saturday, December 31, 2011 1:32 AM >>> To: netdev; lkml; e1000-devel@lists.sourceforge.net >>> Subject: Re: e1000e interface hang on 82574L >>> >>> On 27 Dec 2011, at 22:01, Chris Boot wrote: >>> >>>> Hi folks, >>>> >>>> Another networking issue I've run into, this time with e1000e (Intel >>> Corporation 82574L Gigabit). My new VM cluster appears to drop a NIC - >>> the port stops responding within Linux and shows the link as being down >>> with ethtool. My ISP says 'Ports running Half Duplex or reduced speed' >>> on the port. 
>>>> >>>> When the port stops working I see this in dmesg: >>>> >>>> [35481.659629] ------------[ cut here ]------------ >>>> [35481.667837] WARNING: at net/sched/sch_generic.c:255 >>> dev_watchdog+0xe9/0x148() >>>> [35481.676370] Hardware name: X9SCL/X9SCM >>>> [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed >>> out >>>> [35481.684795] Modules linked in: hmac sha256_generic dlm configfs >>> ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats >>> cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode >>> xt_NOTRACK ip_set_hash_net act_police cls_basic cls_flow cls_fw cls_u32 >>> sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_connlimit >>> xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent >>> ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN >>> ipt_ecn ipt_CLUSTERIP ipt_ah nf_nat_tftp nf_nat_snmp_basic >>> nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc >>> nf_nat_h323 nf_nat_ftp ip6_queue nf_nat_amanda xt_set ip_set >>> nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane >>> nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp >>> nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns >>> nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 >>> nf_conntrack_ftp ts_kmp nf_conntrack_amanda xt_TPROXY xt_NFLOG >>> nfnetlink_log nf_tproxy_core xt_time xt_TCPMSS xt_tcpmss xt_sctp >>> xt_policy xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_multiport xt_mark >>> xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP >>> xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT ip6t_LOG ip6t_REJECT >>> nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_raw ipt_LOG >>> xt_tcpudp ip6table_mangle xt_state iptable_nat nf_nat nf_conntrack_ipv4 >>> nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter >>> ip_tables ip6table_filter ip6_tables x_tables bridge stp bonding >>> w83627ehf hwmon_vid coretemp sha1_ssse3 
sha1_generic crc32c_intel >>> aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf >>> ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache cn >>> loop kvm_intel kvm snd_pcm snd_timer snd iTCO_wdt soundcore psmouse >>> snd_page_alloc i2c_i801 i2c_core cdc_acm iTCO_vendor_support joydev >>> evdev serio_raw processor button pcspkr thermal_sys ext4 mbcache jbd2 >>> crc16 dm_mod raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid hid >>> ahci libahci libata igb ehci_hcd scsi_mod usbcore e1000e dca usb_common >>> [last unloaded: scsi_wait_scan] >>>> [35481.685740] Pid: 0, comm: swapper/4 Not tainted 3.2.0-rc6+ #4 >>>> [35481.685744] Call Trace: >>>> [35481.685746]<IRQ> [<ffffffff810467ed>] ? >>> warn_slowpath_common+0x78/0x8c >>>> [35481.685849] [<ffffffff81046899>] ? warn_slowpath_fmt+0x45/0x4a >>>> [35481.685875] [<ffffffff810aeaa0>] ? >>> perf_event_task_tick+0x166/0x1ab >>>> [35481.686018] [<ffffffff81294219>] ? netif_tx_lock+0x40/0x72 >>>> [35481.686090] [<ffffffff8129437a>] ? dev_watchdog+0xe9/0x148 >>>> [35481.686136] [<ffffffff81051e58>] ? run_timer_softirq+0x19a/0x261 >>>> [35481.686176] [<ffffffff81294291>] ? netif_tx_unlock+0x46/0x46 >>>> [35481.686215] [<ffffffff810659bb>] ? timekeeping_get_ns+0xd/0x2a >>>> [35481.686286] [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x177 >>>> [35481.686365] [<ffffffff81341d6c>] ? call_softirq+0x1c/0x30 >>>> [35481.686530] [<ffffffff8100f841>] ? do_softirq+0x3c/0x7b >>>> [35481.686580] [<ffffffff8104c03c>] ? irq_exit+0x3c/0x9a >>>> [35481.686742] [<ffffffff81023e58>] ? >>> smp_apic_timer_interrupt+0x74/0x82 >>>> [35481.686820] [<ffffffff813405de>] ? apic_timer_interrupt+0x6e/0x80 >>>> [35481.686826]<EOI> [<ffffffff811ddf49>] ? intel_idle+0xea/0x119 >>>> [35481.686991] [<ffffffff811ddf28>] ? intel_idle+0xc9/0x119 >>>> [35481.687051] [<ffffffff8125dce3>] ? cpuidle_idle_call+0xec/0x179 >>>> [35481.687089] [<ffffffff8100d255>] ? cpu_idle+0xa1/0xe8 >>>> [35481.687143] [<ffffffff810706ee>] ? 
arch_local_irq_restore+0x2/0x8 >>>> [35481.687189] [<ffffffff8132d191>] ? start_secondary+0x1d5/0x1db >>>> [35481.687234] ---[ end trace 01e9907674757948 ]--- >>>> [35481.687817] e1000e 0000:05:00.0: eth2: Reset adapter >>>> >>>> To try to regain connectivity I bring down the bond and the interface >>> (eth2), then unload e1000e. Upon loading the module again: >>>> >>>> [36021.888962] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k >>>> [36021.900258] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. >>>> [36021.911446] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) - >>>> IRQ 20 >>>> [36021.923204] e1000e 0000:00:19.0: setting latency timer to 64 >>>> [36021.923372] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X >>>> [36022.202737] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width >>> x1) 00:25:90:56:ac:75 >>>> [36022.214480] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network >>> Connection >>>> [36022.227506] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: >>> FFFFFF-0FF >>>> [36022.239789] e1000e 0000:05:00.0: Disabling ASPM L0s >>>> [36022.239805] e1000e 0000:05:00.0: enabling device (0000 -> 0002) >>>> [36022.239829] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) - >>>> IRQ 16 >>>> [36022.239921] e1000e 0000:05:00.0: setting latency timer to 64 >>>> [36022.240963] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X >>>> [36022.240995] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X >>>> [36022.241028] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X >>>> [36022.241596] e1000e 0000:05:00.0: PCI INT A disabled >>>> [36022.241606] e1000e: probe of 0000:05:00.0 failed with error -2 >>>> [36022.304706] udevd[3634]: renamed network interface eth2 to eth3 >>>> >>>> I then don't get an eth2 interface. Only a reboot brings the interface >>> back. This has happened twice so far on this server in the past week, >>> both times using v3.2-rc7-3-g4962516. 
>>>> >>>> lspci -vnn shows: >>>> >>>> 05:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit >>> Network Connection [8086:10d3] >>>> Subsystem: Super Micro Computer Inc Device [15d9:0000] >>>> Flags: bus master, fast devsel, latency 0, IRQ 16 >>>> Memory at fbd00000 (32-bit, non-prefetchable) [size=128K] >>>> I/O ports at e000 [size=32] >>>> Memory at fbd20000 (32-bit, non-prefetchable) [size=16K] >>>> Capabilities: [c8] Power Management version 2 >>>> Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ >>>> Capabilities: [e0] Express Endpoint, MSI 00 >>>> Capabilities: [a0] MSI-X: Enable+ Count=5 Masked- >>>> Capabilities: [100] Advanced Error Reporting >>>> Capabilities: [140] Device Serial Number 00-25-90-ff-ff-56-ac- >>> 74 >>>> Kernel driver in use: e1000e >>> >>> I've just had this happen on my other (identical) server with a nearly >>> identical trace. Is there anything I can do do avoid this at all or at >>> least help narrow down the problem? >>> >>> Cheers, >>> Chris >>> >>> -- >>> Chris Boot >>> bootc@bootc.net >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe netdev" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> Hello, >> >> Sorry for the delay in responding. We have seen some hang issues using >> MSI-X on 82574 parts. Can you try reloading the driver the IntMode >> module parameter. IntMode=1 (you'll need a setting for each device in >> the system so two adapters would be IntMode=1,1) See if that changes >> the symptom you are seeing with this part. That setting will make sure >> the adapter uses MSI interrupts instead of MSI-X. > > Carolyn, > > I'll give this a go next time I reproduce it. I built a new kernel with > more debugging and so far it hasn't yet triggered again... Upgrading to a more recent 3.2-rc snapshot seems to have cured the problem - I haven't had an interface stop responding since. 
Must have been some seemingly unrelated patch that I can't seem to locate. Cheers, Chris -- Chris Boot bootc@bootc.net ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: e1000e interface hang on 82574L 2012-01-15 11:10 ` Chris Boot @ 2012-01-16 15:56 ` Wyborny, Carolyn 2012-01-16 16:04 ` Chris Boot 0 siblings, 1 reply; 32+ messages in thread From: Wyborny, Carolyn @ 2012-01-16 15:56 UTC (permalink / raw) To: Chris Boot; +Cc: netdev, lkml, e1000-devel [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset="utf-8", Size: 9523 bytes --] >-----Original Message----- >From: Chris Boot [mailto:bootc@bootc.net] >Sent: Sunday, January 15, 2012 3:11 AM >To: Wyborny, Carolyn >Cc: netdev; lkml; e1000-devel@lists.sourceforge.net >Subject: Re: e1000e interface hang on 82574L > >On 04/01/2012 17:12, Chris Boot wrote: >> On 03/01/2012 00:02, Wyborny, Carolyn wrote: >>> >>> >>>> -----Original Message----- >>>> From: netdev-owner@vger.kernel.org [mailto:netdev- >owner@vger.kernel.org] >>>> On Behalf Of Chris Boot >>>> Sent: Saturday, December 31, 2011 1:32 AM >>>> To: netdev; lkml; e1000-devel@lists.sourceforge.net >>>> Subject: Re: e1000e interface hang on 82574L >>>> >>>> On 27 Dec 2011, at 22:01, Chris Boot wrote: >>>> >>>>> Hi folks, >>>>> >>>>> Another networking issue I've run into, this time with e1000e >(Intel >>>> Corporation 82574L Gigabit). My new VM cluster appears to drop a NIC >- >>>> the port stops responding within Linux and shows the link as being >down >>>> with ethtool. My ISP says 'Ports running Half Duplex or reduced >speed' >>>> on the port. 
>>>>> >>>>> When the port stops working I see this in dmesg: >>>>> >>>>> [35481.659629] ------------[ cut here ]------------ >>>>> [35481.667837] WARNING: at net/sched/sch_generic.c:255 >>>> dev_watchdog+0xe9/0x148() >>>>> [35481.676370] Hardware name: X9SCL/X9SCM >>>>> [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 >timed >>>> out >>>>> [35481.684795] Modules linked in: hmac sha256_generic dlm configfs >>>> ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats >>>> cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode >>>> xt_NOTRACK ip_set_hash_net act_police cls_basic cls_flow cls_fw >cls_u32 >>>> sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_connlimit >>>> xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent >>>> ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN >>>> ipt_ecn ipt_CLUSTERIP ipt_ah nf_nat_tftp nf_nat_snmp_basic >>>> nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc >>>> nf_nat_h323 nf_nat_ftp ip6_queue nf_nat_amanda xt_set ip_set >>>> nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane >>>> nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp >>>> nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns >>>> nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 >>>> nf_conntrack_ftp ts_kmp nf_conntrack_amanda xt_TPROXY xt_NFLOG >>>> nfnetlink_log nf_tproxy_core xt_time xt_TCPMSS xt_tcpmss xt_sctp >>>> xt_policy xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_multiport >xt_mark >>>> xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP >>>> xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT ip6t_LOG >ip6t_REJECT >>>> nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_raw ipt_LOG >>>> xt_tcpudp ip6table_mangle xt_state iptable_nat nf_nat >nf_conntrack_ipv4 >>>> nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter >>>> ip_tables ip6table_filter ip6_tables x_tables bridge stp bonding >>>> w83627ehf 
hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel >>>> aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf >>>> ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache >cn >>>> loop kvm_intel kvm snd_pcm snd_timer snd iTCO_wdt soundcore psmouse >>>> snd_page_alloc i2c_i801 i2c_core cdc_acm iTCO_vendor_support joydev >>>> evdev serio_raw processor button pcspkr thermal_sys ext4 mbcache >jbd2 >>>> crc16 dm_mod raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid >hid >>>> ahci libahci libata igb ehci_hcd scsi_mod usbcore e1000e dca >usb_common >>>> [last unloaded: scsi_wait_scan] >>>>> [35481.685740] Pid: 0, comm: swapper/4 Not tainted 3.2.0-rc6+ #4 >>>>> [35481.685744] Call Trace: >>>>> [35481.685746]<IRQ> [<ffffffff810467ed>] ? >>>> warn_slowpath_common+0x78/0x8c >>>>> [35481.685849] [<ffffffff81046899>] ? warn_slowpath_fmt+0x45/0x4a >>>>> [35481.685875] [<ffffffff810aeaa0>] ? >>>> perf_event_task_tick+0x166/0x1ab >>>>> [35481.686018] [<ffffffff81294219>] ? netif_tx_lock+0x40/0x72 >>>>> [35481.686090] [<ffffffff8129437a>] ? dev_watchdog+0xe9/0x148 >>>>> [35481.686136] [<ffffffff81051e58>] ? run_timer_softirq+0x19a/0x261 >>>>> [35481.686176] [<ffffffff81294291>] ? netif_tx_unlock+0x46/0x46 >>>>> [35481.686215] [<ffffffff810659bb>] ? timekeeping_get_ns+0xd/0x2a >>>>> [35481.686286] [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x177 >>>>> [35481.686365] [<ffffffff81341d6c>] ? call_softirq+0x1c/0x30 >>>>> [35481.686530] [<ffffffff8100f841>] ? do_softirq+0x3c/0x7b >>>>> [35481.686580] [<ffffffff8104c03c>] ? irq_exit+0x3c/0x9a >>>>> [35481.686742] [<ffffffff81023e58>] ? >>>> smp_apic_timer_interrupt+0x74/0x82 >>>>> [35481.686820] [<ffffffff813405de>] ? >apic_timer_interrupt+0x6e/0x80 >>>>> [35481.686826]<EOI> [<ffffffff811ddf49>] ? intel_idle+0xea/0x119 >>>>> [35481.686991] [<ffffffff811ddf28>] ? intel_idle+0xc9/0x119 >>>>> [35481.687051] [<ffffffff8125dce3>] ? cpuidle_idle_call+0xec/0x179 >>>>> [35481.687089] [<ffffffff8100d255>] ? 
cpu_idle+0xa1/0xe8 >>>>> [35481.687143] [<ffffffff810706ee>] ? >arch_local_irq_restore+0x2/0x8 >>>>> [35481.687189] [<ffffffff8132d191>] ? start_secondary+0x1d5/0x1db >>>>> [35481.687234] ---[ end trace 01e9907674757948 ]--- >>>>> [35481.687817] e1000e 0000:05:00.0: eth2: Reset adapter >>>>> >>>>> To try to regain connectivity I bring down the bond and the >interface >>>> (eth2), then unload e1000e. Upon loading the module again: >>>>> >>>>> [36021.888962] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k >>>>> [36021.900258] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. >>>>> [36021.911446] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, >low) - >>>>> IRQ 20 >>>>> [36021.923204] e1000e 0000:00:19.0: setting latency timer to 64 >>>>> [36021.923372] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X >>>>> [36022.202737] e1000e 0000:00:19.0: eth2: (PCI >Express:2.5GT/s:Width >>>> x1) 00:25:90:56:ac:75 >>>>> [36022.214480] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network >>>> Connection >>>>> [36022.227506] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: >>>> FFFFFF-0FF >>>>> [36022.239789] e1000e 0000:05:00.0: Disabling ASPM L0s >>>>> [36022.239805] e1000e 0000:05:00.0: enabling device (0000 -> 0002) >>>>> [36022.239829] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, >low) - >>>>> IRQ 16 >>>>> [36022.239921] e1000e 0000:05:00.0: setting latency timer to 64 >>>>> [36022.240963] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X >>>>> [36022.240995] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X >>>>> [36022.241028] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X >>>>> [36022.241596] e1000e 0000:05:00.0: PCI INT A disabled >>>>> [36022.241606] e1000e: probe of 0000:05:00.0 failed with error -2 >>>>> [36022.304706] udevd[3634]: renamed network interface eth2 to eth3 >>>>> >>>>> I then don't get an eth2 interface. Only a reboot brings the >interface >>>> back. This has happened twice so far on this server in the past >week, >>>> both times using v3.2-rc7-3-g4962516. 
>>>>> >>>>> lspci -vnn shows: >>>>> >>>>> 05:00.0 Ethernet controller [0200]: Intel Corporation 82574L >Gigabit >>>> Network Connection [8086:10d3] >>>>> Subsystem: Super Micro Computer Inc Device [15d9:0000] >>>>> Flags: bus master, fast devsel, latency 0, IRQ 16 >>>>> Memory at fbd00000 (32-bit, non-prefetchable) [size=128K] >>>>> I/O ports at e000 [size=32] >>>>> Memory at fbd20000 (32-bit, non-prefetchable) [size=16K] >>>>> Capabilities: [c8] Power Management version 2 >>>>> Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ >>>>> Capabilities: [e0] Express Endpoint, MSI 00 >>>>> Capabilities: [a0] MSI-X: Enable+ Count=5 Masked- >>>>> Capabilities: [100] Advanced Error Reporting >>>>> Capabilities: [140] Device Serial Number 00-25-90-ff-ff-56-ac- >>>> 74 >>>>> Kernel driver in use: e1000e >>>> >>>> I've just had this happen on my other (identical) server with a >nearly >>>> identical trace. Is there anything I can do do avoid this at all or >at >>>> least help narrow down the problem? >>>> >>>> Cheers, >>>> Chris >>>> >>>> -- >>>> Chris Boot >>>> bootc@bootc.net >>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe netdev" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> Hello, >>> >>> Sorry for the delay in responding. We have seen some hang issues >using >>> MSI-X on 82574 parts. Can you try reloading the driver the IntMode >>> module parameter. IntMode=1 (you'll need a setting for each device in >>> the system so two adapters would be IntMode=1,1) See if that changes >>> the symptom you are seeing with this part. That setting will make >sure >>> the adapter uses MSI interrupts instead of MSI-X. >> >> Carolyn, >> >> I'll give this a go next time I reproduce it. I built a new kernel >with >> more debugging and so far it hasn't yet triggered again... 
> >Upgrading to a more recent 3.2-rc snapshot seems to have cured the >problem - I haven't had an interface stop responding since. Must have >been some seemingly unrelated patch that I can't seem to locate. > >Cheers, >Chris > >-- >Chris Boot >bootc@bootc.net Thanks for letting me know Chris. For my own edification, are you still configured with MSI-X? Thanks, Carolyn Carolyn Wyborny Linux Development LAN Access Division Intel Corporation ^ permalink raw reply [flat|nested] 32+ messages in thread
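One way to answer the question above is to look at the Enable+/Enable- flags on the MSI and MSI-X capability lines of the lspci output already quoted in this thread. The sketch below assumes that capability-line layout and takes the text as a string; the function name is hypothetical and nothing here calls lspci itself.

```python
import re

# Matches capability lines such as:
#   Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
#   Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
CAP_RE = re.compile(r"Capabilities: \[[0-9a-f]+\] (MSI-X|MSI): Enable([+-])")


def interrupt_modes(lspci_text: str) -> dict:
    """Map 'MSI' and 'MSI-X' to True/False according to their Enable flag."""
    modes = {}
    for name, flag in CAP_RE.findall(lspci_text):
        modes[name] = (flag == "+")
    return modes


SAMPLE = (
    "Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+\n"
    "Capabilities: [e0] Express Endpoint, MSI 00\n"
    "Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-\n"
)

if __name__ == "__main__":
    print(interrupt_modes(SAMPLE))
```

For the lspci output quoted earlier in the thread this reports MSI-X as enabled and MSI as disabled, which is the configuration in which the hang was originally seen.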
* Re: e1000e interface hang on 82574L 2012-01-16 15:56 ` Wyborny, Carolyn @ 2012-01-16 16:04 ` Chris Boot 2012-03-17 15:59 ` Chris Boot 0 siblings, 1 reply; 32+ messages in thread From: Chris Boot @ 2012-01-16 16:04 UTC (permalink / raw) To: Wyborny, Carolyn; +Cc: netdev, lkml, e1000-devel On 16/01/2012 15:56, Wyborny, Carolyn wrote: > > >> -----Original Message----- >> From: Chris Boot [mailto:bootc@bootc.net] >> Sent: Sunday, January 15, 2012 3:11 AM >> To: Wyborny, Carolyn >> Cc: netdev; lkml; e1000-devel@lists.sourceforge.net >> Subject: Re: e1000e interface hang on 82574L >> >> On 04/01/2012 17:12, Chris Boot wrote: >>> On 03/01/2012 00:02, Wyborny, Carolyn wrote: >>>> >>>> >>>>> -----Original Message----- >>>>> From: netdev-owner@vger.kernel.org [mailto:netdev- >> owner@vger.kernel.org] >>>>> On Behalf Of Chris Boot >>>>> Sent: Saturday, December 31, 2011 1:32 AM >>>>> To: netdev; lkml; e1000-devel@lists.sourceforge.net >>>>> Subject: Re: e1000e interface hang on 82574L >>>>> >>>>> On 27 Dec 2011, at 22:01, Chris Boot wrote: >>>>> >>>>>> Hi folks, >>>>>> >>>>>> Another networking issue I've run into, this time with e1000e >> (Intel >>>>> Corporation 82574L Gigabit). My new VM cluster appears to drop a NIC >> - >>>>> the port stops responding within Linux and shows the link as being >> down >>>>> with ethtool. My ISP says 'Ports running Half Duplex or reduced >> speed' >>>>> on the port. 
>>>>>> >>>>>> When the port stops working I see this in dmesg: >>>>>> >>>>>> [35481.659629] ------------[ cut here ]------------ >>>>>> [35481.667837] WARNING: at net/sched/sch_generic.c:255 >>>>> dev_watchdog+0xe9/0x148() >>>>>> [35481.676370] Hardware name: X9SCL/X9SCM >>>>>> [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 >> timed >>>>> out >>>>>> [35481.684795] Modules linked in: hmac sha256_generic dlm configfs >>>>> ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats >>>>> cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode >>>>> xt_NOTRACK ip_set_hash_net act_police cls_basic cls_flow cls_fw >> cls_u32 >>>>> sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_connlimit >>>>> xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent >>>>> ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN >>>>> ipt_ecn ipt_CLUSTERIP ipt_ah nf_nat_tftp nf_nat_snmp_basic >>>>> nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc >>>>> nf_nat_h323 nf_nat_ftp ip6_queue nf_nat_amanda xt_set ip_set >>>>> nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane >>>>> nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp >>>>> nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns >>>>> nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 >>>>> nf_conntrack_ftp ts_kmp nf_conntrack_amanda xt_TPROXY xt_NFLOG >>>>> nfnetlink_log nf_tproxy_core xt_time xt_TCPMSS xt_tcpmss xt_sctp >>>>> xt_policy xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_multiport >> xt_mark >>>>> xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP >>>>> xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT ip6t_LOG >> ip6t_REJECT >>>>> nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_raw ipt_LOG >>>>> xt_tcpudp ip6table_mangle xt_state iptable_nat nf_nat >> nf_conntrack_ipv4 >>>>> nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter >>>>> ip_tables ip6table_filter ip6_tables 
x_tables bridge stp bonding >>>>> w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel >>>>> aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf >>>>> ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache >> cn >>>>> loop kvm_intel kvm snd_pcm snd_timer snd iTCO_wdt soundcore psmouse >>>>> snd_page_alloc i2c_i801 i2c_core cdc_acm iTCO_vendor_support joydev >>>>> evdev serio_raw processor button pcspkr thermal_sys ext4 mbcache >> jbd2 >>>>> crc16 dm_mod raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid >> hid >>>>> ahci libahci libata igb ehci_hcd scsi_mod usbcore e1000e dca >> usb_common >>>>> [last unloaded: scsi_wait_scan] >>>>>> [35481.685740] Pid: 0, comm: swapper/4 Not tainted 3.2.0-rc6+ #4 >>>>>> [35481.685744] Call Trace: >>>>>> [35481.685746]<IRQ> [<ffffffff810467ed>] ? >>>>> warn_slowpath_common+0x78/0x8c >>>>>> [35481.685849] [<ffffffff81046899>] ? warn_slowpath_fmt+0x45/0x4a >>>>>> [35481.685875] [<ffffffff810aeaa0>] ? >>>>> perf_event_task_tick+0x166/0x1ab >>>>>> [35481.686018] [<ffffffff81294219>] ? netif_tx_lock+0x40/0x72 >>>>>> [35481.686090] [<ffffffff8129437a>] ? dev_watchdog+0xe9/0x148 >>>>>> [35481.686136] [<ffffffff81051e58>] ? run_timer_softirq+0x19a/0x261 >>>>>> [35481.686176] [<ffffffff81294291>] ? netif_tx_unlock+0x46/0x46 >>>>>> [35481.686215] [<ffffffff810659bb>] ? timekeeping_get_ns+0xd/0x2a >>>>>> [35481.686286] [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x177 >>>>>> [35481.686365] [<ffffffff81341d6c>] ? call_softirq+0x1c/0x30 >>>>>> [35481.686530] [<ffffffff8100f841>] ? do_softirq+0x3c/0x7b >>>>>> [35481.686580] [<ffffffff8104c03c>] ? irq_exit+0x3c/0x9a >>>>>> [35481.686742] [<ffffffff81023e58>] ? >>>>> smp_apic_timer_interrupt+0x74/0x82 >>>>>> [35481.686820] [<ffffffff813405de>] ? >> apic_timer_interrupt+0x6e/0x80 >>>>>> [35481.686826]<EOI> [<ffffffff811ddf49>] ? intel_idle+0xea/0x119 >>>>>> [35481.686991] [<ffffffff811ddf28>] ? 
intel_idle+0xc9/0x119 >>>>>> [35481.687051] [<ffffffff8125dce3>] ? cpuidle_idle_call+0xec/0x179 >>>>>> [35481.687089] [<ffffffff8100d255>] ? cpu_idle+0xa1/0xe8 >>>>>> [35481.687143] [<ffffffff810706ee>] ? >> arch_local_irq_restore+0x2/0x8 >>>>>> [35481.687189] [<ffffffff8132d191>] ? start_secondary+0x1d5/0x1db >>>>>> [35481.687234] ---[ end trace 01e9907674757948 ]--- >>>>>> [35481.687817] e1000e 0000:05:00.0: eth2: Reset adapter >>>>>> >>>>>> To try to regain connectivity I bring down the bond and the >> interface >>>>> (eth2), then unload e1000e. Upon loading the module again: >>>>>> >>>>>> [36021.888962] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k >>>>>> [36021.900258] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. >>>>>> [36021.911446] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, >> low) - >>>>>> IRQ 20 >>>>>> [36021.923204] e1000e 0000:00:19.0: setting latency timer to 64 >>>>>> [36021.923372] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X >>>>>> [36022.202737] e1000e 0000:00:19.0: eth2: (PCI >> Express:2.5GT/s:Width >>>>> x1) 00:25:90:56:ac:75 >>>>>> [36022.214480] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network >>>>> Connection >>>>>> [36022.227506] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: >>>>> FFFFFF-0FF >>>>>> [36022.239789] e1000e 0000:05:00.0: Disabling ASPM L0s >>>>>> [36022.239805] e1000e 0000:05:00.0: enabling device (0000 -> 0002) >>>>>> [36022.239829] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, >> low) - >>>>>> IRQ 16 >>>>>> [36022.239921] e1000e 0000:05:00.0: setting latency timer to 64 >>>>>> [36022.240963] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X >>>>>> [36022.240995] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X >>>>>> [36022.241028] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X >>>>>> [36022.241596] e1000e 0000:05:00.0: PCI INT A disabled >>>>>> [36022.241606] e1000e: probe of 0000:05:00.0 failed with error -2 >>>>>> [36022.304706] udevd[3634]: renamed network interface eth2 to eth3 >>>>>> >>>>>> I then don't get an 
eth2 interface. Only a reboot brings the >> interface >>>>> back. This has happened twice so far on this server in the past >> week, >>>>> both times using v3.2-rc7-3-g4962516. >>>>>> >>>>>> lspci -vnn shows: >>>>>> >>>>>> 05:00.0 Ethernet controller [0200]: Intel Corporation 82574L >> Gigabit >>>>> Network Connection [8086:10d3] >>>>>> Subsystem: Super Micro Computer Inc Device [15d9:0000] >>>>>> Flags: bus master, fast devsel, latency 0, IRQ 16 >>>>>> Memory at fbd00000 (32-bit, non-prefetchable) [size=128K] >>>>>> I/O ports at e000 [size=32] >>>>>> Memory at fbd20000 (32-bit, non-prefetchable) [size=16K] >>>>>> Capabilities: [c8] Power Management version 2 >>>>>> Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ >>>>>> Capabilities: [e0] Express Endpoint, MSI 00 >>>>>> Capabilities: [a0] MSI-X: Enable+ Count=5 Masked- >>>>>> Capabilities: [100] Advanced Error Reporting >>>>>> Capabilities: [140] Device Serial Number 00-25-90-ff-ff-56-ac- >>>>> 74 >>>>>> Kernel driver in use: e1000e >>>>> >>>>> I've just had this happen on my other (identical) server with a >> nearly >>>>> identical trace. Is there anything I can do do avoid this at all or >> at >>>>> least help narrow down the problem? >>>>> >>>>> Cheers, >>>>> Chris >>>>> >>>>> -- >>>>> Chris Boot >>>>> bootc@bootc.net >>>>> >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> >>>> Hello, >>>> >>>> Sorry for the delay in responding. We have seen some hang issues >> using >>>> MSI-X on 82574 parts. Can you try reloading the driver the IntMode >>>> module parameter. IntMode=1 (you'll need a setting for each device in >>>> the system so two adapters would be IntMode=1,1) See if that changes >>>> the symptom you are seeing with this part. That setting will make >> sure >>>> the adapter uses MSI interrupts instead of MSI-X. 
>>> >>> Carolyn, >>> >>> I'll give this a go next time I reproduce it. I built a new kernel >> with >>> more debugging and so far it hasn't yet triggered again... >> >> Upgrading to a more recent 3.2-rc snapshot seems to have cured the >> problem - I haven't had an interface stop responding since. Must have >> been some seemingly unrelated patch that I can't seem to locate. >> >> Cheers, >> Chris >> >> -- >> Chris Boot >> bootc@bootc.net > Thanks for letting me know Chris. For my own edification, are you still configured with MSI-X? Carolyn, I have made no changes to my configuration to change the interrupt format. I see the following in dmesg at boot: [ 3.276819] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k [ 3.288193] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. [ 3.299842] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20 [ 3.299909] e1000e 0000:00:19.0: setting latency timer to 64 [ 3.352929] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X [ 3.710080] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:75 [ 3.710082] e1000e 0000:00:19.0: eth2: Intel(R) PRO/1000 Network Connection [ 3.710670] e1000e 0000:00:19.0: eth2: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF [ 3.710678] e1000e 0000:05:00.0: Disabling ASPM L0s [ 3.710850] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 [ 3.710951] e1000e 0000:05:00.0: setting latency timer to 64 [ 3.712757] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X [ 3.712787] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X [ 3.712805] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X [ 3.830364] e1000e 0000:05:00.0: eth3: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:74 [ 3.830366] e1000e 0000:05:00.0: eth3: Intel(R) PRO/1000 Network Connection [ 3.830510] e1000e 0000:05:00.0: eth3: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF /proc/interrupts shows: 45: 615958 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth3 64: 65126106 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth2-rx-0 65: 52700392 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth2-tx-0 66: 2 0 0 0 0 0 
0 0 IR-PCI-MSI-edge eth2

HTH,
Chris

--
Chris Boot
bootc@bootc.net

^ permalink raw reply [flat|nested] 32+ messages in thread
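The /proc/interrupts listing above is what answers the MSI-X question: eth2 holds three vectors (rx, tx and one other), whereas eth3 holds one. A minimal Python sketch of that check follows. It assumes interfaces are named ethN and that per-queue vectors carry suffixes such as -rx-0 and -tx-0, as in the listing; the function names are hypothetical and the sample rows are copied from this thread. Counting vectors cannot tell MSI from a legacy interrupt, so the type column (IR-PCI-MSI-edge here) still needs to be read by eye.

```python
import re
from collections import defaultdict

# Rows copied from the /proc/interrupts output quoted in this thread.
SAMPLE = """\
 45: 615958 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth3
 64: 65126106 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth2-rx-0
 65: 52700392 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth2-tx-0
 66: 2 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth2
"""

# Assumption: NIC vectors are named "ethN" or "ethN-<suffix>".
NAME_RE = re.compile(r"^(eth\d+)(?:-.*)?$")


def vectors_per_interface(text):
    """Return {interface: [irq numbers]} for every ethN vector in text."""
    result = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Numeric IRQ rows look like "64: <counts...> <type> <name>".
        if len(fields) < 3 or not fields[0].endswith(":"):
            continue
        irq = fields[0].rstrip(":")
        if not irq.isdigit():
            continue  # skip rows such as NMI: or LOC:
        match = NAME_RE.match(fields[-1])
        if match:
            result[match.group(1)].append(int(irq))
    return dict(result)


def guess_mode(irqs):
    """Heuristic only: several vectors for one NIC suggests MSI-X."""
    if len(irqs) > 1:
        return "multiple vectors (MSI-X likely)"
    return "single vector (MSI or legacy)"


for iface, irqs in sorted(vectors_per_interface(SAMPLE).items()):
    print(iface, irqs, guess_mode(irqs))
```

On a live system the same function could be fed the contents of /proc/interrupts instead of SAMPLE.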
* Re: e1000e interface hang on 82574L 2012-01-16 16:04 ` Chris Boot @ 2012-03-17 15:59 ` Chris Boot 2012-03-17 17:54 ` Chris Boot 0 siblings, 1 reply; 32+ messages in thread From: Chris Boot @ 2012-03-17 15:59 UTC (permalink / raw) To: Wyborny, Carolyn; +Cc: netdev, lkml, e1000-devel On 16/01/2012 16:04, Chris Boot wrote: > On 16/01/2012 15:56, Wyborny, Carolyn wrote: >> >> >>> -----Original Message----- >>> From: Chris Boot [mailto:bootc@bootc.net] >>> Sent: Sunday, January 15, 2012 3:11 AM >>> To: Wyborny, Carolyn >>> Cc: netdev; lkml; e1000-devel@lists.sourceforge.net >>> Subject: Re: e1000e interface hang on 82574L >>> >>> On 04/01/2012 17:12, Chris Boot wrote: >>>> On 03/01/2012 00:02, Wyborny, Carolyn wrote: >>>>> >>>>> >>>>>> -----Original Message----- >>>>>> From: netdev-owner@vger.kernel.org [mailto:netdev- >>> owner@vger.kernel.org] >>>>>> On Behalf Of Chris Boot >>>>>> Sent: Saturday, December 31, 2011 1:32 AM >>>>>> To: netdev; lkml; e1000-devel@lists.sourceforge.net >>>>>> Subject: Re: e1000e interface hang on 82574L >>>>>> >>>>>> On 27 Dec 2011, at 22:01, Chris Boot wrote: >>>>>> >>>>>>> Hi folks, >>>>>>> >>>>>>> Another networking issue I've run into, this time with e1000e >>> (Intel >>>>>> Corporation 82574L Gigabit). My new VM cluster appears to drop a NIC >>> - >>>>>> the port stops responding within Linux and shows the link as being >>> down >>>>>> with ethtool. My ISP says 'Ports running Half Duplex or reduced >>> speed' >>>>>> on the port. 
>>>>>>> >>>>>>> When the port stops working I see this in dmesg: >>>>>>> >>>>>>> [35481.659629] ------------[ cut here ]------------ >>>>>>> [35481.667837] WARNING: at net/sched/sch_generic.c:255 >>>>>> dev_watchdog+0xe9/0x148() >>>>>>> [35481.676370] Hardware name: X9SCL/X9SCM >>>>>>> [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 >>> timed >>>>>> out >>>>>>> [35481.684795] Modules linked in: hmac sha256_generic dlm configfs >>>>>> ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats >>>>>> cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode >>>>>> xt_NOTRACK ip_set_hash_net act_police cls_basic cls_flow cls_fw >>> cls_u32 >>>>>> sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_connlimit >>>>>> xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent >>>>>> ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN >>>>>> ipt_ecn ipt_CLUSTERIP ipt_ah nf_nat_tftp nf_nat_snmp_basic >>>>>> nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc >>>>>> nf_nat_h323 nf_nat_ftp ip6_queue nf_nat_amanda xt_set ip_set >>>>>> nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane >>>>>> nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp >>>>>> nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns >>>>>> nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 >>>>>> nf_conntrack_ftp ts_kmp nf_conntrack_amanda xt_TPROXY xt_NFLOG >>>>>> nfnetlink_log nf_tproxy_core xt_time xt_TCPMSS xt_tcpmss xt_sctp >>>>>> xt_policy xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_multiport >>> xt_mark >>>>>> xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP >>>>>> xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT ip6t_LOG >>> ip6t_REJECT >>>>>> nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_raw ipt_LOG >>>>>> xt_tcpudp ip6table_mangle xt_state iptable_nat nf_nat >>> nf_conntrack_ipv4 >>>>>> nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter >>>>>> 
ip_tables ip6table_filter ip6_tables x_tables bridge stp bonding >>>>>> w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel >>>>>> aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf >>>>>> ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache >>> cn >>>>>> loop kvm_intel kvm snd_pcm snd_timer snd iTCO_wdt soundcore psmouse >>>>>> snd_page_alloc i2c_i801 i2c_core cdc_acm iTCO_vendor_support joydev >>>>>> evdev serio_raw processor button pcspkr thermal_sys ext4 mbcache >>> jbd2 >>>>>> crc16 dm_mod raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid >>> hid >>>>>> ahci libahci libata igb ehci_hcd scsi_mod usbcore e1000e dca >>> usb_common >>>>>> [last unloaded: scsi_wait_scan] >>>>>>> [35481.685740] Pid: 0, comm: swapper/4 Not tainted 3.2.0-rc6+ #4 >>>>>>> [35481.685744] Call Trace: >>>>>>> [35481.685746]<IRQ> [<ffffffff810467ed>] ? >>>>>> warn_slowpath_common+0x78/0x8c >>>>>>> [35481.685849] [<ffffffff81046899>] ? warn_slowpath_fmt+0x45/0x4a >>>>>>> [35481.685875] [<ffffffff810aeaa0>] ? >>>>>> perf_event_task_tick+0x166/0x1ab >>>>>>> [35481.686018] [<ffffffff81294219>] ? netif_tx_lock+0x40/0x72 >>>>>>> [35481.686090] [<ffffffff8129437a>] ? dev_watchdog+0xe9/0x148 >>>>>>> [35481.686136] [<ffffffff81051e58>] ? run_timer_softirq+0x19a/0x261 >>>>>>> [35481.686176] [<ffffffff81294291>] ? netif_tx_unlock+0x46/0x46 >>>>>>> [35481.686215] [<ffffffff810659bb>] ? timekeeping_get_ns+0xd/0x2a >>>>>>> [35481.686286] [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x177 >>>>>>> [35481.686365] [<ffffffff81341d6c>] ? call_softirq+0x1c/0x30 >>>>>>> [35481.686530] [<ffffffff8100f841>] ? do_softirq+0x3c/0x7b >>>>>>> [35481.686580] [<ffffffff8104c03c>] ? irq_exit+0x3c/0x9a >>>>>>> [35481.686742] [<ffffffff81023e58>] ? >>>>>> smp_apic_timer_interrupt+0x74/0x82 >>>>>>> [35481.686820] [<ffffffff813405de>] ? >>> apic_timer_interrupt+0x6e/0x80 >>>>>>> [35481.686826]<EOI> [<ffffffff811ddf49>] ? 
intel_idle+0xea/0x119 >>>>>>> [35481.686991] [<ffffffff811ddf28>] ? intel_idle+0xc9/0x119 >>>>>>> [35481.687051] [<ffffffff8125dce3>] ? cpuidle_idle_call+0xec/0x179 >>>>>>> [35481.687089] [<ffffffff8100d255>] ? cpu_idle+0xa1/0xe8 >>>>>>> [35481.687143] [<ffffffff810706ee>] ? >>> arch_local_irq_restore+0x2/0x8 >>>>>>> [35481.687189] [<ffffffff8132d191>] ? start_secondary+0x1d5/0x1db >>>>>>> [35481.687234] ---[ end trace 01e9907674757948 ]--- >>>>>>> [35481.687817] e1000e 0000:05:00.0: eth2: Reset adapter >>>>>>> >>>>>>> To try to regain connectivity I bring down the bond and the >>> interface >>>>>> (eth2), then unload e1000e. Upon loading the module again: >>>>>>> >>>>>>> [36021.888962] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k >>>>>>> [36021.900258] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. >>>>>>> [36021.911446] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, >>> low) - >>>>>>> IRQ 20 >>>>>>> [36021.923204] e1000e 0000:00:19.0: setting latency timer to 64 >>>>>>> [36021.923372] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X >>>>>>> [36022.202737] e1000e 0000:00:19.0: eth2: (PCI >>> Express:2.5GT/s:Width >>>>>> x1) 00:25:90:56:ac:75 >>>>>>> [36022.214480] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network >>>>>> Connection >>>>>>> [36022.227506] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: >>>>>> FFFFFF-0FF >>>>>>> [36022.239789] e1000e 0000:05:00.0: Disabling ASPM L0s >>>>>>> [36022.239805] e1000e 0000:05:00.0: enabling device (0000 -> 0002) >>>>>>> [36022.239829] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, >>> low) - >>>>>>> IRQ 16 >>>>>>> [36022.239921] e1000e 0000:05:00.0: setting latency timer to 64 >>>>>>> [36022.240963] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X >>>>>>> [36022.240995] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X >>>>>>> [36022.241028] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X >>>>>>> [36022.241596] e1000e 0000:05:00.0: PCI INT A disabled >>>>>>> [36022.241606] e1000e: probe of 0000:05:00.0 failed with error -2 
>>>>>>> [36022.304706] udevd[3634]: renamed network interface eth2 to eth3 >>>>>>> >>>>>>> I then don't get an eth2 interface. Only a reboot brings the >>> interface >>>>>> back. This has happened twice so far on this server in the past >>> week, >>>>>> both times using v3.2-rc7-3-g4962516. >>>>>>> >>>>>>> lspci -vnn shows: >>>>>>> >>>>>>> 05:00.0 Ethernet controller [0200]: Intel Corporation 82574L >>> Gigabit >>>>>> Network Connection [8086:10d3] >>>>>>> Subsystem: Super Micro Computer Inc Device [15d9:0000] >>>>>>> Flags: bus master, fast devsel, latency 0, IRQ 16 >>>>>>> Memory at fbd00000 (32-bit, non-prefetchable) [size=128K] >>>>>>> I/O ports at e000 [size=32] >>>>>>> Memory at fbd20000 (32-bit, non-prefetchable) [size=16K] >>>>>>> Capabilities: [c8] Power Management version 2 >>>>>>> Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ >>>>>>> Capabilities: [e0] Express Endpoint, MSI 00 >>>>>>> Capabilities: [a0] MSI-X: Enable+ Count=5 Masked- >>>>>>> Capabilities: [100] Advanced Error Reporting >>>>>>> Capabilities: [140] Device Serial Number 00-25-90-ff-ff-56-ac- >>>>>> 74 >>>>>>> Kernel driver in use: e1000e >>>>>> >>>>>> I've just had this happen on my other (identical) server with a >>> nearly >>>>>> identical trace. Is there anything I can do do avoid this at all or >>> at >>>>>> least help narrow down the problem? >>>>>> >>>>>> Cheers, >>>>>> Chris >>>>>> >>>>>> -- >>>>>> Chris Boot >>>>>> bootc@bootc.net >>>>>> >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in >>>>>> the body of a message to majordomo@vger.kernel.org >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> >>>>> Hello, >>>>> >>>>> Sorry for the delay in responding. We have seen some hang issues >>> using >>>>> MSI-X on 82574 parts. Can you try reloading the driver the IntMode >>>>> module parameter. 
IntMode=1 (you'll need a setting for each device in >>>>> the system so two adapters would be IntMode=1,1) See if that changes >>>>> the symptom you are seeing with this part. That setting will make >>> sure >>>>> the adapter uses MSI interrupts instead of MSI-X. >>>> >>>> Carolyn, >>>> >>>> I'll give this a go next time I reproduce it. I built a new kernel >>> with >>>> more debugging and so far it hasn't yet triggered again... >>> >>> Upgrading to a more recent 3.2-rc snapshot seems to have cured the >>> problem - I haven't had an interface stop responding since. Must have >>> been some seemingly unrelated patch that I can't seem to locate. >>> >>> Cheers, >>> Chris >>> >>> -- >>> Chris Boot >>> bootc@bootc.net >> Thanks for letting me know Chris. For my own edification, are you >> still configured with MSI-X? > > Carolyn, > > I have made no changes to my configuration to change the interrupt > format. I see the following in dmesg at boot: > > [ 3.276819] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k > [ 3.288193] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. 
>
> [ 3.299842] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
> [ 3.299909] e1000e 0000:00:19.0: setting latency timer to 64
> [ 3.352929] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
> [ 3.710080] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:75
> [ 3.710082] e1000e 0000:00:19.0: eth2: Intel(R) PRO/1000 Network Connection
> [ 3.710670] e1000e 0000:00:19.0: eth2: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF
>
> [ 3.710678] e1000e 0000:05:00.0: Disabling ASPM L0s
> [ 3.710850] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> [ 3.710951] e1000e 0000:05:00.0: setting latency timer to 64
> [ 3.712757] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X
> [ 3.712787] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X
> [ 3.712805] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X
> [ 3.830364] e1000e 0000:05:00.0: eth3: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:74
> [ 3.830366] e1000e 0000:05:00.0: eth3: Intel(R) PRO/1000 Network Connection
> [ 3.830510] e1000e 0000:05:00.0: eth3: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF
>
> /proc/interrupts shows:
>
> 45: 615958 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth3
> 64: 65126106 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth2-rx-0
> 65: 52700392 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth2-tx-0
> 66: 2 0 0 0 0 0 0 0 IR-PCI-MSI-edge eth2

Carolyn,

I've just had the opportunity to upgrade to a 3.2.9 kernel on these systems and have made sure e1000e is loaded with IntMode=1,1. One of the servers was up for only 5.5 hours before the NIC crashed/stopped working again.

Here is the latest dmesg after the failure:

[ 3.254553] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k
[ 3.265852] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[ 3.266034] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[ 3.266067] e1000e 0000:00:19.0: setting latency timer to 64
[ 3.266460] e1000e 0000:00:19.0: (unregistered net_device): Interrupt Mode set to 1
[ 3.266800] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[ 3.611840] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:75
[ 3.611855] e1000e 0000:00:19.0: eth2: Intel(R) PRO/1000 Network Connection
[ 3.612303] e1000e 0000:00:19.0: eth2: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF
[ 3.612350] e1000e 0000:05:00.0: Disabling ASPM L0s
[ 3.612594] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 3.612812] e1000e 0000:05:00.0: setting latency timer to 64
[ 3.613582] e1000e 0000:05:00.0: (unregistered net_device): Interrupt Mode set to 1
[ 3.614156] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X
[ 3.734442] e1000e 0000:05:00.0: eth3: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:74
[ 3.734465] e1000e 0000:05:00.0: eth3: Intel(R) PRO/1000 Network Connection
[ 3.734689] e1000e 0000:05:00.0: eth3: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF
[ 13.799848] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X
[ 13.855646] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X
[ 14.031739] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[ 14.087566] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[ 16.112504] e1000e: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
[ 16.124129] e1000e 0000:05:00.0: eth2: 10/100 speed: disabling TSO

And here is the output just as it hangs:

[19745.327241] ------------[ cut here ]------------
[19745.334501] WARNING: at /build/buildd-linux-2.6_3.2.9-1-amd64-KTPapN/linux-2.6-3.2.9/debian/build/source_amd64_none/net/sched/sch_generic.c:255 dev_watchdog+0xe9/0x148()
[19745.350441] Hardware name: X9SCL/X9SCM
[19745.358859] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out
[19745.367287] Modules linked in: hmac sha256_generic dlm configfs ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats cpufreq_conservative cpufreq_userspace
cpufreq_powersave microcode ip6_queue xt_TCPMSS xt_sctp ip6t_LOG ip6t_REJECT nf_conntrack_ipv6 ip6table_raw ip6table_mangle ip6table_filter xt_NOTRACK ip_set_hash_net act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_statistic xt_CT xt_time xt_connlimit xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent xt_policy ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp xt_TPROXY nf_tproxy_core ip6_tables nf_defrag_ipv6 xt_tcpmss xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_conntrack xt_connmark xt_CLASSIFY xt_AUDIT ipt_LOG xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables x_tables kvm_intel kvm bridge stp bonding w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache cn loop snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt i2c_i801 psmouse cdc_acm processor i2c_core iTCO_vendor_support serio_raw pcspkr thermal_sys button evdev joydev ext4 mbcache jbd2 crc16 dm_mod raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid hid ahci libahci libata ehci_hcd usbcore igb scsi_mod e1000e usb_common dca [last unloaded: scsi_wait_scan] [19745.502559] Pid: 0, comm: swapper/0 
Not tainted 3.2.0-2-amd64 #1
[19745.502561] Call Trace:
[19745.502562] <IRQ> [<ffffffff81046879>] ? warn_slowpath_common+0x78/0x8c
[19745.502570] [<ffffffff81046925>] ? warn_slowpath_fmt+0x45/0x4a
[19745.502574] [<ffffffff8129aa11>] ? netif_tx_lock+0x40/0x72
[19745.502588] [<ffffffff8129ab72>] ? dev_watchdog+0xe9/0x148
[19745.502601] [<ffffffff81051f38>] ? run_timer_softirq+0x19a/0x261
[19745.502603] [<ffffffff8129aa89>] ? netif_tx_unlock+0x46/0x46
[19745.502606] [<ffffffff81065a73>] ? timekeeping_get_ns+0xd/0x2a
[19745.502609] [<ffffffff8104be98>] ? __do_softirq+0xb9/0x177
[19745.502612] [<ffffffff8134892c>] ? call_softirq+0x1c/0x30
[19745.502615] [<ffffffff8100f8e5>] ? do_softirq+0x3c/0x7b
[19745.502617] [<ffffffff8104c100>] ? irq_exit+0x3c/0x9a
[19745.502621] [<ffffffff81023f18>] ? smp_apic_timer_interrupt+0x74/0x82
[19745.502624] [<ffffffff8134719e>] ? apic_timer_interrupt+0x6e/0x80
[19745.502625] <EOI> [<ffffffff81070761>] ? arch_local_irq_save+0x11/0x17
[19745.502631] [<ffffffff811e45d9>] ? intel_idle+0xea/0x119
[19745.502633] [<ffffffff811e45b8>] ? intel_idle+0xc9/0x119
[19745.502637] [<ffffffff812643f7>] ? cpuidle_idle_call+0xec/0x179
[19745.502639] [<ffffffff8100d248>] ? cpu_idle+0xa5/0xf2
[19745.502641] [<ffffffff816aab3d>] ? start_kernel+0x3bd/0x3c8
[19745.502643] [<ffffffff816aa140>] ? early_idt_handlers+0x140/0x140
[19745.502645] [<ffffffff816aa3c4>] ? x86_64_start_kernel+0x104/0x111
[19745.502646] ---[ end trace 10e791a6f31603fa ]---
[19745.503125] e1000e 0000:05:00.0: eth2: Reset adapter

Once again, rmmod e1000e followed by modprobe e1000e does not fix the problem:

[20508.158919] e1000e 0000:05:00.0: PCI INT A disabled
[20508.194927] e1000e 0000:00:19.0: PCI INT A disabled
[20511.119765] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k
[20511.130711] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[20511.141206] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[20511.151797] e1000e 0000:00:19.0: setting latency timer to 64
[20511.151921] e1000e 0000:00:19.0: (unregistered net_device): Interrupt Mode set to 1
[20511.162853] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[20511.528436] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:75
[20511.539261] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network Connection
[20511.550066] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF
[20511.561027] e1000e 0000:05:00.0: Disabling ASPM L0s
[20511.571883] e1000e 0000:05:00.0: enabling device (0000 -> 0002)
[20511.575224] udevd[5449]: renamed network interface eth2 to eth3
[20511.594234] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[20511.605703] e1000e 0000:05:00.0: setting latency timer to 64
[20511.605871] e1000e 0000:05:00.0: (unregistered net_device): Interrupt Mode set to 1
[20511.617706] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X
[20511.617828] e1000e 0000:05:00.0: PCI INT A disabled
[20511.629565] e1000e: probe of 0000:05:00.0 failed with error -2

Please let me know if/how I can debug this further.

Many thanks,
Chris

--
Chris Boot
bootc@bootc.net

^ permalink raw reply [flat|nested] 32+ messages in thread
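Two log signatures recur in the report above: the NETDEV WATCHDOG transmit timeout and the failed probe on reload. The Python sketch below, under stated assumptions, picks both out of saved dmesg text and decodes the probe error number. The function name scan and the tuple layout are hypothetical, and the two patterns are written against the exact wording in this thread, so other kernels may format these messages differently. On Linux, error -2 corresponds to ENOENT; the thread does not say what the driver failed to find.

```python
import errno
import re

# Patterns follow the message wording seen in this thread's dmesg output.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (\S+) \(([^)]+)\): transmit queue (\d+) timed out"
)
PROBE_RE = re.compile(r"(\S+): probe of (\S+) failed with error (-?\d+)")

# Lines copied from the logs quoted in this thread.
SAMPLE = """\
[19745.358859] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out
[19745.503125] e1000e 0000:05:00.0: eth2: Reset adapter
[20511.629565] e1000e: probe of 0000:05:00.0 failed with error -2
"""


def scan(log_text):
    """Return a list of event tuples found in dmesg-style text.

    ("tx_timeout", interface, driver, queue) or
    ("probe_failed", pci_address, driver, errno_name)
    """
    events = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            iface, driver, queue = match.groups()
            events.append(("tx_timeout", iface, driver, int(queue)))
            continue
        match = PROBE_RE.search(line)
        if match:
            driver, address, code = match.groups()
            # The kernel logs negative errno values; look up the symbolic name.
            name = errno.errorcode.get(abs(int(code)), "unknown")
            events.append(("probe_failed", address, driver, name))
    return events


for event in scan(SAMPLE):
    print(event)
```

Run against a periodic dmesg capture, a scan like this could timestamp each hang and show whether the probe failure always follows a watchdog reset on the same PCI address.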
* Re: e1000e interface hang on 82574L 2012-03-17 15:59 ` Chris Boot @ 2012-03-17 17:54 ` Chris Boot 2012-03-17 23:50 ` [E1000-devel] " Nix 2012-03-19 14:59 ` Wyborny, Carolyn 0 siblings, 2 replies; 32+ messages in thread From: Chris Boot @ 2012-03-17 17:54 UTC (permalink / raw) To: Wyborny, Carolyn; +Cc: netdev, lkml, e1000-devel On 17/03/2012 15:59, Chris Boot wrote: > On 16/01/2012 16:04, Chris Boot wrote: >> On 16/01/2012 15:56, Wyborny, Carolyn wrote: >>> >>>> -----Original Message----- >>>> From: Chris Boot [mailto:bootc@bootc.net] >>>> Sent: Sunday, January 15, 2012 3:11 AM >>>> To: Wyborny, Carolyn >>>> Cc: netdev; lkml; e1000-devel@lists.sourceforge.net >>>> Subject: Re: e1000e interface hang on 82574L >>>> >>>> On 04/01/2012 17:12, Chris Boot wrote: >>>>> On 03/01/2012 00:02, Wyborny, Carolyn wrote: >>>>>> >>>>>>> -----Original Message----- >>>>>>> From: netdev-owner@vger.kernel.org [mailto:netdev- >>>> owner@vger.kernel.org] >>>>>>> On Behalf Of Chris Boot >>>>>>> Sent: Saturday, December 31, 2011 1:32 AM >>>>>>> To: netdev; lkml; e1000-devel@lists.sourceforge.net >>>>>>> Subject: Re: e1000e interface hang on 82574L >>>>>>> >>>>>>> On 27 Dec 2011, at 22:01, Chris Boot wrote: >>>>>>> >>>>>>>> Hi folks, >>>>>>>> >>>>>>>> Another networking issue I've run into, this time with e1000e >>>> (Intel >>>>>>> Corporation 82574L Gigabit). My new VM cluster appears to drop a NIC >>>> - >>>>>>> the port stops responding within Linux and shows the link as being >>>> down >>>>>>> with ethtool. My ISP says 'Ports running Half Duplex or reduced >>>> speed' >>>>>>> on the port. 
>>>>>>>> When the port stops working I see this in dmesg: >>>>>>>> >>>>>>>> [35481.659629] ------------[ cut here ]------------ >>>>>>>> [35481.667837] WARNING: at net/sched/sch_generic.c:255 >>>>>>> dev_watchdog+0xe9/0x148() >>>>>>>> [35481.676370] Hardware name: X9SCL/X9SCM >>>>>>>> [35481.684793] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 >>>> timed >>>>>>> out >>>>>>>> [35481.684795] Modules linked in: hmac sha256_generic dlm configfs >>>>>>> ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats >>>>>>> cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode >>>>>>> xt_NOTRACK ip_set_hash_net act_police cls_basic cls_flow cls_fw >>>> cls_u32 >>>>>>> sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_connlimit >>>>>>> xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent >>>>>>> ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN >>>>>>> ipt_ecn ipt_CLUSTERIP ipt_ah nf_nat_tftp nf_nat_snmp_basic >>>>>>> nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc >>>>>>> nf_nat_h323 nf_nat_ftp ip6_queue nf_nat_amanda xt_set ip_set >>>>>>> nf_conntrack_tftp nf_conntrack_sip nf_conntrack_sane >>>>>>> nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp >>>>>>> nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns >>>>>>> nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 >>>>>>> nf_conntrack_ftp ts_kmp nf_conntrack_amanda xt_TPROXY xt_NFLOG >>>>>>> nfnetlink_log nf_tproxy_core xt_time xt_TCPMSS xt_tcpmss xt_sctp >>>>>>> xt_policy xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_multiport >>>> xt_mark >>>>>>> xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP >>>>>>> xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT ip6t_LOG >>>> ip6t_REJECT >>>>>>> nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_raw ipt_LOG >>>>>>> xt_tcpudp ip6table_mangle xt_state iptable_nat nf_nat >>>> nf_conntrack_ipv4 >>>>>>> nf_defrag_ipv4 nf_conntrack iptable_mangle 
nfnetlink iptable_filter >>>>>>> ip_tables ip6table_filter ip6_tables x_tables bridge stp bonding >>>>>>> w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel >>>>>>> aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf >>>>>>> ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache >>>> cn >>>>>>> loop kvm_intel kvm snd_pcm snd_timer snd iTCO_wdt soundcore psmouse >>>>>>> snd_page_alloc i2c_i801 i2c_core cdc_acm iTCO_vendor_support joydev >>>>>>> evdev serio_raw processor button pcspkr thermal_sys ext4 mbcache >>>> jbd2 >>>>>>> crc16 dm_mod raid1 md_mod sd_mod crc_t10dif usb_storage uas usbhid >>>> hid >>>>>>> ahci libahci libata igb ehci_hcd scsi_mod usbcore e1000e dca >>>> usb_common >>>>>>> [last unloaded: scsi_wait_scan] >>>>>>>> [35481.685740] Pid: 0, comm: swapper/4 Not tainted 3.2.0-rc6+ #4 >>>>>>>> [35481.685744] Call Trace: >>>>>>>> [35481.685746]<IRQ> [<ffffffff810467ed>] ? >>>>>>> warn_slowpath_common+0x78/0x8c >>>>>>>> [35481.685849] [<ffffffff81046899>] ? warn_slowpath_fmt+0x45/0x4a >>>>>>>> [35481.685875] [<ffffffff810aeaa0>] ? >>>>>>> perf_event_task_tick+0x166/0x1ab >>>>>>>> [35481.686018] [<ffffffff81294219>] ? netif_tx_lock+0x40/0x72 >>>>>>>> [35481.686090] [<ffffffff8129437a>] ? dev_watchdog+0xe9/0x148 >>>>>>>> [35481.686136] [<ffffffff81051e58>] ? run_timer_softirq+0x19a/0x261 >>>>>>>> [35481.686176] [<ffffffff81294291>] ? netif_tx_unlock+0x46/0x46 >>>>>>>> [35481.686215] [<ffffffff810659bb>] ? timekeeping_get_ns+0xd/0x2a >>>>>>>> [35481.686286] [<ffffffff8104bdd4>] ? __do_softirq+0xb9/0x177 >>>>>>>> [35481.686365] [<ffffffff81341d6c>] ? call_softirq+0x1c/0x30 >>>>>>>> [35481.686530] [<ffffffff8100f841>] ? do_softirq+0x3c/0x7b >>>>>>>> [35481.686580] [<ffffffff8104c03c>] ? irq_exit+0x3c/0x9a >>>>>>>> [35481.686742] [<ffffffff81023e58>] ? >>>>>>> smp_apic_timer_interrupt+0x74/0x82 >>>>>>>> [35481.686820] [<ffffffff813405de>] ? 
>>>> apic_timer_interrupt+0x6e/0x80 >>>>>>>> [35481.686826]<EOI> [<ffffffff811ddf49>] ? intel_idle+0xea/0x119 >>>>>>>> [35481.686991] [<ffffffff811ddf28>] ? intel_idle+0xc9/0x119 >>>>>>>> [35481.687051] [<ffffffff8125dce3>] ? cpuidle_idle_call+0xec/0x179 >>>>>>>> [35481.687089] [<ffffffff8100d255>] ? cpu_idle+0xa1/0xe8 >>>>>>>> [35481.687143] [<ffffffff810706ee>] ? >>>> arch_local_irq_restore+0x2/0x8 >>>>>>>> [35481.687189] [<ffffffff8132d191>] ? start_secondary+0x1d5/0x1db >>>>>>>> [35481.687234] ---[ end trace 01e9907674757948 ]--- >>>>>>>> [35481.687817] e1000e 0000:05:00.0: eth2: Reset adapter >>>>>>>> >>>>>>>> To try to regain connectivity I bring down the bond and the >>>> interface >>>>>>> (eth2), then unload e1000e. Upon loading the module again: >>>>>>>> [36021.888962] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k >>>>>>>> [36021.900258] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. >>>>>>>> [36021.911446] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, >>>> low) - >>>>>>>> IRQ 20 >>>>>>>> [36021.923204] e1000e 0000:00:19.0: setting latency timer to 64 >>>>>>>> [36021.923372] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X >>>>>>>> [36022.202737] e1000e 0000:00:19.0: eth2: (PCI >>>> Express:2.5GT/s:Width >>>>>>> x1) 00:25:90:56:ac:75 >>>>>>>> [36022.214480] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network >>>>>>> Connection >>>>>>>> [36022.227506] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: >>>>>>> FFFFFF-0FF >>>>>>>> [36022.239789] e1000e 0000:05:00.0: Disabling ASPM L0s >>>>>>>> [36022.239805] e1000e 0000:05:00.0: enabling device (0000 -> 0002) >>>>>>>> [36022.239829] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, >>>> low) - >>>>>>>> IRQ 16 >>>>>>>> [36022.239921] e1000e 0000:05:00.0: setting latency timer to 64 >>>>>>>> [36022.240963] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X >>>>>>>> [36022.240995] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X >>>>>>>> [36022.241028] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X >>>>>>>> [36022.241596] 
e1000e 0000:05:00.0: PCI INT A disabled >>>>>>>> [36022.241606] e1000e: probe of 0000:05:00.0 failed with error -2 >>>>>>>> [36022.304706] udevd[3634]: renamed network interface eth2 to eth3 >>>>>>>> >>>>>>>> I then don't get an eth2 interface. Only a reboot brings the >>>> interface >>>>>>> back. This has happened twice so far on this server in the past >>>> week, >>>>>>> both times using v3.2-rc7-3-g4962516. >>>>>>>> lspci -vnn shows: >>>>>>>> >>>>>>>> 05:00.0 Ethernet controller [0200]: Intel Corporation 82574L >>>> Gigabit >>>>>>> Network Connection [8086:10d3] >>>>>>>> Subsystem: Super Micro Computer Inc Device [15d9:0000] >>>>>>>> Flags: bus master, fast devsel, latency 0, IRQ 16 >>>>>>>> Memory at fbd00000 (32-bit, non-prefetchable) [size=128K] >>>>>>>> I/O ports at e000 [size=32] >>>>>>>> Memory at fbd20000 (32-bit, non-prefetchable) [size=16K] >>>>>>>> Capabilities: [c8] Power Management version 2 >>>>>>>> Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ >>>>>>>> Capabilities: [e0] Express Endpoint, MSI 00 >>>>>>>> Capabilities: [a0] MSI-X: Enable+ Count=5 Masked- >>>>>>>> Capabilities: [100] Advanced Error Reporting >>>>>>>> Capabilities: [140] Device Serial Number 00-25-90-ff-ff-56-ac- >>>>>>> 74 >>>>>>>> Kernel driver in use: e1000e >>>>>>> I've just had this happen on my other (identical) server with a >>>> nearly >>>>>>> identical trace. Is there anything I can do do avoid this at all or >>>> at >>>>>>> least help narrow down the problem? >>>>>>> >>>>>>> Cheers, >>>>>>> Chris >>>>>>> >>>>>>> -- >>>>>>> Chris Boot >>>>>>> bootc@bootc.net >>>>>>> >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in >>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>> Hello, >>>>>> >>>>>> Sorry for the delay in responding. We have seen some hang issues >>>> using >>>>>> MSI-X on 82574 parts. 
Can you try reloading the driver with the IntMode >>>>>> module parameter. IntMode=1 (you'll need a setting for each device in >>>>>> the system so two adapters would be IntMode=1,1). See if that changes >>>>>> the symptom you are seeing with this part. That setting will make >>>> sure >>>>>> the adapter uses MSI interrupts instead of MSI-X. >>>>> Carolyn, >>>>> >>>>> I'll give this a go next time I reproduce it. I built a new kernel >>>> with >>>>> more debugging and so far it hasn't yet triggered again... >>>> Upgrading to a more recent 3.2-rc snapshot seems to have cured the >>>> problem - I haven't had an interface stop responding since. Must have >>>> been some seemingly unrelated patch that I can't seem to locate. >>>> >>>> Cheers, >>>> Chris >>>> >>>> -- >>>> Chris Boot >>>> bootc@bootc.net >>> Thanks for letting me know Chris. For my own edification, are you >>> still configured with MSI-X? >> Carolyn, >> >> I have made no changes to my configuration to change the interrupt >> format. I see the following in dmesg at boot: >> >> [ 3.276819] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k >> [ 3.288193] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. 
>> >> [ 3.299842] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) >> -> IRQ 20 >> [ 3.299909] e1000e 0000:00:19.0: setting latency timer to 64 >> [ 3.352929] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X >> [ 3.710080] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width >> x1) 00:25:90:56:ac:75 >> [ 3.710082] e1000e 0000:00:19.0: eth2: Intel(R) PRO/1000 Network >> Connection >> [ 3.710670] e1000e 0000:00:19.0: eth2: MAC: 10, PHY: 11, PBA No: >> FFFFFF-0FF >> >> [ 3.710678] e1000e 0000:05:00.0: Disabling ASPM L0s >> [ 3.710850] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) >> -> IRQ 16 >> [ 3.710951] e1000e 0000:05:00.0: setting latency timer to 64 >> [ 3.712757] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X >> [ 3.712787] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X >> [ 3.712805] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X >> [ 3.830364] e1000e 0000:05:00.0: eth3: (PCI Express:2.5GT/s:Width >> x1) 00:25:90:56:ac:74 >> [ 3.830366] e1000e 0000:05:00.0: eth3: Intel(R) PRO/1000 Network >> Connection >> [ 3.830510] e1000e 0000:05:00.0: eth3: MAC: 3, PHY: 8, PBA No: >> FFFFFF-0FF >> >> /proc/interrupts shows: >> >> 45: 615958 0 0 0 0 0 >> 0 0 IR-PCI-MSI-edge eth3 >> 64: 65126106 0 0 0 0 0 >> 0 0 IR-PCI-MSI-edge eth2-rx-0 >> 65: 52700392 0 0 0 0 0 >> 0 0 IR-PCI-MSI-edge eth2-tx-0 >> 66: 2 0 0 0 0 0 >> 0 0 IR-PCI-MSI-edge eth2 > Carolyn, > > I've just had the opportunity to upgrade to a 3.2.9 kernel on these > systems and have made sure e1000e is loaded with IntMode=1,1. One of the > servers was only up 5.5 hours before the NIC has crashed/stopped working > again. > > Here is the latest dmesg after the failure: > > [ 3.254553] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k > [ 3.265852] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. 
> [ 3.266034] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> > IRQ 20 > [ 3.266067] e1000e 0000:00:19.0: setting latency timer to 64 > [ 3.266460] e1000e 0000:00:19.0: (unregistered net_device): Interrupt > Mode set to 1 > [ 3.266800] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X > [ 3.611840] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) > 00:25:90:56:ac:75 > [ 3.611855] e1000e 0000:00:19.0: eth2: Intel(R) PRO/1000 Network > Connection > [ 3.612303] e1000e 0000:00:19.0: eth2: MAC: 10, PHY: 11, PBA No: > FFFFFF-0FF > [ 3.612350] e1000e 0000:05:00.0: Disabling ASPM L0s > [ 3.612594] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> > IRQ 16 > [ 3.612812] e1000e 0000:05:00.0: setting latency timer to 64 > [ 3.613582] e1000e 0000:05:00.0: (unregistered net_device): Interrupt > Mode set to 1 > [ 3.614156] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X > [ 3.734442] e1000e 0000:05:00.0: eth3: (PCI Express:2.5GT/s:Width x1) > 00:25:90:56:ac:74 > [ 3.734465] e1000e 0000:05:00.0: eth3: Intel(R) PRO/1000 Network > Connection > [ 3.734689] e1000e 0000:05:00.0: eth3: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF > [ 13.799848] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X > [ 13.855646] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X > [ 14.031739] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X > [ 14.087566] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X > [ 16.112504] e1000e: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow > Control: None > [ 16.124129] e1000e 0000:05:00.0: eth2: 10/100 speed: disabling TSO > > And here is the output just as it hangs: > > [19745.327241] ------------[ cut here ]------------ > [19745.334501] WARNING: at > /build/buildd-linux-2.6_3.2.9-1-amd64-KTPapN/linux-2.6-3.2.9/debian/build/source_amd64_none/net/sched/sch_generic.c:255 > dev_watchdog+0xe9/0x148() > [19745.350441] Hardware name: X9SCL/X9SCM > [19745.358859] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out > [19745.367287] Modules linked in: hmac sha256_generic dlm configfs > ebtable_nat 
ebtables acpi_cpufreq mperf cpufreq_stats > cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode > ip6_queue xt_TCPMSS xt_sctp ip6t_LOG ip6t_REJECT nf_conntrack_ipv6 > ip6table_raw ip6table_mangle ip6table_filter xt_NOTRACK ip_set_hash_net > act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb > sch_hfsc sch_ingress sch_sfq xt_statistic xt_CT xt_time xt_connlimit > xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent > xt_policy ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE > ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp > nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp > nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp > nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip > nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp > nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns > nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 > nf_conntrack_ftp xt_TPROXY nf_tproxy_core ip6_tables nf_defrag_ipv6 > xt_tcpmss xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_NFLOG > nfnetlink_log xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange > xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_conntrack xt_connmark > xt_CLASSIFY xt_AUDIT ipt_LOG xt_tcpudp xt_state iptable_nat nf_nat > nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink > iptable_filter ip_tables x_tables kvm_intel kvm bridge stp bonding > w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel > aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf > ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache cn > loop snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt i2c_i801 > psmouse cdc_acm processor i2c_core iTCO_vendor_support serio_raw pcspkr > thermal_sys button evdev joydev ext4 mbcache jbd2 crc16 dm_mod raid1 > md_mod sd_mod crc_t10dif usb_storage uas usbhid hid ahci 
libahci libata > ehci_hcd usbcore igb scsi_mod e1000e usb_common dca [last unloaded: > scsi_wait_scan] > [19745.502559] Pid: 0, comm: swapper/0 Not tainted 3.2.0-2-amd64 #1 > [19745.502561] Call Trace: > [19745.502562]<IRQ> [<ffffffff81046879>] ? warn_slowpath_common+0x78/0x8c > [19745.502570] [<ffffffff81046925>] ? warn_slowpath_fmt+0x45/0x4a > [19745.502574] [<ffffffff8129aa11>] ? netif_tx_lock+0x40/0x72 > [19745.502588] [<ffffffff8129ab72>] ? dev_watchdog+0xe9/0x148 > [19745.502601] [<ffffffff81051f38>] ? run_timer_softirq+0x19a/0x261 > [19745.502603] [<ffffffff8129aa89>] ? netif_tx_unlock+0x46/0x46 > [19745.502606] [<ffffffff81065a73>] ? timekeeping_get_ns+0xd/0x2a > [19745.502609] [<ffffffff8104be98>] ? __do_softirq+0xb9/0x177 > [19745.502612] [<ffffffff8134892c>] ? call_softirq+0x1c/0x30 > [19745.502615] [<ffffffff8100f8e5>] ? do_softirq+0x3c/0x7b > [19745.502617] [<ffffffff8104c100>] ? irq_exit+0x3c/0x9a > [19745.502621] [<ffffffff81023f18>] ? smp_apic_timer_interrupt+0x74/0x82 > [19745.502624] [<ffffffff8134719e>] ? apic_timer_interrupt+0x6e/0x80 > [19745.502625]<EOI> [<ffffffff81070761>] ? arch_local_irq_save+0x11/0x17 > [19745.502631] [<ffffffff811e45d9>] ? intel_idle+0xea/0x119 > [19745.502633] [<ffffffff811e45b8>] ? intel_idle+0xc9/0x119 > [19745.502637] [<ffffffff812643f7>] ? cpuidle_idle_call+0xec/0x179 > [19745.502639] [<ffffffff8100d248>] ? cpu_idle+0xa5/0xf2 > [19745.502641] [<ffffffff816aab3d>] ? start_kernel+0x3bd/0x3c8 > [19745.502643] [<ffffffff816aa140>] ? early_idt_handlers+0x140/0x140 > [19745.502645] [<ffffffff816aa3c4>] ? 
x86_64_start_kernel+0x104/0x111 > [19745.502646] ---[ end trace 10e791a6f31603fa ]--- > [19745.503125] e1000e 0000:05:00.0: eth2: Reset adapter > > Once again, rmmod e1000e followed by modprobe e1000e does not fix the > problem: > > [20508.158919] e1000e 0000:05:00.0: PCI INT A disabled > [20508.194927] e1000e 0000:00:19.0: PCI INT A disabled > [20511.119765] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k > [20511.130711] e1000e: Copyright(c) 1999 - 2011 Intel Corporation. > [20511.141206] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> > IRQ 20 > [20511.151797] e1000e 0000:00:19.0: setting latency timer to 64 > [20511.151921] e1000e 0000:00:19.0: (unregistered net_device): Interrupt > Mode set to 1 > [20511.162853] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X > [20511.528436] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) > 00:25:90:56:ac:75 > [20511.539261] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network > Connection > [20511.550066] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: > FFFFFF-0FF > [20511.561027] e1000e 0000:05:00.0: Disabling ASPM L0s > [20511.571883] e1000e 0000:05:00.0: enabling device (0000 -> 0002) > [20511.575224] udevd[5449]: renamed network interface eth2 to eth3 > [20511.594234] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> > IRQ 16 > [20511.605703] e1000e 0000:05:00.0: setting latency timer to 64 > [20511.605871] e1000e 0000:05:00.0: (unregistered net_device): Interrupt > Mode set to 1 > [20511.617706] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X > [20511.617828] e1000e 0000:05:00.0: PCI INT A disabled > [20511.629565] e1000e: probe of 0000:05:00.0 failed with error -2 > > Please let me know if/how I can debug this further. As further information, I have a machine with the same NIC and chipset (Intel S1200BTL motherboard) but with quite different lspci -vvv outputs. Both are pasted below. 
First, from the working S1200BTL: 03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection Subsystem: Intel Corporation Device 3578 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 16 Region 0: Memory at c1300000 (32-bit, non-prefetchable) [size=128K] Region 2: I/O ports at 2000 [size=32] Region 3: Memory at c1320000 (32-bit, non-prefetchable) [size=16K] Capabilities: [c8] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [e0] Express (v1) Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+ RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [a0] MSI-X: Enable+ Count=5 Masked- Vector table: BAR=3 offset=00000000 PBA: BAR=3 offset=00002000 Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ 
RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- Capabilities: [140 v1] Device Serial Number 00-1e-67-ff-ff-14-69-f4 Kernel driver in use: e1000e And from the Supermicro server, where the NIC hangs: 05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection Subsystem: Super Micro Computer Inc Device 0000 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 65 Region 0: Memory at fbd00000 (32-bit, non-prefetchable) [size=128K] Region 2: I/O ports at e000 [size=32] Region 3: Memory at fbd20000 (32-bit, non-prefetchable) [size=16K] Capabilities: [c8] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee00858 Data: 0000 Capabilities: [e0] Express (v1) Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+ RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [a0] MSI-X: Enable- Count=5 Masked- Vector table: 
BAR=3 offset=00000000 PBA: BAR=3 offset=00002000 Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout- NonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn- Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-56-ac-74 Kernel driver in use: e1000e Most notably it appears as though MSI-X is not enabled on the Supermicro, and ASPM L1 is. There appears to be no difference on the Supermicro as to the MSI-X status when booting with IntMode=1,1 compared to without it. Thanks, Chris ^ permalink raw reply [flat|nested] 32+ messages in thread
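For anyone comparing dumps like the two above, a rough sketch of pulling the ASPM state out of saved `lspci -vvv` text. It relies only on the "LnkCtl: ASPM ...;" wording visible in the output pasted in this message; the helper name aspm_state is made up for illustration, and other lspci versions may word the line differently.

```python
import re

# Matches the link-control line as it appears in the dumps in this thread,
# e.g. "LnkCtl: ASPM L1 Enabled; RCB 64 bytes ..." or "LnkCtl: ASPM Disabled; ...".
LNKCTL_RE = re.compile(r"LnkCtl:\s*ASPM ([^;]+);")


def aspm_state(lspci_text):
    """Return the ASPM state string found in `lspci -vvv` text, or None."""
    match = LNKCTL_RE.search(lspci_text)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    working = "LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+"
    hanging = "LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+"
    print("S1200BTL:  ", aspm_state(working))
    print("Supermicro:", aspm_state(hanging))
```

Running it over saved output from both machines makes the "ASPM Disabled" versus "ASPM L1 Enabled" difference easy to spot without reading the whole dump.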
* Re: [E1000-devel] e1000e interface hang on 82574L 2012-03-17 17:54 ` Chris Boot @ 2012-03-17 23:50 ` Nix 2012-03-19 14:59 ` Wyborny, Carolyn 1 sibling, 0 replies; 32+ messages in thread From: Nix @ 2012-03-17 23:50 UTC (permalink / raw) To: Chris Boot; +Cc: Wyborny, Carolyn, e1000-devel, netdev, lkml On 17 Mar 2012, Chris Boot verbalised: > Most notably it appears as though MSI-X is not enabled on the > Supermicro, and ASPM L1 is. There appears to be no difference on the > Supermicro as to the MSI-X status when booting with IntMode=1,1 compared > to without it. This bug is an ASPM bug, not an MSI bug, and has been present in the in-kernel drivers since something like 2.6.36. I reported it a rather long time ago to the e1000e bugzilla: <http://sourceforge.net/tracker/index.php?func=detail&aid=3170405&group_id=42302&atid=447449> but then I got a severe attack of forgetfulness and forgot what bz it was on until this post prodded me into finding it again. (And then kernel.org was penetrated and I didn't even bother looking, because of course I reported it to the offlined kernel bz, right? No, I didn't.) I really should follow up on it now and ask the kernel PCI hackers to suggest reasons why ASPM might be getting magically re-enabled at around the same time as the interface is brought up. (Disabling ASPM via setpci at boot doesn't help if the interface hasn't stabilized before that point.) I haven't done much printf()-scattering to try to track it down because rebooting this machine is quite annoying: it's the heart of my network, my damn-near-everything-server and the machine on which all my work virtual machines run, so rebooting it means disappearing from work for some time while the reboot happens... (but of course this is a really pathetic excuse because I could have devoted a weekend to it or something. So add laziness to my sins.) 
So currently I'm doing setpci -s 02:00.0 CAP_EXP+10.b=40 setpci -s 03:00.0 CAP_EXP+10.b=40 in a root shell to force ASPM off on my two 82574Ls after every boot. It is quite annoying, but 'solves' the problem (for a very crap value of 'solves'). -- NULL && (void) ^ permalink raw reply [flat|nested] 32+ messages in thread
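A rough sketch of what the byte written by the setpci commands above means. CAP_EXP+10 is the Link Control register of the PCI Express capability: in its low byte, bits 1:0 are the ASPM control field (00 disabled, 01 L0s, 10 L1, 11 both) and bit 6 is Common Clock Configuration. Writing 0x40 therefore clears both ASPM bits and leaves CommClk set. The helper name decode_lnkctl is made up for illustration, and note that a blind byte write like this also overwrites the other bits in that byte.

```python
# Decode the low byte of the PCIe Link Control register (CAP_EXP+0x10).
ASPM_STATES = {0b00: "disabled", 0b01: "L0s", 0b10: "L1", 0b11: "L0s+L1"}


def decode_lnkctl(value):
    """Return (aspm_state, common_clock) for a Link Control low byte."""
    aspm = ASPM_STATES[value & 0x03]   # bits 1:0: ASPM control
    common_clock = bool(value & 0x40)  # bit 6: Common Clock Configuration
    return aspm, common_clock


if __name__ == "__main__":
    # 0x40 is the value written in this thread; 0x42 and 0x43 are the same
    # byte with L1 or L0s+L1 still enabled.
    for raw in (0x40, 0x42, 0x43):
        print(f"{raw:#04x} -> {decode_lnkctl(raw)}")
```

Reading the byte back with setpci (same register, no "=value") and decoding it this way is one way to see whether something has turned ASPM back on.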
* RE: e1000e interface hang on 82574L 2012-03-17 17:54 ` Chris Boot 2012-03-17 23:50 ` [E1000-devel] " Nix @ 2012-03-19 14:59 ` Wyborny, Carolyn 2012-03-19 16:19 ` [E1000-devel] " Nix 2012-04-23 21:29 ` [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled Chris Boot 1 sibling, 2 replies; 32+ messages in thread From: Wyborny, Carolyn @ 2012-03-19 14:59 UTC (permalink / raw) To: Chris Boot; +Cc: netdev, lkml, e1000-devel >-----Original Message----- >From: Chris Boot [mailto:bootc@bootc.net] >Sent: Saturday, March 17, 2012 10:54 AM >To: Wyborny, Carolyn >Cc: netdev; lkml; e1000-devel@lists.sourceforge.net >Subject: Re: e1000e interface hang on 82574L [...] >> Carolyn, >> >> I've just had the opportunity to upgrade to a 3.2.9 kernel on these >> systems and have made sure e1000e is loaded with IntMode=1,1. One of >the >> servers was only up 5.5 hours before the NIC has crashed/stopped >working >> again. Hello Chris, The ASPM problem with 82574L is hardware based and is not solvable in software other than to disable it. Since the platforms vary in their reliability in disabling the feature from the driver, your best option is to always boot with pcie_aspm=off with that part in the system. [...] >Most notably it appears as though MSI-X is not enabled on the >Supermicro, and ASPM L1 is. There appears to be no difference on the >Supermicro as to the MSI-X status when booting with IntMode=1,1 compared >to without it. > >Thanks, >Chris So, at least we are clear in your situation, the ASPM needs to be disabled. Please let me know if there are continued problems after booting with pcie_aspm=off. 
Thanks, Carolyn Carolyn Wyborny Linux Development LAN Access Division Intel Corporation ^ permalink raw reply [flat|nested] 32+ messages in thread
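A small sketch for confirming that the pcie_aspm=off suggestion above actually reached the running kernel. On Linux the boot command line is readable from /proc/cmdline; the helper name has_kernel_param and the sample command line below are illustrative only, not taken from any machine in this thread.

```python
def has_kernel_param(cmdline, param):
    """True if `param` appears as a whole word on the kernel command line."""
    return param in cmdline.split()


if __name__ == "__main__":
    # On a live system, read the text from /proc/cmdline instead of
    # using this made-up sample string.
    sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro pcie_aspm=off"
    print(has_kernel_param(sample, "pcie_aspm=off"))
```

This only shows that the parameter was passed at boot; whether ASPM is really off on a given link still has to be checked in the device's LnkCtl line in lspci.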
* Re: [E1000-devel] e1000e interface hang on 82574L 2012-03-19 14:59 ` Wyborny, Carolyn @ 2012-03-19 16:19 ` Nix 2012-03-19 16:29 ` Wyborny, Carolyn 2012-04-23 21:29 ` [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled Chris Boot 1 sibling, 1 reply; 32+ messages in thread From: Nix @ 2012-03-19 16:19 UTC (permalink / raw) To: Wyborny, Carolyn; +Cc: Chris Boot, e1000-devel, netdev, lkml On 19 Mar 2012, Carolyn Wyborny stated: > So, at least we are clear in your situation, the ASPM needs to be > disabled. Please let me know if there are continued problems after > booting with pcie_aspm=off. If you look further down in <http://sourceforge.net/tracker/index.php?func=detail&aid=3170405&group_id=42302&atid=447449> you'll see that I tested that, and it doesn't work :( even if it did work, it shouldn't be needed: the driver attempts to turn off PCIe ASPM on affected NICs, and fails, apparently because *something* turns it back on again. -- NULL && (void) ^ permalink raw reply [flat|nested] 32+ messages in thread
* RE: [E1000-devel] e1000e interface hang on 82574L 2012-03-19 16:19 ` [E1000-devel] " Nix @ 2012-03-19 16:29 ` Wyborny, Carolyn 2012-03-19 17:31 ` Nix 0 siblings, 1 reply; 32+ messages in thread From: Wyborny, Carolyn @ 2012-03-19 16:29 UTC (permalink / raw) To: Nix; +Cc: Chris Boot, e1000-devel, netdev, lkml >-----Original Message----- >From: Nix [mailto:nix@esperi.org.uk] >Sent: Monday, March 19, 2012 9:20 AM >To: Wyborny, Carolyn >Cc: Chris Boot; e1000-devel@lists.sourceforge.net; netdev; lkml >Subject: Re: [E1000-devel] e1000e interface hang on 82574L > [...] >you'll see that I tested that, and it doesn't work :( even if it did >work, it shouldn't be needed: the driver attempts to turn off PCIe ASPM >on affected NICs, and fails, apparently because *something* turns it >back on again. > >-- >NULL && (void) The driver attempts to disable L0s state, not the entire feature. It is also required that the device upstream on the bus from the 82574L have this disabled. Yes, I agree there appears to be something in the OS that either re-enables or fails to disable the feature on the upstream device, as desired. Platforms/systems also appear to vary in this regard, so the solutions may vary a bit as well. It's worth trying your solution as well if what I suggested doesn't work, but there is not one solution that fits all, unfortunately. Thanks, Carolyn Carolyn Wyborny Linux Development LAN Access Division Intel Corporation ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [E1000-devel] e1000e interface hang on 82574L 2012-03-19 16:29 ` Wyborny, Carolyn @ 2012-03-19 17:31 ` Nix 2012-04-06 10:17 ` Chris Boot 0 siblings, 1 reply; 32+ messages in thread From: Nix @ 2012-03-19 17:31 UTC (permalink / raw) To: Wyborny, Carolyn; +Cc: Chris Boot, e1000-devel, netdev, lkml On 19 Mar 2012, Carolyn Wyborny said: >>you'll see that I tested that, and it doesn't work :( even if it did >>work, it shouldn't be needed: the driver attempts to turn off PCIe ASPM >>on affected NICs, and fails, apparently because *something* turns it >>back on again. >> > The driver attempts to disable L0s state, not the entire feature. It It tries to disable L1 state as well (or it did when I tested this last, although I suspect you're right and it may leave L1 turned on these days: judging by the contents of e1000_82574_info, anyway.) > is also required that the device upstream on the bus from the 82574L > have this disabled. Yes, I agree there appears to be something in the > OS that either re-enables or fails to disable the feature on the > upstream device, as desired. Platforms/systems also appear to vary in > this regard, so the solutions may vary a bit as well. > > It's worth trying your solution as well if what I suggested doesn't > work, but there is not one solution that fits all, unfortunately. I don't *have* a solution. :( 'setpci by hand some unknown amount of time after booting once the interface has stabilized' hardly counts as a solution of any sort. It's, at best, a workaround that lets me use my systems without hourly lockups until a real solution is found. (To clarify: manual setpci to force off the ASPM bits is the only thing that works for me. The driver's automatic disabling of L0s and L1 doesn't work: nor does booting with pcie_aspm=off. In both cases, I end up with both L0s and L1 turned on, and a lockup some time later, unless I setpci the bits off by hand.) -- NULL && (void) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [E1000-devel] e1000e interface hang on 82574L 2012-03-19 17:31 ` Nix @ 2012-04-06 10:17 ` Chris Boot 2012-04-06 12:12 ` Bjorn Helgaas 0 siblings, 1 reply; 32+ messages in thread From: Chris Boot @ 2012-04-06 10:17 UTC (permalink / raw) To: Nix; +Cc: Wyborny, Carolyn, e1000-devel, netdev, lkml, Bjorn Helgaas, linux-pci On 19 Mar 2012, at 17:31, Nix wrote: > On 19 Mar 2012, Carolyn Wyborny said: > >>> you'll see that I tested that, and it doesn't work :( even if it did >>> work, it shouldn't be needed: the driver attempts to turn off PCIe ASPM >>> on affected NICs, and fails, apparently because *something* turns it >>> back on again. >>> >> The driver attempts to disable L0s state, not the entire feature. It > > It tries to disable L1 state as well (or it did when I tested this last, > although I suspect you're right and it may leave L1 turned on these > days: judging by the contents of e1000_82574_info, anyway.) > >> is also required that the device upstream on the bus from the 82574L >> have this disabled. Yes, I agree there appears to be something in the >> OS that either re-enables or fails to disable the feature on the >> upstream device, as desired. Platforms/systems also appear to vary in >> this regard, so the solutions may vary a bit as well. >> >> It's worth trying your solution as well if what I suggested doesn't >> work, but there is not one solution that fits all, unfortunately. > > I don't *have* a solution. :( 'setpci by hand some unknown amount of > time after booting once the interface has stabilized' hardly counts as a > solution of any sort. It's, at best, a workaround that lets me use my > systems without hourly lockups until a real solution is found. > > (To clarify: manual setpci to force off the ASPM bits is the only thing > that works for me. The driver's automatic disabling of L0s and L1 > doesn't work: nor does booting with pcie_aspm=off. 
In both cases, I end > up with both L0s and L1 turned on, and a lockup some time later, unless > I setpci the bits off by hand.) Well, with that setpci incantation run against the NIC and its upstream device to disable ASPM L1s (setpci -s <dev> CAP_EXP+10.b=40), everything has been working very well indeed. Is there something the e1000e driver could do to disable L1s as well as L0s if we know there's a problem with them for these devices? Adding Bjorn Helgaas and linux-pci to CCs to try to get the ball rolling some more, as this is crippling without the fixes. Cheers, Chris -- Chris Boot bootc@bootc.net ^ permalink raw reply [flat|nested] 32+ messages in thread
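Since the workaround above has to be applied to the NIC and to its upstream device, here is a minimal sketch of finding that upstream device from sysfs. On Linux, /sys/bus/pci/devices/<address> is a symlink whose resolved path nests each device under its parent bridge, so the directory one level up names the upstream port. The helper name upstream_of is made up for illustration, and the bridge address 0000:00:1c.4 in the example is an assumption, not something reported in this thread.

```python
import os
import re

# PCI address as sysfs prints it: domain:bus:device.function (lowercase hex).
BDF_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def upstream_of(resolved_sysfs_path):
    """Return the address of the device one level up, or None at a root bus."""
    parent = os.path.basename(os.path.dirname(resolved_sysfs_path.rstrip("/")))
    return parent if BDF_RE.match(parent) else None


if __name__ == "__main__":
    # On a live system the path would come from:
    #   os.path.realpath("/sys/bus/pci/devices/0000:05:00.0")
    # The bridge address below is a made-up example.
    example = "/sys/devices/pci0000:00/0000:00:1c.4/0000:05:00.0"
    print(upstream_of(example))
```

The address it returns is the one to pass as the second `setpci -s <dev>` target.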
* Re: [E1000-devel] e1000e interface hang on 82574L 2012-04-06 10:17 ` Chris Boot @ 2012-04-06 12:12 ` Bjorn Helgaas 2012-04-06 13:41 ` Henrique de Moraes Holschuh 2012-04-06 16:04 ` Nix 0 siblings, 2 replies; 32+ messages in thread From: Bjorn Helgaas @ 2012-04-06 12:12 UTC (permalink / raw) To: Chris Boot Cc: Nix, Wyborny, Carolyn, e1000-devel, netdev, lkml, linux-pci, Matthew Garrett On Fri, Apr 6, 2012 at 4:17 AM, Chris Boot <bootc@bootc.net> wrote: > On 19 Mar 2012, at 17:31, Nix wrote: > >> On 19 Mar 2012, Carolyn Wyborny said: >> >>>> you'll see that I tested that, and it doesn't work :( even if it did >>>> work, it shouldn't be needed: the driver attempts to turn off PCIe ASPM >>>> on affected NICs, and fails, apparently because *something* turns it >>>> back on again. >>>> >>> The driver attempts to disable L0s state, not the entire feature. It >> >> It tries to disable L1 state as well (or it did when I tested this last, >> although I suspect you're right and it may leave L1 turned on these >> days: judging by the contents of e1000_82574_info, anyway.) >> >>> is also required that the device upstream on the bus from the 82574L >>> have this disabled. Yes, I agree there appears to be something in the >>> OS that either re-enables or fails to disable the feature on the >>> upstream device, as desired. Platforms/systems also appear to vary in >>> this regard, so the solutions may vary a bit as well. >>> >>> It's worth trying your solution as well if what I suggested doesn't >>> work, but there is not one solution that fits all, unfortunately. >> >> I don't *have* a solution. :( 'setpci by hand some unknown amount of >> time after booting once the interface has stabilized' hardly counts as a >> solution of any sort. It's, at best, a workaround that lets me use my >> systems without hourly lockups until a real solution is found. >> >> (To clarify: manual setpci to force off the ASPM bits is the only thing >> that works for me. 
The driver's automatic disabling of L0s and L1 >> doesn't work: nor does booting with pcie_aspm=off. In both cases, I end >> up with both L0s and L1 turned on, and a lockup some time later, unless >> I setpci the bits off by hand.) > > > Well, with that setpci incantation run against the NIC and its upstream device to disable ASPM L1s (setpci -s <dev> CAP_EXP+10.b=40), everything has been working very well indeed. Is there something the e1000e driver could do to disable L1s as well as L0s if we know there's a problem with them for these devices? > > Adding Bjorn Helgaas and linux-pci to CCs to try to get the ball rolling some more, as this is crippling without the fixes. [+cc Matthew Garrett for ASPM stuff] If I understand correctly, e1000e attempts to disable ASPM to work around an 82574L hardware erratum, but the PCI core either doesn't disable ASPM or it gets re-enabled somehow. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [E1000-devel] e1000e interface hang on 82574L 2012-04-06 12:12 ` Bjorn Helgaas @ 2012-04-06 13:41 ` Henrique de Moraes Holschuh 2012-04-06 13:48 ` Chris Boot 2012-04-06 16:05 ` Nix 2012-04-06 16:04 ` Nix 1 sibling, 2 replies; 32+ messages in thread From: Henrique de Moraes Holschuh @ 2012-04-06 13:41 UTC (permalink / raw) To: Bjorn Helgaas Cc: Chris Boot, Nix, Wyborny, Carolyn, e1000-devel, netdev, lkml, linux-pci, Matthew Garrett On Fri, 06 Apr 2012, Bjorn Helgaas wrote: > On Fri, Apr 6, 2012 at 4:17 AM, Chris Boot <bootc@bootc.net> wrote: > > On 19 Mar 2012, at 17:31, Nix wrote: > > > >> On 19 Mar 2012, Carolyn Wyborny said: > >> > >>>> you'll see that I tested that, and it doesn't work :( even if it > >>>> did work, it shouldn't be needed: the driver attempts to turn off > >>>> PCIe ASPM on affected NICs, and fails, apparently because > >>>> *something* turns it back on again. > >>>> > >>> The driver attempts to disable L0s state, not the entire feature. > >>> It > >> > >> It tries to disable L1 state as well (or it did when I tested this > >> last, although I suspect you're right and it may leave L1 turned on > >> these days: judging by the contents of e1000_82574_info, anyway.) > >> > >>> is also required that the device upstream on the bus from the > >>> 82574L have this disabled. Yes, I agree there appears to be > >>> something in the OS that either re-enables or fails to disable > >>> the feature on the upstream device, as desired. Platforms/systems > >>> also appear to vary in this regard, so the solutions may vary a > >>> bit as well. > >>> > >>> It's worth trying your solution as well if what I suggested doesn't > >>> work, but there is not one solution that fits all, unfortunately. > >> > >> I don't *have* a solution. :( 'setpci by hand some unknown amount > >> of time after booting once the interface has stabilized' hardly > >> counts as a solution of any sort. 
It's, at best, a workaround that > >> lets me use my systems without hourly lockups until a real solution > >> is found. > >> > >> (To clarify: manual setpci to force off the ASPM bits is the only > >> thing that works for me. The driver's automatic disabling of L0s > >> and L1 doesn't work: nor does booting with pcie_aspm=off. In both > >> cases, I end up with both L0s and L1 turned on, and a lockup some > >> time later, unless I setpci the bits off by hand.) > > > > > > Well, with that setpci incantation run against the NIC and its > > upstream device to disable ASPM L1s (setpci -s <dev> > > CAP_EXP+10.b=40), everything has been working very well indeed. Is > > there something the e1000e driver could do to disable L1s as well as > > L0s if we know there's a problem with them for these devices? > > > > Adding Bjorn Helgaas and linux-pci to CCs to try to get the ball > > rolling some more, as this is crippling without the fixes. > > [+cc Matthew Garrett for ASPM stuff] > > If I understand correctly, e1000e attempts to disable ASPM to work > around an 82574L hardware erratum, but the PCI core either doesn't > disable ASPM or it gets re-enabled somehow. You probably need to disable it upstream of the 82574L as well. Here (SuperMicro C7X58) I managed to get it to be stable by telling the BIOS to disable L0s and L1 system-wide. But not all BIOSes will have that option... -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [E1000-devel] e1000e interface hang on 82574L
  2012-04-06 13:41 ` Henrique de Moraes Holschuh
@ 2012-04-06 13:48 ` Chris Boot
  2012-04-06 16:05 ` Nix
  1 sibling, 0 replies; 32+ messages in thread
From: Chris Boot @ 2012-04-06 13:48 UTC
To: Henrique de Moraes Holschuh
Cc: Bjorn Helgaas, Nix, Wyborny, Carolyn, e1000-devel, netdev, lkml, linux-pci, Matthew Garrett

On 6 Apr 2012, at 14:41, Henrique de Moraes Holschuh <hmh@hmh.eng.br> wrote:

> On Fri, 06 Apr 2012, Bjorn Helgaas wrote:
>> On Fri, Apr 6, 2012 at 4:17 AM, Chris Boot <bootc@bootc.net> wrote:
>>> On 19 Mar 2012, at 17:31, Nix wrote:
>>>
>>>> On 19 Mar 2012, Carolyn Wyborny said:
>>>>
>>>>>> you'll see that I tested that, and it doesn't work :( even if it
>>>>>> did work, it shouldn't be needed: the driver attempts to turn off
>>>>>> PCIe ASPM on affected NICs, and fails, apparently because
>>>>>> *something* turns it back on again.
>>>>>>
>>>>> The driver attempts to disable L0s state, not the entire feature.
>>>>> It
>>>>
>>>> It tries to disable L1 state as well (or it did when I tested this
>>>> last, although I suspect you're right and it may leave L1 turned on
>>>> these days: judging by the contents of e1000_82574_info, anyway.)
>>>>
>>>>> is also required that the device upstream on the bus from the
>>>>> 82574L have this disabled. Yes, I agree there appears to be
>>>>> something in the os that either ren-enables or fails to disable
>>>>> the feature on the upstream device, as desired. Platforms/systems
>>>>> also appear to vary in this regard, so the solutions may vary a
>>>>> bit as well.
>>>>>
>>>>> Its worth trying your solution as well if what I suggested doesn't
>>>>> work, but there is not one solution that fits all, unfortunately.
>>>>
>>>> I don't *have* a solution. :( 'setpci by hand some unknown amount
>>>> of time after booting once the interface has stabilized' hardly
>>>> counts as a solution of any sort. It's, at best, a workaround that
>>>> lets me use my systems without hourly lockups until a real solution
>>>> is found.
>>>>
>>>> (To clarify: manual setpci to force off the ASPM bits is the only
>>>> thing that works for me. The driver's automatic disabling of L0s
>>>> and L1 doesn't work: nor does booting with pcie_aspm=off. In both
>>>> cases, I end up with both L0s and L1 turned on, and a lockup some
>>>> time later, unless I setpci the bits off by hand.)
>>>
>>>
>>> Well, with that setpci incantation run against the NIC and its
>>> upstream device to disable ASPM L1s (setpci -s <dev>
>>> CAP_EXP+10.b=40), everything has been working very well indeed. Is
>>> there something the e1000e driver could do to disable L1s as well as
>>> L0s if we know there's a problem with them for these devices?
>>>
>>> Adding Bjorn Helgaas and linux-pci to CCs to try to get the ball
>>> rolling some more, as this is crippling without the fixes.
>>
>> [+cc Matthew Garrett for ASPM stuff]
>>
>> If I understand correctly, e1000e attempts to disable ASPM to work
>> around an 82574L hardware erratum, but the PCI core either doesn't
>> disable ASPM or it gets re-enabled somehow.
>
> You probably need to disable it upstream of the 82574L as well. Here
> (SuperMicro C7X58) I managed to get it to be stable by telling the BIOS
> to disable L0s and L1 system-wide.
>
> But not all BIOSes will have that option...

This is not something I can really do as ASPM makes a real difference
to power consumption across the system, and I have a strict power
budget to adhere to (else I will be charged more to host my servers).
Disabling it for the NIC and upstream device is enough to make it
stable, and doesn't increase power consumption by enough to matter.

The driver seems to disable ASPM L0s just fine, but L1s are not
disabled on the NIC nor are they on the upstream device. If e1000e
can't do it maybe we can do so using a PCI quirk or something?

Cheers,
Chris

--
Chris Boot
bootc@bootc.net

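For reference, the byte written by the setpci incantation quoted above can be derived rather than hard-coded. What follows is a minimal sketch, assuming only the PCI Express Link Control register layout (ASPM Control in bits 1:0 of the register at CAP_EXP+0x10). The macro and function names are illustrative and are not taken from the kernel or from pciutils.

```c
#include <stdint.h>

/* ASPM Control field of the PCIe Link Control register (bits 1:0). */
#define LNKCTL_ASPM_L0S  0x01u  /* bit 0: L0s entry enabled */
#define LNKCTL_ASPM_L1   0x02u  /* bit 1: L1 entry enabled */
#define LNKCTL_ASPM_MASK (LNKCTL_ASPM_L0S | LNKCTL_ASPM_L1)

/* Hypothetical helper: return the Link Control low byte with the
 * requested ASPM states cleared and every other bit left untouched. */
static inline uint8_t lnkctl_clear_aspm(uint8_t lnkctl, uint8_t states)
{
	return (uint8_t)(lnkctl & ~(states & LNKCTL_ASPM_MASK));
}
```

A register that reads back as 0x43 (L0s and L1 enabled, common clock bit set) becomes 0x40 after clearing both states, which is the value the incantation writes. Writing a fixed 40 also overwrites the other bits of that byte, so a read-modify-write like this is the more cautious form; setpci's value:mask syntax (for example CAP_EXP+10.b=0:3) should have the same effect, though that has not been tested here.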
* Re: [E1000-devel] e1000e interface hang on 82574L
  2012-04-06 13:41 ` Henrique de Moraes Holschuh
  2012-04-06 13:48 ` Chris Boot
@ 2012-04-06 16:05 ` Nix
  1 sibling, 0 replies; 32+ messages in thread
From: Nix @ 2012-04-06 16:05 UTC
To: Henrique de Moraes Holschuh
Cc: Bjorn Helgaas, Chris Boot, Wyborny, Carolyn, e1000-devel, netdev, lkml, linux-pci, Matthew Garrett

On 6 Apr 2012, Henrique de Moraes Holschuh outgrape:

> You probably need to disable it upstream of the 82574L as well. Here
> (SuperMicro C7X58) I managed to get it to be stable by telling the BIOS
> to disable L0s and L1 system-wide.
>
> But not all BIOSes will have that option...

Indeed not :( the Tyan S7010 mobo in my headless server can't do it.

--
NULL && (void)

* Re: [E1000-devel] e1000e interface hang on 82574L
  2012-04-06 12:12 ` Bjorn Helgaas
  2012-04-06 13:41 ` Henrique de Moraes Holschuh
@ 2012-04-06 16:04 ` Nix
  1 sibling, 0 replies; 32+ messages in thread
From: Nix @ 2012-04-06 16:04 UTC
To: Bjorn Helgaas
Cc: Chris Boot, Wyborny, Carolyn, e1000-devel, netdev, lkml, linux-pci, Matthew Garrett

On 6 Apr 2012, Bjorn Helgaas outgrape:

> If I understand correctly, e1000e attempts to disable ASPM to work
> around an 82574L hardware erratum, but the PCI core either doesn't
> disable ASPM or it gets re-enabled somehow.

It gets re-enabled. If you explicitly do a setpci in the boot process
to turn ASPM off on the interface, after doing your 'ip link up' and
routing initialization, by the end of the boot process ASPM is back on
again. I speculate that the stabilization of the interface (as
indicated by the link-enabled message) has somehow flipped ASPM on, but
I have no actual evidence for when this re-enabling happens. I just
know it does.

--
NULL && (void)

* [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled
  2012-03-19 14:59 ` Wyborny, Carolyn
  2012-03-19 16:19 ` [E1000-devel] " Nix
@ 2012-04-23 21:29 ` Chris Boot
  2012-04-23 21:29 ` [PATCH 1/2] e1000e: Disable ASPM L1 on 82574 Chris Boot
  ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Chris Boot @ 2012-04-23 21:29 UTC
To: e1000-devel, netdev; +Cc: linux-kernel, nix, carolyn.wyborny, Chris Boot

After much toing and froing on LKML, netdev and the e1000 mailing lists
over the past few months we've determined that the 82574L needs to have
both ASPM L0s and L1 disabled or else it's likely to lock up.

This little series does just that, also cleaning up some now-unnecessary
code that disables L1 on the 82573 and 82574 if the MTU is greater than
1500 bytes.

Please note I haven't yet tested this code at all, but I do know that
disabling ASPM L1 on these NICs (using setpci) fixes the hangs that I
have been seeing on my Supermicro servers with X9SCL-F boards. I hope to
get the chance to install an updated kernel on my two affected servers
later this week.

Chris Boot (2):
  e1000e: Disable ASPM L1 on 82574
  e1000e: Remove special case for 82573/82574 ASPM L1 disablement

 drivers/net/ethernet/intel/e1000e/82571.c  | 3 ++-
 drivers/net/ethernet/intel/e1000e/netdev.c | 8 --------
 2 files changed, 2 insertions(+), 9 deletions(-)

--
1.7.10

* [PATCH 1/2] e1000e: Disable ASPM L1 on 82574
  2012-04-23 21:29 ` [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled Chris Boot
@ 2012-04-23 21:29 ` Chris Boot
  2012-04-23 23:18 ` [E1000-devel] " Jeff Kirsher
  ` (2 more replies)
  2012-04-23 21:29 ` [PATCH 2/2] e1000e: Remove special case for 82573/82574 ASPM L1 disablement Chris Boot
  2012-04-23 23:11 ` [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled Jesse Brandeburg
  2 siblings, 3 replies; 32+ messages in thread
From: Chris Boot @ 2012-04-23 21:29 UTC
To: e1000-devel, netdev; +Cc: linux-kernel, nix, carolyn.wyborny, Chris Boot

ASPM on the 82574 causes trouble. Currently the driver disables L0s for
this NIC but only disables L1 if the MTU is >1500. This patch simply
causes L1 to be disabled regardless of the MTU setting.

Signed-off-by: Chris Boot <bootc@bootc.net>
Cc: "Wyborny, Carolyn" <carolyn.wyborny@intel.com>
Cc: Nix <nix@esperi.org.uk>
Link: https://lkml.org/lkml/2012/3/19/362
---
 drivers/net/ethernet/intel/e1000e/82571.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/82571.c b/drivers/net/ethernet/intel/e1000e/82571.c
index b3fdc69..c6d95f2 100644
--- a/drivers/net/ethernet/intel/e1000e/82571.c
+++ b/drivers/net/ethernet/intel/e1000e/82571.c
@@ -2061,8 +2061,9 @@ const struct e1000_info e1000_82574_info = {
 		| FLAG_HAS_SMART_POWER_DOWN
 		| FLAG_HAS_AMT
 		| FLAG_HAS_CTRLEXT_ON_LOAD,
-	.flags2 = FLAG2_CHECK_PHY_HANG
+	.flags2 = FLAG2_CHECK_PHY_HANG
 		| FLAG2_DISABLE_ASPM_L0S
+		| FLAG2_DISABLE_ASPM_L1
 		| FLAG2_NO_DISABLE_RX,
 	.pba = 32,
 	.max_hw_frame_size = DEFAULT_JUMBO,
--
1.7.10

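The effect of adding the flag can be pictured roughly as follows. This is a simplified sketch and not the driver's actual code: the bit values are placeholders chosen for the example, and aspm_states_to_disable() is a hypothetical helper. It only illustrates how per-device flags2 bits could be folded into a mask of link states to hand to something like e1000e_disable_aspm().

```c
#include <stdint.h>

/* Placeholder bit values for illustration only; the real definitions
 * live in the e1000e and PCI headers and may differ. */
#define FLAG2_DISABLE_ASPM_L0S (1u << 0)
#define FLAG2_DISABLE_ASPM_L1  (1u << 1)
#define PCIE_LINK_STATE_L0S    1u
#define PCIE_LINK_STATE_L1     2u

/* Hypothetical helper: translate a device's flags2 word into the set of
 * ASPM link states that should be disabled for it. */
static inline uint16_t aspm_states_to_disable(uint32_t flags2)
{
	uint16_t states = 0;

	if (flags2 & FLAG2_DISABLE_ASPM_L0S)
		states |= PCIE_LINK_STATE_L0S;
	if (flags2 & FLAG2_DISABLE_ASPM_L1)
		states |= PCIE_LINK_STATE_L1;
	return states;
}
```

Under these assumptions, an info struct carrying only FLAG2_DISABLE_ASPM_L0S asks for L0s alone to be turned off, while one carrying both flags, as the 82574 entry does after this patch, asks for L0s and L1 together.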
* Re: [E1000-devel] [PATCH 1/2] e1000e: Disable ASPM L1 on 82574
  2012-04-23 21:29 ` [PATCH 1/2] e1000e: Disable ASPM L1 on 82574 Chris Boot
@ 2012-04-23 23:18 ` Jeff Kirsher
  2012-04-24 11:08 ` Nix
  2012-06-01 21:17 ` Chris Boot
  2 siblings, 0 replies; 32+ messages in thread
From: Jeff Kirsher @ 2012-04-23 23:18 UTC
To: Chris Boot; +Cc: e1000-devel, netdev, nix, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 617 bytes --]

On Mon, 2012-04-23 at 22:29 +0100, Chris Boot wrote:
>
> ASPM on the 82574 causes trouble. Currently the driver disables L0s for
> this NIC but only disables L1 if the MTU is >1500. This patch simply
> causes L1 to be disabled regardless of the MTU setting.
>
> Signed-off-by: Chris Boot <bootc@bootc.net>
> Cc: "Wyborny, Carolyn" <carolyn.wyborny@intel.com>
> Cc: Nix <nix@esperi.org.uk>
> Link: https://lkml.org/lkml/2012/3/19/362
> ---
>  drivers/net/ethernet/intel/e1000e/82571.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

I have added the patch to my queue, thanks Chris!

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

* Re: [PATCH 1/2] e1000e: Disable ASPM L1 on 82574
  2012-04-23 21:29 ` [PATCH 1/2] e1000e: Disable ASPM L1 on 82574 Chris Boot
  2012-04-23 23:18 ` [E1000-devel] " Jeff Kirsher
@ 2012-04-24 11:08 ` Nix
  2012-06-01 21:17 ` Chris Boot
  2 siblings, 0 replies; 32+ messages in thread
From: Nix @ 2012-04-24 11:08 UTC
To: Chris Boot; +Cc: e1000-devel, netdev, linux-kernel, carolyn.wyborny

On 23 Apr 2012, Chris Boot uttered the following:

> ASPM on the 82574 causes trouble. Currently the driver disables L0s for
> this NIC but only disables L1 if the MTU is >1500. This patch simply
> causes L1 to be disabled regardless of the MTU setting.

FWIW, that existing code doesn't actually work in any case. I've been
running with an MTU of 7200 on one such NIC for some time, and L0s and
L1 are definitely enabled, even though the driver says it's turning
them off.

I'll try your patch shortly, probably tomorrow.

(Now I only have to worry about the *other* bug that's been bruited
about on this list -- the one where the card locks up if its peer shuts
down. It's worrying because one of my 82574Ls has a peer that's
regularly suspended... I guess I'll try and see if I can reproduce that
lockup!)

--
NULL && (void)

* Re: [PATCH 1/2] e1000e: Disable ASPM L1 on 82574
  2012-04-23 21:29 ` [PATCH 1/2] e1000e: Disable ASPM L1 on 82574 Chris Boot
  2012-04-23 23:18 ` [E1000-devel] " Jeff Kirsher
  2012-04-24 11:08 ` Nix
@ 2012-06-01 21:17 ` Chris Boot
  2012-06-07 1:41 ` Greg KH
  2 siblings, 1 reply; 32+ messages in thread
From: Chris Boot @ 2012-06-01 21:17 UTC
To: Chris Boot
Cc: e1000-devel, netdev, linux-kernel, nix, carolyn.wyborny, stable

On 23/04/2012 22:29, Chris Boot wrote:
> ASPM on the 82574 causes trouble. Currently the driver disables L0s for
> this NIC but only disables L1 if the MTU is >1500. This patch simply
> causes L1 to be disabled regardless of the MTU setting.
>
> Signed-off-by: Chris Boot <bootc@bootc.net>
> Cc: "Wyborny, Carolyn" <carolyn.wyborny@intel.com>
> Cc: Nix <nix@esperi.org.uk>
> Link: https://lkml.org/lkml/2012/3/19/362
> ---
>  drivers/net/ethernet/intel/e1000e/82571.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000e/82571.c b/drivers/net/ethernet/intel/e1000e/82571.c
> index b3fdc69..c6d95f2 100644
> --- a/drivers/net/ethernet/intel/e1000e/82571.c
> +++ b/drivers/net/ethernet/intel/e1000e/82571.c
> @@ -2061,8 +2061,9 @@ const struct e1000_info e1000_82574_info = {
>  		| FLAG_HAS_SMART_POWER_DOWN
>  		| FLAG_HAS_AMT
>  		| FLAG_HAS_CTRLEXT_ON_LOAD,
> -	.flags2 = FLAG2_CHECK_PHY_HANG
> +	.flags2 = FLAG2_CHECK_PHY_HANG
>  		| FLAG2_DISABLE_ASPM_L0S
> +		| FLAG2_DISABLE_ASPM_L1
>  		| FLAG2_NO_DISABLE_RX,
>  	.pba = 32,
>  	.max_hw_frame_size = DEFAULT_JUMBO,

Now that this patch is in master (d4a4206e) and has presumably been
widely tested, what's the possibility of it making it into stable? I
really should have included a CC to stable when I sent it...

This patch should probably also be accompanied with 59aed952 (e1000e:
Remove special case for 82573/82574 ASPM L1 disablement) on top, to
remove a special case that's no longer required once this is applied.

Thanks,
Chris

--
Chris Boot
bootc@bootc.net

* Re: [PATCH 1/2] e1000e: Disable ASPM L1 on 82574
  2012-06-01 21:17 ` Chris Boot
@ 2012-06-07 1:41 ` Greg KH
  0 siblings, 0 replies; 32+ messages in thread
From: Greg KH @ 2012-06-07 1:41 UTC
To: Chris Boot
Cc: e1000-devel, netdev, linux-kernel, nix, carolyn.wyborny, stable

On Fri, Jun 01, 2012 at 10:17:08PM +0100, Chris Boot wrote:
> On 23/04/2012 22:29, Chris Boot wrote:
> > ASPM on the 82574 causes trouble. Currently the driver disables L0s for
> > this NIC but only disables L1 if the MTU is >1500. This patch simply
> > causes L1 to be disabled regardless of the MTU setting.
> >
> > Signed-off-by: Chris Boot <bootc@bootc.net>
> > Cc: "Wyborny, Carolyn" <carolyn.wyborny@intel.com>
> > Cc: Nix <nix@esperi.org.uk>
> > Link: https://lkml.org/lkml/2012/3/19/362
> > ---
> >  drivers/net/ethernet/intel/e1000e/82571.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/intel/e1000e/82571.c b/drivers/net/ethernet/intel/e1000e/82571.c
> > index b3fdc69..c6d95f2 100644
> > --- a/drivers/net/ethernet/intel/e1000e/82571.c
> > +++ b/drivers/net/ethernet/intel/e1000e/82571.c
> > @@ -2061,8 +2061,9 @@ const struct e1000_info e1000_82574_info = {
> >  		| FLAG_HAS_SMART_POWER_DOWN
> >  		| FLAG_HAS_AMT
> >  		| FLAG_HAS_CTRLEXT_ON_LOAD,
> > -	.flags2 = FLAG2_CHECK_PHY_HANG
> > +	.flags2 = FLAG2_CHECK_PHY_HANG
> >  		| FLAG2_DISABLE_ASPM_L0S
> > +		| FLAG2_DISABLE_ASPM_L1
> >  		| FLAG2_NO_DISABLE_RX,
> >  	.pba = 32,
> >  	.max_hw_frame_size = DEFAULT_JUMBO,
>
> Now that this patch is in master (d4a4206e) and has presumably been
> widely tested, what's the possibility of it making it into stable? I
> really should have included a CC to stable when I sent it...

I'd be glad to apply it, but it doesn't apply properly to the 3.4-stable
tree :(

> This patch should probably also be accompanied with 59aed952 (e1000e:
> Remove special case for 82573/82574 ASPM L1 disablement) on top, to
> remove a special case that's no longer required once this is applied.

As I can't apply the first one, this one shouldn't be applied either at
this point in time...

thanks,

greg k-h

* [PATCH 2/2] e1000e: Remove special case for 82573/82574 ASPM L1 disablement
  2012-04-23 21:29 ` [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled Chris Boot
  2012-04-23 21:29 ` [PATCH 1/2] e1000e: Disable ASPM L1 on 82574 Chris Boot
@ 2012-04-23 21:29 ` Chris Boot
  2012-04-23 23:18 ` [E1000-devel] " Jeff Kirsher
  2012-04-23 23:11 ` [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled Jesse Brandeburg
  2 siblings, 1 reply; 32+ messages in thread
From: Chris Boot @ 2012-04-23 21:29 UTC
To: e1000-devel, netdev; +Cc: linux-kernel, nix, carolyn.wyborny, Chris Boot

For the 82573, ASPM L1 gets disabled wholesale so this special-case code
is not required. For the 82574 the previous patch does the same as for
the 82573, disabling L1 on the adapter. Thus, this code is no longer
required and can be removed.

Signed-off-by: Chris Boot <bootc@bootc.net>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 19ab215..ea96cfd 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -5293,14 +5293,6 @@ static int e1000_change_mtu(struct net_device *netdev, int new_mtu)
 		return -EINVAL;
 	}
 
-	/* 82573 Errata 17 */
-	if (((adapter->hw.mac.type == e1000_82573) ||
-	     (adapter->hw.mac.type == e1000_82574)) &&
-	    (max_frame > ETH_FRAME_LEN + ETH_FCS_LEN)) {
-		adapter->flags2 |= FLAG2_DISABLE_ASPM_L1;
-		e1000e_disable_aspm(adapter->pdev, PCIE_LINK_STATE_L1);
-	}
-
 	while (test_and_set_bit(__E1000_RESETTING, &adapter->state))
 		usleep_range(1000, 2000);
 	/* e1000e_down -> e1000e_reset dependent on max_frame_size & mtu */
--
1.7.10

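To see why the removed block only ever fired for jumbo frames, the size test can be worked through on its own. This is a sketch under one stated assumption: that max_frame in e1000_change_mtu() is the new MTU plus the Ethernet header and FCS, which the hunk above does not show. The helper names are hypothetical; the three constants are the standard Ethernet sizes.

```c
#include <stdbool.h>

/* Standard Ethernet sizes, as in <linux/if_ether.h>. */
#define ETH_HLEN      14    /* header: dst MAC + src MAC + EtherType */
#define ETH_FCS_LEN   4     /* trailing frame check sequence */
#define ETH_FRAME_LEN 1514  /* ETH_HLEN + 1500-byte payload, no FCS */

/* Assumption: max_frame is derived from the MTU as header + payload + FCS. */
static inline int max_frame_for_mtu(int mtu)
{
	return mtu + ETH_HLEN + ETH_FCS_LEN;
}

/* The size test from the removed block: true only above the standard MTU. */
static inline bool old_code_disabled_l1(int mtu)
{
	return max_frame_for_mtu(mtu) > ETH_FRAME_LEN + ETH_FCS_LEN;
}
```

With the default MTU of 1500 both sides of the comparison come to 1518, so the old code left L1 alone; any larger MTU, such as the 7200 mentioned elsewhere in this thread, tripped it. That matches the "only disables L1 if the MTU is >1500" description in the first patch.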
* Re: [E1000-devel] [PATCH 2/2] e1000e: Remove special case for 82573/82574 ASPM L1 disablement
  2012-04-23 21:29 ` [PATCH 2/2] e1000e: Remove special case for 82573/82574 ASPM L1 disablement Chris Boot
@ 2012-04-23 23:18 ` Jeff Kirsher
  0 siblings, 0 replies; 32+ messages in thread
From: Jeff Kirsher @ 2012-04-23 23:18 UTC
To: Chris Boot; +Cc: e1000-devel, netdev, nix, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 521 bytes --]

On Mon, 2012-04-23 at 22:29 +0100, Chris Boot wrote:
> For the 82573, ASPM L1 gets disabled wholesale so this special-case code
> is not required. For the 82574 the previous patch does the same as for
> the 82573, disabling L1 on the adapter. Thus, this code is no longer
> required and can be removed.
>
> Signed-off-by: Chris Boot <bootc@bootc.net>
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 8 --------
>  1 file changed, 8 deletions(-)

I have added the patch to my queue, thanks Chris!

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

* Re: [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled
  2012-04-23 21:29 ` [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled Chris Boot
  2012-04-23 21:29 ` [PATCH 1/2] e1000e: Disable ASPM L1 on 82574 Chris Boot
  2012-04-23 21:29 ` [PATCH 2/2] e1000e: Remove special case for 82573/82574 ASPM L1 disablement Chris Boot
@ 2012-04-23 23:11 ` Jesse Brandeburg
  2012-04-29 16:45 ` Nix
  2 siblings, 1 reply; 32+ messages in thread
From: Jesse Brandeburg @ 2012-04-23 23:11 UTC
To: Chris Boot; +Cc: e1000-devel, netdev, linux-kernel, nix, carolyn.wyborny

[-- Attachment #1: Type: text/plain, Size: 763 bytes --]

On Mon, 23 Apr 2012 22:29:36 +0100
Chris Boot <bootc@bootc.net> wrote:

> Please note I haven't as-yet tested this code at all, but I do know that
> disabling ASPM L1 on these NICs (using setpci) fixes the hangs that I
> have been seeing on my Supermicro servers with X9SCL-F boards. I hope to
> get the chance to install an updated kernel on my two afftected servers
> later this week.
>
> Chris Boot (2):
>   e1000e: Disable ASPM L1 on 82574
>   e1000e: Remove special case for 82573/82574 ASPM L1 disablement

Thanks Chris, we are going to take a look over the patches and Jeff
Kirsher should apply them to our internal testing tree.

Please let us know the results of your testing, we will let you know if
we see any issues as well.

Jesse

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 834 bytes --]

* Re: [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled
  2012-04-23 23:11 ` [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled Jesse Brandeburg
@ 2012-04-29 16:45 ` Nix
  2012-04-29 18:03 ` Chris Boot
  0 siblings, 1 reply; 32+ messages in thread
From: Nix @ 2012-04-29 16:45 UTC
To: Jesse Brandeburg
Cc: Chris Boot, e1000-devel, netdev, linux-kernel, carolyn.wyborny

On 24 Apr 2012, Jesse Brandeburg outgrape:

> Please let us know the results of your testing, we will let you know if
> we see any issues as well.

Alas, it has no effect at all here; L0s and L1 claim to be being
disabled at boot time, but if you ask with lspci you see that they are
not. I strongly suspect that they *are* being disabled, but then get
re-enabled by something else, because even if I force them off with
setpci in the boot scripts, by the time the scripts have finished
executing and I've got to a root prompt where I can run setpci, L0s and
L1 are always back on again.

I may try simply running setpci between every line of my boot scripts
just to see if it's something I'm doing there, but I very much doubt
it, since I see these symptoms even if I run setpci at a point in the
boot scripts after all the network interface setup is over. I suspect
it's more to do with the link stabilizing or something. (But this is
purest guesswork.)

[    1.087592] e1000e 0000:03:00.0: Disabling ASPM L0s L1
[    1.211748] e1000e 0000:02:00.0: Disabling ASPM L0s L1

spindle# setpci -s 02:00.0 CAP_EXP+10.b
43
spindle# setpci -s 03:00.0 CAP_EXP+10.b
43

spindle:/etc/rc.d# lspci -vv -s 02:00.0
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
	Subsystem: Intel Corporation Device 0000
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 17
	Region 0: Memory at fbce0000 (32-bit, non-prefetchable) [size=128K]
	Region 2: I/O ports at dc00 [size=32]
	Region 3: Memory at fbcdc000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [c8] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000 Data: 0000
	Capabilities: [e0] Express (v1) Endpoint, MSI 00
		DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000
	Capabilities: [100 v1] Advanced Error Reporting
		UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta: RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ NonFatalErr+
		CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn-
	Kernel driver in use: e1000e

spindle:/etc/rc.d# lspci -vv -s 03:00.0
03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
	Subsystem: Intel Corporation Device 0000
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at fbde0000 (32-bit, non-prefetchable) [size=128K]
	Region 2: I/O ports at ec00 [size=32]
	Region 3: Memory at fbddc000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [c8] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000 Data: 0000
	Capabilities: [e0] Express (v1) Endpoint, MSI 00
		DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000
	Capabilities: [100 v1] Advanced Error Reporting
		UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
		CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Kernel driver in use: e1000e

Note in particular that L0s and L1 are still listed as enabled. (So I
forced them off with setpci again, by hand.)

--
NULL && (void)

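The 43 that setpci prints and the "ASPM L0s L1 Enabled ... CommClk+" that lspci prints are two views of the same Link Control register. As a minimal sketch of that correspondence, assuming the PCI Express register layout (ASPM Control in bits 1:0, Common Clock Configuration in bit 6), with hypothetical helper names that are not pciutils functions:

```c
#include <stdint.h>

/* Hypothetical helper: name the ASPM Control field (bits 1:0) of a PCIe
 * Link Control byte using the wording seen in the lspci output above. */
static inline const char *lnkctl_aspm_name(uint8_t lnkctl)
{
	static const char *const names[] = {
		"Disabled",        /* 00b */
		"L0s Enabled",     /* 01b */
		"L1 Enabled",      /* 10b */
		"L0s L1 Enabled",  /* 11b */
	};

	return names[lnkctl & 0x3u];
}

/* Bit 6 of the same register is Common Clock Configuration ("CommClk"). */
static inline int lnkctl_commclk(uint8_t lnkctl)
{
	return (lnkctl >> 6) & 1;
}
```

Under that layout 0x43 decodes to "L0s L1 Enabled" with CommClk set, and the 0x40 written by the setpci workaround decodes to "Disabled" with CommClk still set.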
* Re: [PATCH RFC 0/2] e1000e: 82574 also needs ASPM L1 completely disabled
  2012-04-29 16:45 ` Nix
@ 2012-04-29 18:03 ` Chris Boot
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Boot @ 2012-04-29 18:03 UTC
To: Nix; +Cc: Jesse Brandeburg, e1000-devel, netdev, linux-kernel, carolyn.wyborny

On 29/04/2012 17:45, Nix wrote:
> On 24 Apr 2012, Jesse Brandeburg outgrape:
>
>> Please let us know the results of your testing, we will let you know if
>> we see any issues as well.

Right, I have finally managed to test my patch on my servers. I've had a
really tough week with them due to my cluster falling over inexplicably
so I didn't want to change too much too soon after everything came back
up.

The patch does properly disable ASPM L1 as well as L0s as before. Unlike
for Nix, these do remain disabled. I'll keep running with the patch now
but I'm confident this will solve my NIC lockups just as Nix's setpci
incantations did.

Please apply the patches. I'd also really like to have them CCed to
stable so that Debian will pick them up in time.

> Alas, it has no effect at all here; L0s and L1 claim to be being
> disabled at boot time, but if you ask with lspci you see that they are
> not. I strongly suspect that they *are* being disabled, but then get
> re-enabled by something else, because even if I force them off with
> setpci in the boot scripts, by the time the scripts have finished
> executing and I've got to a root prompt where I can run setpci, L0s and
> L1 are always back on again.

Indeed our troubles must be different. My patch definitely disables ASPM
fully on the NIC and the upstream device as evidenced by lspci. Here are
extracts from the boot logs and lspci before my patch:

[    3.305372] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k
[    3.317015] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[    3.328436] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[    3.328482] e1000e 0000:00:19.0: setting latency timer to 64
[    3.329493] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[    3.679153] e1000e 0000:00:19.0: eth1: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:d1
[    3.691391] e1000e 0000:00:19.0: eth1: Intel(R) PRO/1000 Network Connection
[    3.703689] e1000e 0000:00:19.0: eth1: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF
[    3.715639] e1000e 0000:05:00.0: Disabling ASPM L0s
[    4.156806] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    4.371659] e1000e 0000:05:00.0: setting latency timer to 64
[    4.371928] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X
[    4.371933] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X
[    4.371937] e1000e 0000:05:00.0: irq 67 for MSI/MSI-X
[    4.485505] e1000e 0000:05:00.0: eth3: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:d0
[    4.485507] e1000e 0000:05:00.0: eth3: Intel(R) PRO/1000 Network Connection
[    4.485647] e1000e 0000:05:00.0: eth3: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF
[   14.237551] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[   14.293193] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[   16.160177] e1000e: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
[   16.174293] e1000e 0000:05:00.0: eth2: 10/100 speed: disabling TSO

tidyup ~ # lspci -vvv -s 05:00.0 | grep ASPM
	LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
	LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
tidyup ~ # lspci -vvv -s 00:1c.4 | grep ASPM
	LnkCap: Port #5, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <4us
	LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+

And now the same kernel with the patch applied:

[    3.310165] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k
[    3.321625] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[    3.332996] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[    3.413898] e1000e 0000:00:19.0: setting latency timer to 64
[    3.426699] e1000e 0000:00:19.0: irq 54 for MSI/MSI-X
[    3.731112] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:d1
[    3.743437] e1000e 0000:00:19.0: eth2: Intel(R) PRO/1000 Network Connection
[    3.755918] e1000e 0000:00:19.0: eth2: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF
[    3.768758] e1000e 0000:05:00.0: Disabling ASPM L0s L1
[    3.794095] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    3.794178] e1000e 0000:05:00.0: setting latency timer to 64
[    3.795074] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X
[    3.795088] e1000e 0000:05:00.0: irq 65 for MSI/MSI-X
[    3.795107] e1000e 0000:05:00.0: irq 66 for MSI/MSI-X
[    3.912691] e1000e 0000:05:00.0: eth3: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:d0
[    3.912693] e1000e 0000:05:00.0: eth3: Intel(R) PRO/1000 Network Connection
[    3.912842] e1000e 0000:05:00.0: eth3: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF
[   14.454955] e1000e 0000:00:19.0: irq 54 for MSI/MSI-X
[   14.507724] e1000e 0000:00:19.0: irq 54 for MSI/MSI-X
[   15.944706] e1000e: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
[   15.956279] e1000e 0000:05:00.0: eth2: 10/100 speed: disabling TSO

tidyup ~ # lspci -vvv -s 05:00.0 | grep ASPM
	LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
	LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
tidyup ~ # lspci -vvv -s 00:1c.4 | grep ASPM
	LnkCap: Port #5, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <4us
	LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+

Cheers,
Chris

--
Chris Boot
bootc@bootc.net