* e1000e hardware unit hangs
@ 2018-01-23 23:46 Ben Greear
  2018-01-24 16:11 ` [Intel-wired-lan] " Alexander Duyck
  0 siblings, 1 reply; 13+ messages in thread
From: Ben Greear @ 2018-01-23 23:46 UTC (permalink / raw)
  To: netdev

Hello,

Anyone have any more suggestions for making e1000e work better? This is from a 4.9.65+ kernel,
with these additional e1000e patches applied:

e1000e: Fix error path in link detection
e1000e: Fix wrong comment related to link detection
e1000e: Fix return value test
e1000e: Separate signaling for link check/link up
e1000e: Avoid receiver overrun interrupt bursts

Test case is simply to run 30000 tcp connections each trying to send 56Kbps of bi-directional
data between a pair of e1000e interfaces :)

No OOM related issues are seen on this kernel...similar test on 4.13 showed some OOM
issues, but I have not debugged that yet...

Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4294737199, wd-timeout: 5000 jiffies: 4294745088 tx-queues: 1
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4294737200, wd-timeout: 5000 jiffies: 4294745088 tx-queues: 1
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut here ]------------
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 7 PID: 0 at /home/greearb/git/linux-4.9.dev.y/net/sched/sch_generic.c:322 dev_watchdog+0x267/0x270
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 cfg80211 bnep bluetooth macvlan wanlink(O) pktgen fuse corete...sunrpc ipmi_d
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: CPU: 7 PID: 0 Comm: swapper/7 Tainted: G O 4.9.65+ #21
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ffff88042fdc3df0 ffffffff8142d791 0000000000000000 0000000000000000
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ffff88042fdc3e30 ffffffff8110f266 000001422fdc3e08 0000000000000000
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: 0000000000001388 00000000fffc7d30 ffff880417d0c000 00000000fffc9c00
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Call Trace:
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: <IRQ>
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8142d791>] dump_stack+0x63/0x82
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8110f266>] __warn+0xc6/0xe0
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8110f338>] warn_slowpath_null+0x18/0x20
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da497>] dev_watchdog+0x267/0x270
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da230>] ? qdisc_rcu_free+0x40/0x40
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8117bf70>] call_timer_fn+0x30/0x150
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da230>] ? qdisc_rcu_free+0x40/0x40
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8117c350>] run_timer_softirq+0x1f0/0x450
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81051021>] ? lapic_next_deadline+0x21/0x30
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8118a54d>] ? clockevents_program_event+0x7d/0x120
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81115101>] __do_softirq+0xc1/0x2c0
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81115461>] irq_exit+0xb1/0xc0
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81051c9d>] smp_apic_timer_interrupt+0x3d/0x50
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81895842>] apic_timer_interrupt+0x82/0x90
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: <EOI>
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81726e46>] ? cpuidle_enter_state+0x126/0x300
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81727042>] cpuidle_enter+0x12/0x20
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff811521ce>] call_cpuidle+0x1e/0x40
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8115240a>] cpu_startup_entry+0x13a/0x220
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8104fbd9>] start_secondary+0x149/0x170
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace 69e31de175b59d4f ]---
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Reset adapter unexpectedly
Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly
Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Detected Hardware Unit Hang:
  TDH <a8>
  TDT <f3>...
Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4294748730, wd-timeout: 5000 jiffies: 4294759424 tx-queues: 1
Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4294748730, wd-timeout: 5000 jiffies: 4294759424 tx-queues: 1
Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly
Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Reset adapter unexpectedly
Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4294766123, wd-timeout: 5000 jiffies: 4294771200 tx-queues: 1
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4294766125, wd-timeout: 5000 jiffies: 4294771200 tx-queues: 1
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Reset adapter unexpectedly
Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Detected Hardware Unit Hang:
  TDH <c8>
  TDT <f5>...
Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 13+ messages in thread
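The numbers in the report above can be checked with a little arithmetic. The following is only a rough sketch, not part of the original message: the function names are made up for illustration, the ring size of 256 descriptors is an assumption (the thread does not state it), and the load figure assumes that 56Kbps is the per-direction rate of each connection.

```python
import re

# Matches the watchdog message format shown in the log above.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<dev>\S+) \((?P<driver>\S+)\): "
    r"transmit queue (?P<queue>\d+) timed out, "
    r"trans_start: (?P<trans_start>\d+), "
    r"wd-timeout: (?P<timeout>\d+) "
    r"jiffies: (?P<jiffies>\d+)"
)


def stalled_jiffies(line):
    """Return (device, jiffies since trans_start, wd-timeout), or None."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    # Assumption: the counter did not wrap between trans_start and jiffies.
    elapsed = int(m.group("jiffies")) - int(m.group("trans_start"))
    return m.group("dev"), elapsed, int(m.group("timeout"))


def pending_descriptors(tdh_hex, tdt_hex, ring_size=256):
    """Descriptors between head (TDH) and tail (TDT), modulo the ring size.

    ring_size=256 is an assumed value; the thread does not state it.
    """
    return (int(tdt_hex, 16) - int(tdh_hex, 16)) % ring_size


def offered_load_mbps(connections, kbps_per_connection):
    """Aggregate offered load in Mbps for one direction (1 Mbps = 1000 Kbps)."""
    return connections * kbps_per_connection / 1000


line = ("NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, "
        "trans_start: 4294737199, wd-timeout: 5000 jiffies: 4294745088 tx-queues: 1")
print(stalled_jiffies(line))            # ('eth3', 7889, 5000)
print(pending_descriptors("a8", "f3"))  # 75
print(pending_descriptors("c8", "f5"))  # 45
print(offered_load_mbps(30000, 56))     # 1680.0
```

Under these assumptions, the first watchdog line reports 7889 jiffies since the last transmit start against a timeout of 5000, the two hang reports show 75 and 45 descriptors still queued between head and tail, and the requested load of 1680 Mbps per direction is above the 1000 Mbps link rate in the log. Whether that oversubscription is related to the hang is not something the log alone can show.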
* Re: e1000e hardware unit hangs
  2018-01-23 23:46 e1000e hardware unit hangs Ben Greear
@ 2018-01-24 16:11 ` Alexander Duyck
  0 siblings, 0 replies; 13+ messages in thread
From: Alexander Duyck @ 2018-01-24 16:11 UTC (permalink / raw)
  To: Ben Greear, intel-wired-lan, e1000-devel; +Cc: netdev, Neftin, Sasha

On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <greearb@candelatech.com> wrote:
> Hello,
>
> Anyone have any more suggestions for making e1000e work better? This is
> from a 4.9.65+ kernel,
> with these additional e1000e patches applied:
>
> e1000e: Fix error path in link detection
> e1000e: Fix wrong comment related to link detection
> e1000e: Fix return value test
> e1000e: Separate signaling for link check/link up
> e1000e: Avoid receiver overrun interrupt bursts

Most of these patches shouldn't address anything that would trigger Tx
hangs. They are mostly related to just link detection.

> Test case is simply to run 30000 tcp connections each trying to send 56Kbps
> of bi-directional
> data between a pair of e1000e interfaces :)
>
> No OOM related issues are seen on this kernel...similar test on 4.13 showed
> some OOM
> issues, but I have not debugged that yet...

Really a question like this probably belongs on e1000-devel or
intel-wired-lan so I have added those lists and the e1000e maintainer
to the thread.

It would be useful if you could provide more information about the
device itself such as the ID and the kind of test you are running.
Keep in mind the e1000e driver supports a pretty broad swath of
devices so we need to narrow things down a bit.

> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 > (e1000e): transmit queue 0 timed out, trans_start: 4294737199, wd-timeout: > 5000 jiffies: 4294745088 tx-queues: 1 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 > (e1000e): transmit queue 0 timed out, trans_start: 4294737200, wd-timeout: > 5000 jiffies: 4294745088 tx-queues: 1 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut here > ]------------ > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 7 PID: 0 > at /home/greearb/git/linux-4.9.dev.y/net/sched/sch_generic.c:322 > dev_watchdog+0x267/0x270 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: > nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 cfg80211 bnep > bluetooth macvlan wanlink(O) pktgen fuse corete...sunrpc ipmi_d > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: CPU: 7 PID: 0 Comm: > swapper/7 Tainted: G O 4.9.65+ #21 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: > Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ffff88042fdc3df0 > ffffffff8142d791 0000000000000000 0000000000000000 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ffff88042fdc3e30 > ffffffff8110f266 000001422fdc3e08 0000000000000000 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: 0000000000001388 > 00000000fffc7d30 ffff880417d0c000 00000000fffc9c00 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Call Trace: > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: <IRQ> > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8142d791>] > dump_stack+0x63/0x82 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8110f266>] > __warn+0xc6/0xe0 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8110f338>] > warn_slowpath_null+0x18/0x20 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da497>] > 
dev_watchdog+0x267/0x270 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da230>] ? > qdisc_rcu_free+0x40/0x40 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8117bf70>] > call_timer_fn+0x30/0x150 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da230>] ? > qdisc_rcu_free+0x40/0x40 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8117c350>] > run_timer_softirq+0x1f0/0x450 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81051021>] ? > lapic_next_deadline+0x21/0x30 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8118a54d>] ? > clockevents_program_event+0x7d/0x120 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81115101>] > __do_softirq+0xc1/0x2c0 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81115461>] > irq_exit+0xb1/0xc0 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81051c9d>] > smp_apic_timer_interrupt+0x3d/0x50 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81895842>] > apic_timer_interrupt+0x82/0x90 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: <EOI> > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81726e46>] ? 
> cpuidle_enter_state+0x126/0x300 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81727042>] > cpuidle_enter+0x12/0x20 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff811521ce>] > call_cpuidle+0x1e/0x40 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8115240a>] > cpu_startup_entry+0x13a/0x220 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8104fbd9>] > start_secondary+0x149/0x170 > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace > 69e31de175b59d4f ]--- > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 > eth2: Reset adapter unexpectedly > Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 > eth3: Reset adapter unexpectedly > Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is > Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 > eth2: Detected Hardware Unit Hang: > TDH > <a8> > TDT > <f3>... 
> Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is > Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 > (e1000e): transmit queue 0 timed out, trans_start: 4294748730, wd-timeout: > 5000 jiffies: 4294759424 tx-queues: 1 > Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 > (e1000e): transmit queue 0 timed out, trans_start: 4294748730, wd-timeout: > 5000 jiffies: 4294759424 tx-queues: 1 > Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 > eth3: Reset adapter unexpectedly > Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 > eth2: Reset adapter unexpectedly > Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is > Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is > Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 > (e1000e): transmit queue 0 timed out, trans_start: 4294766123, wd-timeout: > 5000 jiffies: 4294771200 tx-queues: 1 > Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 > (e1000e): transmit queue 0 timed out, trans_start: 4294766125, wd-timeout: > 5000 jiffies: 4294771200 tx-queues: 1 > Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 > eth2: Reset adapter unexpectedly > Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 > eth3: Reset adapter unexpectedly > Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is > Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 > eth2: Detected Hardware Unit Hang: > TDH > <c8> > TDT > <f5>... 
> Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is > Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > > > Thanks, > Ben > > -- > Ben Greear <greearb@candelatech.com> > Candela Technologies Inc http://www.candelatech.com > ^ permalink raw reply [flat|nested] 13+ messages in thread
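One way to collect the device ID asked for above is to read it from sysfs, where Linux exposes a "vendor" and a "device" file for each PCI function. This is a minimal sketch, not something from the thread: the helper name pci_ids is hypothetical, and the PCI addresses 0000:06:00.0 and 0000:07:00.0 are simply the ones that appear in the log.

```python
from pathlib import Path


def pci_ids(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the (vendor, device) ID strings for one PCI function.

    pci_address is a full address such as "0000:06:00.0".
    sysfs_root is a parameter only so the helper can be pointed at a
    different directory tree, for example in a test.
    """
    dev_dir = Path(sysfs_root) / pci_address
    vendor = (dev_dir / "vendor").read_text().strip()
    device = (dev_dir / "device").read_text().strip()
    return vendor, device
```

On the machine from the report, calling pci_ids("0000:06:00.0") and pci_ids("0000:07:00.0") would return the IDs of the two adapters that logged the hangs; the actual values are not given anywhere in the thread, so none are shown here.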
* Re: e1000e hardware unit hangs
  2018-01-24 16:11 ` [Intel-wired-lan] " Alexander Duyck
@ 2018-01-24 16:34   ` Neftin, Sasha
  -1 siblings, 0 replies; 13+ messages in thread
From: Neftin, Sasha @ 2018-01-24 16:34 UTC (permalink / raw)
  To: Alexander Duyck, Ben Greear, intel-wired-lan, e1000-devel; +Cc: netdev

On 1/24/2018 18:11, Alexander Duyck wrote:
> On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <greearb@candelatech.com> wrote:
>> Hello,
>>
>> Anyone have any more suggestions for making e1000e work better? This is
>> from a 4.9.65+ kernel,
>> with these additional e1000e patches applied:
>>
>> e1000e: Fix error path in link detection
>> e1000e: Fix wrong comment related to link detection
>> e1000e: Fix return value test
>> e1000e: Separate signaling for link check/link up
>> e1000e: Avoid receiver overrun interrupt bursts
>
> Most of these patches shouldn't address anything that would trigger Tx
> hangs. They are mostly related to just link detection.
>
>> Test case is simply to run 30000 tcp connections each trying to send 56Kbps
>> of bi-directional
>> data between a pair of e1000e interfaces :)
>>
>> No OOM related issues are seen on this kernel...similar test on 4.13 showed
>> some OOM
>> issues, but I have not debugged that yet...
>
> Really a question like this probably belongs on e1000-devel or
> intel-wired-lan so I have added those lists and the e1000e maintainer
> to the thread.
>
> It would be useful if you could provide more information about the
> device itself such as the ID and the kind of test you are running.
> Keep in mind the e1000e driver supports a pretty broad swath of
> devices so we need to narrow things down a bit.
>
Please also re-check whether your kernel includes:
e1000e: fix buffer overrun while the I219 is processing DMA transactions
e1000e: fix the use of magic numbers for buffer overrun issue
Where do you take the fresh version of the kernel from?
>> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 >> (e1000e): transmit queue 0 timed out, trans_start: 4294737199, wd-timeout: >> 5000 jiffies: 4294745088 tx-queues: 1 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 >> (e1000e): transmit queue 0 timed out, trans_start: 4294737200, wd-timeout: >> 5000 jiffies: 4294745088 tx-queues: 1 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut here >> ]------------ >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 7 PID: 0 >> at /home/greearb/git/linux-4.9.dev.y/net/sched/sch_generic.c:322 >> dev_watchdog+0x267/0x270 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: >> nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 cfg80211 bnep >> bluetooth macvlan wanlink(O) pktgen fuse corete...sunrpc ipmi_d >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: CPU: 7 PID: 0 Comm: >> swapper/7 Tainted: G O 4.9.65+ #21 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: >> Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ffff88042fdc3df0 >> ffffffff8142d791 0000000000000000 0000000000000000 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ffff88042fdc3e30 >> ffffffff8110f266 000001422fdc3e08 0000000000000000 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: 0000000000001388 >> 00000000fffc7d30 ffff880417d0c000 00000000fffc9c00 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: Call Trace: >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: <IRQ> >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8142d791>] >> dump_stack+0x63/0x82 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8110f266>] >> __warn+0xc6/0xe0 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8110f338>] >> warn_slowpath_null+0x18/0x20 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: 
[<ffffffff817da497>] >> dev_watchdog+0x267/0x270 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da230>] ? >> qdisc_rcu_free+0x40/0x40 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8117bf70>] >> call_timer_fn+0x30/0x150 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff817da230>] ? >> qdisc_rcu_free+0x40/0x40 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8117c350>] >> run_timer_softirq+0x1f0/0x450 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81051021>] ? >> lapic_next_deadline+0x21/0x30 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8118a54d>] ? >> clockevents_program_event+0x7d/0x120 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81115101>] >> __do_softirq+0xc1/0x2c0 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81115461>] >> irq_exit+0xb1/0xc0 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81051c9d>] >> smp_apic_timer_interrupt+0x3d/0x50 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81895842>] >> apic_timer_interrupt+0x82/0x90 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: <EOI> >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81726e46>] ? 
>> cpuidle_enter_state+0x126/0x300 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81727042>] >> cpuidle_enter+0x12/0x20 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff811521ce>] >> call_cpuidle+0x1e/0x40 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8115240a>] >> cpu_startup_entry+0x13a/0x220 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8104fbd9>] >> start_secondary+0x149/0x170 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace >> 69e31de175b59d4f ]--- >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 >> eth2: Reset adapter unexpectedly >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 >> eth3: Reset adapter unexpectedly >> Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 >> eth2: Detected Hardware Unit Hang: >> TDH >> <a8> >> TDT >> <f3>... 
>> Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 >> (e1000e): transmit queue 0 timed out, trans_start: 4294748730, wd-timeout: >> 5000 jiffies: 4294759424 tx-queues: 1 >> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 >> (e1000e): transmit queue 0 timed out, trans_start: 4294748730, wd-timeout: >> 5000 jiffies: 4294759424 tx-queues: 1 >> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 >> eth3: Reset adapter unexpectedly >> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 >> eth2: Reset adapter unexpectedly >> Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 >> (e1000e): transmit queue 0 timed out, trans_start: 4294766123, wd-timeout: >> 5000 jiffies: 4294771200 tx-queues: 1 >> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 >> (e1000e): transmit queue 0 timed out, trans_start: 4294766125, wd-timeout: >> 5000 jiffies: 4294771200 tx-queues: 1 >> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 >> eth2: Reset adapter unexpectedly >> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 >> eth3: Reset adapter unexpectedly >> Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 >> eth2: Detected Hardware Unit Hang: >> TDH >> <c8> >> TDT >> <f5>... 
>> Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> >> >> Thanks, >> Ben >> >> -- >> Ben Greear <greearb@candelatech.com> >> Candela Technologies Inc http://www.candelatech.com >> ^ permalink raw reply [flat|nested] 13+ messages in thread
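The check suggested above (whether the two buffer overrun fixes are in the tree) could be done by comparing commit subjects from git log. This is a rough sketch under some assumptions: it matches on the subject line only, so a backport with a reworded subject would be reported as missing, and the helper names and the repository path passed in are hypothetical.

```python
import subprocess

# Subjects of the two patches named in the message above.
WANTED = [
    "e1000e: fix buffer overrun while the I219 is processing DMA transactions",
    "e1000e: fix the use of magic numbers for buffer overrun issue",
]


def missing_subjects(log_subjects, wanted=WANTED):
    """Return the wanted subjects that do not appear in the log text.

    log_subjects is text with one commit subject per line.
    """
    present = {s.strip() for s in log_subjects.splitlines()}
    return [w for w in wanted if w not in present]


def e1000e_subjects(repo_dir):
    """Return the commit subjects touching the e1000e driver in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--format=%s", "--",
         "drivers/net/ethernet/intel/e1000e"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout
```

With a checkout of the kernel tree in question, missing_subjects(e1000e_subjects(path_to_tree)) would list whichever of the two patches is absent, and an empty list would mean both subjects were found.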
>> cpuidle_enter_state+0x126/0x300 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff81727042>] >> cpuidle_enter+0x12/0x20 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff811521ce>] >> call_cpuidle+0x1e/0x40 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8115240a>] >> cpu_startup_entry+0x13a/0x220 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: [<ffffffff8104fbd9>] >> start_secondary+0x149/0x170 >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace >> 69e31de175b59d4f ]--- >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 >> eth2: Reset adapter unexpectedly >> Jan 23 15:38:59 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 >> eth3: Reset adapter unexpectedly >> Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 >> eth2: Detected Hardware Unit Hang: >> TDH >> <a8> >> TDT >> <f3>... 
>> Jan 23 15:39:02 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 >> (e1000e): transmit queue 0 timed out, trans_start: 4294748730, wd-timeout: >> 5000 jiffies: 4294759424 tx-queues: 1 >> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 >> (e1000e): transmit queue 0 timed out, trans_start: 4294748730, wd-timeout: >> 5000 jiffies: 4294759424 tx-queues: 1 >> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 >> eth3: Reset adapter unexpectedly >> Jan 23 15:39:13 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 >> eth2: Reset adapter unexpectedly >> Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 23 15:39:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 >> (e1000e): transmit queue 0 timed out, trans_start: 4294766123, wd-timeout: >> 5000 jiffies: 4294771200 tx-queues: 1 >> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 >> (e1000e): transmit queue 0 timed out, trans_start: 4294766125, wd-timeout: >> 5000 jiffies: 4294771200 tx-queues: 1 >> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 >> eth2: Reset adapter unexpectedly >> Jan 23 15:39:25 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 >> eth3: Reset adapter unexpectedly >> Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 >> eth2: Detected Hardware Unit Hang: >> TDH >> <c8> >> TDT >> <f5>... 
>> Jan 23 15:39:28 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is >> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> >> >> Thanks, >> Ben >> >> -- >> Ben Greear <greearb@candelatech.com> >> Candela Technologies Inc http://www.candelatech.com >> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: e1000e hardware unit hangs
  2018-01-24 16:34 ` [Intel-wired-lan] " Neftin, Sasha
@ 2018-01-24 18:31 ` Ben Greear
  -1 siblings, 0 replies; 13+ messages in thread
From: Ben Greear @ 2018-01-24 18:31 UTC (permalink / raw)
  To: Neftin, Sasha, Alexander Duyck, intel-wired-lan, e1000-devel; +Cc: netdev

On 01/24/2018 08:34 AM, Neftin, Sasha wrote:
> On 1/24/2018 18:11, Alexander Duyck wrote:
>> On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <greearb@candelatech.com> wrote:
>>> Hello,
>>>
>>> Anyone have any more suggestions for making e1000e work better? This is
>>> from a 4.9.65+ kernel,
>>> with these additional e1000e patches applied:
>>>
>>> e1000e: Fix error path in link detection
>>> e1000e: Fix wrong comment related to link detection
>>> e1000e: Fix return value test
>>> e1000e: Separate signaling for link check/link up
>>> e1000e: Avoid receiver overrun interrupt bursts
>>
>> Most of these patches shouldn't address anything that would trigger Tx
>> hangs. They are mostly related to just link detection.
>>
>>> Test case is simply to run 30000 tcp connections each trying to send 56Kbps
>>> of bi-directional
>>> data between a pair of e1000e interfaces :)
>>>
>>> No OOM related issues are seen on this kernel...similar test on 4.13 showed
>>> some OOM
>>> issues, but I have not debugged that yet...
>>
>> Really a question like this probably belongs on e1000-devel or
>> intel-wired-lan so I have added those lists and the e1000e maintainer
>> to the thread.
>>
>> It would be useful if you could provide more information about the
>> device itself such as the ID and the kind of test you are running.
>> Keep in mind the e1000e driver supports a pretty broad swath of
>> devices so we need to narrow things down a bit.
>>
> please, also re-check if your kernel include:
> e1000e: fix buffer overrun while the I219 is processing DMA transactions
> e1000e: fix the use of magic numbers for buffer overrun issue
> where you take fresh version of kernel?

Hello,

I tried adding those two patches, but I still see this splat shortly after starting my test.

The kernel I am using is here: https://github.com/greearb/linux-ct-4.13

I've seen similar issues at least back to the 4.0 kernel, including stock kernels and my
own kernels with additional patches.

Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut here ]------------
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 PID: 0 at /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.13.16+ #22
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: ffffffff81e104c0 task.stack: ffffffff81e00000
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: 0010:dev_watchdog+0x228/0x250
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: 0018:ffff88042fc03e50 EFLAGS: 00010282
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: ffff88042fc15b40 RSI: ffff88042fc0dbf8 RDI: ffff88042fc0dbf8
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: ffff88042fc03e98 R08: 0000000000000001 R09: 00000000000003c4
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: 0000000000000000 R11: 00000000000003c4 R12: 0000000000001388
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: 0000000100050dc3 R14: ffff880417670000 R15: 0000000100052400
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS: 0000000000000000(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: 0000000001d14000 CR3: 0000000001e09000 CR4: 00000000001406f0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace:
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: <IRQ>
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: call_timer_fn+0x30/0x160
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: run_timer_softirq+0x1f0/0x450
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? lapic_next_deadline+0x21/0x30
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? clockevents_program_event+0x78/0xf0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: __do_softirq+0xc1/0x2c0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: irq_exit+0xb1/0xc0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: smp_apic_timer_interrupt+0x38/0x50
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: apic_timer_interrupt+0x89/0x90
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: 0010:cpuidle_enter_state+0x12b/0x310
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: 0018:ffffffff81e03de8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 000000000000001f
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: 0000000000000000 RSI: 00000000238e2b4c RDI: 0000000000000000
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: ffffffff81e03e20 R08: 00000000000000af R09: 0000000000000018
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: 00000000000000af R11: 0000000000000f27 R12: 0000000000000003
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: ffff88042fc24918 R14: ffffffff81eae658 R15: 00000093fd9af742
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: </IRQ>
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? cpuidle_enter_state+0x119/0x310
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: cpuidle_enter+0x12/0x20
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: call_cpuidle+0x1e/0x40
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: do_idle+0x17f/0x1d0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: cpu_startup_entry+0x5f/0x70
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: rest_init+0xc9/0xd0
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: start_kernel+0x483/0x490
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? early_idt_handler_array+0x120/0x120
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: x86_64_start_reservations+0x2a/0x2c
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: x86_64_start_kernel+0x13c/0x14b
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: secondary_startup_64+0x9f/0x9f
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Code: 04 00 00 89 4d cc e8 b8 88 fd ff 8b 4d cc 45 89 e1 4d 89 e8 48 89 c2 4c 89 f6 48 c7 c7 98 23 d4 81 51 41 57 89 d9 e8 44 48 94 ff <0f>... 63 8e 60 04
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace 04264863cdced748 ]---
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Reset adapter unexpectedly
Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down
Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

....

Jan 24 10:27:05 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295760337, wd-timeout: 5000 jiffies: 4295767040 tx-queues: 1
Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly
Jan 24 10:27:27 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Detected Hardware Unit Hang: TDH <43> TDT <90>...
Jan 24 10:27:29 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295782403, wd-timeout: 5000 jiffies: 4295789056 tx-queues: 1
Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly
Jan 24 10:27:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295802883, wd-timeout: 5000 jiffies: 4295809024 tx-queues: 1
Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Reset adapter unexpectedly
Jan 24 10:28:10 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Detected Hardware Unit Hang: TDH <10> TDT <5d>...
Jan 24 10:28:11 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295827457, wd-timeout: 5000 jiffies: 4295833088 tx-queues: 1
Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly
Jan 24 10:28:35 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295841678, wd-timeout: 5000 jiffies: 4295847424 tx-queues: 1
Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Reset adapter unexpectedly
Jan 24 10:28:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Detected Hardware Unit Hang: TDH <8> TDT <55>...
Jan 24 10:28:49 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295874528, wd-timeout: 5000 jiffies: 4295882240 tx-queues: 1
Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly
Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Down
Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

.....

[root@lf1003-e3v2-13100124-f20x64 ~]# ethtool -i eth2
driver: e1000e
version: 3.2.6-k
firmware-version: 2.1-2
bus-info: 0000:06:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

[root@lf1003-e3v2-13100124-f20x64 ~]# lspci -vvv -s 0000:06:00.0
06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
        Subsystem: Super Micro Computer Inc Device 0000
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at df600000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at b000 [size=32]
        Region 3: Memory at df620000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000 Data: 0000
        Capabilities: [e0] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <128ns, L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-d2-06-aa
        Kernel driver in use: e1000e
        Kernel modules: e1000e

My test is a (custom) traffic generator that is setting up 30k tcp connections between two
e1000e ports and sending traffic as fast as possible.

I'd be happy to help you set up this exact tool on your system(s), but we have seen similar
issues with e1000e in other high-speed tests, so I don't think it is specific to this
particular test. Maybe this test makes it easier to reproduce however.

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com

^ permalink raw reply [flat|nested] 13+ messages in thread
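
Editorial aside, not part of the original thread: the watchdog and unit-hang lines in the logs above carry enough numbers to check by hand. The sketch below pulls them out with Python. It assumes the extended watchdog format that this kernel prints (`trans_start`, `wd-timeout`, `jiffies`), that the TDH/TDT values are hexadecimal, and a 256-entry Tx ring. The ring size is an assumption and should be confirmed with `ethtool -g`; the function names are illustrative only.

```python
import re

# Matches the extended watchdog line as it appears in the logs of this thread.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<dev>\S+) \((?P<drv>\S+)\): "
    r"transmit queue (?P<queue>\d+) timed out, "
    r"trans_start: (?P<start>\d+), wd-timeout: (?P<timeout>\d+) "
    r"jiffies: (?P<now>\d+)"
)

# Matches the (truncated) unit-hang line; TDH/TDT are assumed to be hex.
HANG_RE = re.compile(
    r"(?P<dev>\S+): Detected Hardware Unit Hang:\s+"
    r"TDH\s+<(?P<tdh>[0-9a-fA-F]+)>\s+TDT\s+<(?P<tdt>[0-9a-fA-F]+)>"
)


def watchdog_events(text):
    """Yield (device, jiffies since trans_start, wd-timeout) per watchdog line."""
    for m in WATCHDOG_RE.finditer(text):
        start, now = int(m["start"]), int(m["now"])
        yield m["dev"], now - start, int(m["timeout"])


def pending_descriptors(tdh, tdt, ring_size=256):
    """Descriptors between head (TDH) and tail (TDT), modulo the ring size.

    ring_size=256 is an assumption, not a value taken from the thread.
    """
    return (tdt - tdh) % ring_size


def hang_events(text, ring_size=256):
    """Yield (device, TDH, TDT, pending descriptors) per unit-hang line."""
    for m in HANG_RE.finditer(text):
        tdh, tdt = int(m["tdh"], 16), int(m["tdt"], 16)
        yield m["dev"], tdh, tdt, pending_descriptors(tdh, tdt, ring_size)


if __name__ == "__main__":
    sample = (
        "NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, "
        "trans_start: 4295298499, wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1\n"
        "e1000e 0000:06:00.0 eth2: Detected Hardware Unit Hang: TDH <43> TDT <90>...\n"
    )
    for dev, elapsed, timeout in watchdog_events(sample):
        print(f"{dev}: {elapsed} jiffies since trans_start (timeout {timeout})")
    for dev, tdh, tdt, pending in hang_events(sample):
        print(f"{dev}: TDH={tdh:#x} TDT={tdt:#x} pending={pending}")
```

For the first 4.13 watchdog line this gives 5693 jiffies since `trans_start` against a timeout of 5000, and `TDH <43> TDT <90>` works out to 77 descriptors queued but not yet processed by the hardware, under the assumed ring size.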
* [Intel-wired-lan] e1000e hardware unit hangs @ 2018-01-24 18:31 ` Ben Greear 0 siblings, 0 replies; 13+ messages in thread From: Ben Greear @ 2018-01-24 18:31 UTC (permalink / raw) To: intel-wired-lan On 01/24/2018 08:34 AM, Neftin, Sasha wrote: > On 1/24/2018 18:11, Alexander Duyck wrote: >> On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <greearb@candelatech.com> wrote: >>> Hello, >>> >>> Anyone have any more suggestions for making e1000e work better? This is >>> from a 4.9.65+ kernel, >>> with these additional e1000e patches applied: >>> >>> e1000e: Fix error path in link detection >>> e1000e: Fix wrong comment related to link detection >>> e1000e: Fix return value test >>> e1000e: Separate signaling for link check/link up >>> e1000e: Avoid receiver overrun interrupt bursts >> >> Most of these patches shouldn't address anything that would trigger Tx >> hangs. They are mostly related to just link detection. >> >>> Test case is simply to run 30000 tcp connections each trying to send 56Kbps >>> of bi-directional >>> data between a pair of e1000e interfaces :) >>> >>> No OOM related issues are seen on this kernel...similar test on 4.13 showed >>> some OOM >>> issues, but I have not debugged that yet... >> >> Really a question like this probably belongs on e1000-devel or >> intel-wired-lan so I have added those lists and the e1000e maintainer >> to the thread. >> >> It would be useful if you could provide more information about the >> device itself such as the ID and the kind of test you are running. >> Keep in mind the e1000e driver supports a pretty broad swath of >> devices so we need to narrow things down a bit. >> > please, also re-check if your kernel include: > e1000e: fix buffer overrun while the I219 is processing DMA transactions > e1000e: fix the use of magic numbers for buffer overrun issue > where you take fresh version of kernel? Hello, I tried adding those two patches, but I still see this splat shortly after starting my test. 
The kernel I am using is here: https://github.com/greearb/linux-ct-4.13 I've seen similar issues at least back to the 4.0 kernel, including stock kernels and my own kernels with additional patches. Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut here ]------------ Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 PID: 0 at /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 dev_watchdog+0x228/0x250 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.13.16+ #22 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: ffffffff81e104c0 task.stack: ffffffff81e00000 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: 0010:dev_watchdog+0x228/0x250 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: 0018:ffff88042fc03e50 EFLAGS: 00010282 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: ffff88042fc15b40 RSI: ffff88042fc0dbf8 RDI: ffff88042fc0dbf8 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: ffff88042fc03e98 R08: 0000000000000001 R09: 00000000000003c4 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: 0000000000000000 R11: 00000000000003c4 R12: 0000000000001388 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: 0000000100050dc3 R14: ffff880417670000 R15: 0000000100052400 Jan 24 10:19:42 
lf1003-e3v2-13100124-f20x64 kernel: FS: 0000000000000000(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: 0000000001d14000 CR3: 0000000001e09000 CR4: 00000000001406f0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace: Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: <IRQ> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: call_timer_fn+0x30/0x160 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: run_timer_softirq+0x1f0/0x450 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? lapic_next_deadline+0x21/0x30 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? clockevents_program_event+0x78/0xf0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: __do_softirq+0xc1/0x2c0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: irq_exit+0xb1/0xc0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: smp_apic_timer_interrupt+0x38/0x50 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: apic_timer_interrupt+0x89/0x90 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: 0010:cpuidle_enter_state+0x12b/0x310 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: 0018:ffffffff81e03de8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 000000000000001f Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: 0000000000000000 RSI: 00000000238e2b4c RDI: 0000000000000000 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: ffffffff81e03e20 R08: 00000000000000af R09: 0000000000000018 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: 00000000000000af R11: 0000000000000f27 R12: 0000000000000003 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 
kernel: R13: ffff88042fc24918 R14: ffffffff81eae658 R15: 00000093fd9af742 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: </IRQ> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? cpuidle_enter_state+0x119/0x310 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: cpuidle_enter+0x12/0x20 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: call_cpuidle+0x1e/0x40 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: do_idle+0x17f/0x1d0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: cpu_startup_entry+0x5f/0x70 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: rest_init+0xc9/0xd0 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: start_kernel+0x483/0x490 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? early_idt_handler_array+0x120/0x120 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: x86_64_start_reservations+0x2a/0x2c Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: x86_64_start_kernel+0x13c/0x14b Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: secondary_startup_64+0x9f/0x9f Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Code: 04 00 00 89 4d cc e8 b8 88 fd ff 8b 4d cc 45 89 e1 4d 89 e8 48 89 c2 4c 89 f6 48 c7 c7 98 23 d4 81 51 41 57 89 d9 e8 44 48 94 ff <0f>... 63 8e 60 04 Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace 04264863cdced748 ]--- Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Reset adapter unexpectedly Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Down Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx .... 
Jan 24 10:27:05 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295760337, wd-timeout: 5000 jiffies: 4295767040 tx-queues: 1 Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly Jan 24 10:27:27 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Detected Hardware Unit Hang: TDH <43> TDT <90>... Jan 24 10:27:29 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295782403, wd-timeout: 5000 jiffies: 4295789056 tx-queues: 1 Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly Jan 24 10:27:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295802883, wd-timeout: 5000 jiffies: 4295809024 tx-queues: 1 Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Reset adapter unexpectedly Jan 24 10:28:10 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Detected Hardware Unit Hang: TDH <10> TDT <5d>... 
Jan 24 10:28:11 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295827457, wd-timeout: 5000 jiffies: 4295833088 tx-queues: 1 Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly Jan 24 10:28:35 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295841678, wd-timeout: 5000 jiffies: 4295847424 tx-queues: 1 Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:06:00.0 eth2: Reset adapter unexpectedly Jan 24 10:28:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Detected Hardware Unit Hang: TDH <8> TDT <55>... Jan 24 10:28:49 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295874528, wd-timeout: 5000 jiffies: 4295882240 tx-queues: 1 Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0 eth3: Reset adapter unexpectedly Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Down Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx ..... 
[root@lf1003-e3v2-13100124-f20x64 ~]# ethtool -i eth2
driver: e1000e
version: 3.2.6-k
firmware-version: 2.1-2
bus-info: 0000:06:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

[root@lf1003-e3v2-13100124-f20x64 ~]# lspci -vvv -s 0000:06:00.0
06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
Subsystem: Super Micro Computer Inc Device 0000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 18
Region 0: Memory at df600000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at b000 [size=32]
Region 3: Memory at df620000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <128ns, L1 <64us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00002000
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-d2-06-aa
Kernel driver in use: e1000e
Kernel modules: e1000e

My test is a (custom) traffic generator that is setting up 30k tcp connections between two e1000e ports and sending traffic as fast as possible. I'd be happy to help you set up this exact tool on your system(s), but we have seen similar issues with e1000e in other high-speed tests, so I don't think it is specific to this particular test. Maybe this test makes it easier to reproduce however.

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com

^ permalink raw reply [flat|nested] 13+ messages in thread
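For a sense of scale, the load described in the first message of the thread (30000 TCP connections at 56Kbps each) can be compared with the 1000 Mbps link reported in the logs. This is a back-of-the-envelope sketch, not a measurement: it assumes 56Kbps means 56,000 bits per second of payload, ignores protocol overhead, and tries both readings of "bi-directional" (56Kbps in each direction, or 56Kbps split across the two directions).

```python
CONNECTIONS = 30_000
RATE_BPS = 56_000            # assumption: decimal Kbps, payload only
LINK_BPS = 1_000_000_000     # "1000 Mbps Full Duplex" from the link-up messages


def offered_load_bps(connections, per_conn_bps):
    """Aggregate offered load in bits per second for one direction."""
    return connections * per_conn_bps


# Reading 1: every connection sends 56Kbps in each direction.
each_way = offered_load_bps(CONNECTIONS, RATE_BPS)
# Reading 2: the 56Kbps is shared evenly between the two directions.
shared = offered_load_bps(CONNECTIONS, RATE_BPS // 2)

for label, load in (("56Kbps each way", each_way), ("56Kbps shared", shared)):
    print(f"{label}: {load / 1e9:.2f} Gbit/s per direction, "
          f"{load / LINK_BPS:.0%} of line rate")
```

Under the first reading the offered load is about 1.68 Gbit/s per direction, well above what the link can carry; under the second it is about 0.84 Gbit/s, still close to line rate before headers are counted. Either way the interfaces would be running at or near saturation during the test.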
* Re: e1000e hardware unit hangs 2018-01-24 18:31 ` [Intel-wired-lan] " Ben Greear @ 2018-01-24 18:38 ` Denys Fedoryshchenko -1 siblings, 0 replies; 13+ messages in thread From: Denys Fedoryshchenko @ 2018-01-24 18:38 UTC (permalink / raw) To: Ben Greear Cc: Neftin, Sasha, Alexander Duyck, intel-wired-lan, e1000-devel, netdev, netdev-owner On 2018-01-24 20:31, Ben Greear wrote: > On 01/24/2018 08:34 AM, Neftin, Sasha wrote: >> On 1/24/2018 18:11, Alexander Duyck wrote: >>> On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <greearb@candelatech.com> >>> wrote: >>>> Hello, >>>> >>>> Anyone have any more suggestions for making e1000e work better? >>>> This is >>>> from a 4.9.65+ kernel, >>>> with these additional e1000e patches applied: >>>> >>>> e1000e: Fix error path in link detection >>>> e1000e: Fix wrong comment related to link detection >>>> e1000e: Fix return value test >>>> e1000e: Separate signaling for link check/link up >>>> e1000e: Avoid receiver overrun interrupt bursts >>> >>> Most of these patches shouldn't address anything that would trigger >>> Tx >>> hangs. They are mostly related to just link detection. >>> >>>> Test case is simply to run 30000 tcp connections each trying to send >>>> 56Kbps >>>> of bi-directional >>>> data between a pair of e1000e interfaces :) >>>> >>>> No OOM related issues are seen on this kernel...similar test on 4.13 >>>> showed >>>> some OOM >>>> issues, but I have not debugged that yet... >>> >>> Really a question like this probably belongs on e1000-devel or >>> intel-wired-lan so I have added those lists and the e1000e maintainer >>> to the thread. >>> >>> It would be useful if you could provide more information about the >>> device itself such as the ID and the kind of test you are running. >>> Keep in mind the e1000e driver supports a pretty broad swath of >>> devices so we need to narrow things down a bit. 
>>> >> please, also re-check if your kernel include: >> e1000e: fix buffer overrun while the I219 is processing DMA >> transactions >> e1000e: fix the use of magic numbers for buffer overrun issue >> where you take fresh version of kernel? > > Hello, > > I tried adding those two patches, but I still see this splat shortly > after starting > my test. The kernel I am using is here: > > https://github.com/greearb/linux-ct-4.13 > > I've seen similar issues at least back to the 4.0 kernel, including > stock kernels and my > own kernels with additional patches. > > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: > eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, > wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut > here ]------------ > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 > PID: 0 at > /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 > dev_watchdog+0x228/0x250 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: > nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c > cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 > Comm: swapper/0 Tainted: G O 4.13.16+ #22 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: > Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: > ffffffff81e104c0 task.stack: ffffffff81e00000 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: > 0010:dev_watchdog+0x228/0x250 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: > 0018:ffff88042fc03e50 EFLAGS: 00010282 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: > 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: > ffff88042fc15b40 RSI: ffff88042fc0dbf8 
RDI: ffff88042fc0dbf8 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: > ffff88042fc03e98 R08: 0000000000000001 R09: 00000000000003c4 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: > 0000000000000000 R11: 00000000000003c4 R12: 0000000000001388 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: > 0000000100050dc3 R14: ffff880417670000 R15: 0000000100052400 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS: > 0000000000000000(0000) GS:ffff88042fc00000(0000) > knlGS:0000000000000000 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS: 0010 DS: 0000 > ES: 0000 CR0: 0000000080050033 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: > 0000000001d14000 CR3: 0000000001e09000 CR4: 00000000001406f0 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace: > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: <IRQ> > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? > qdisc_rcu_free+0x40/0x40 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > call_timer_fn+0x30/0x160 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? > qdisc_rcu_free+0x40/0x40 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > run_timer_softirq+0x1f0/0x450 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? > lapic_next_deadline+0x21/0x30 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? 
> clockevents_program_event+0x78/0xf0 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > __do_softirq+0xc1/0x2c0 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: irq_exit+0xb1/0xc0 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > smp_apic_timer_interrupt+0x38/0x50 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > apic_timer_interrupt+0x89/0x90 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: > 0010:cpuidle_enter_state+0x12b/0x310 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: > 0018:ffffffff81e03de8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: > 0000000000000000 RBX: 0000000000000003 RCX: 000000000000001f > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: > 0000000000000000 RSI: 00000000238e2b4c RDI: 0000000000000000 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: > ffffffff81e03e20 R08: 00000000000000af R09: 0000000000000018 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: > 00000000000000af R11: 0000000000000f27 R12: 0000000000000003 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: > ffff88042fc24918 R14: ffffffff81eae658 R15: 00000093fd9af742 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: </IRQ> > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? > cpuidle_enter_state+0x119/0x310 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > cpuidle_enter+0x12/0x20 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > call_cpuidle+0x1e/0x40 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > do_idle+0x17f/0x1d0 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > cpu_startup_entry+0x5f/0x70 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > rest_init+0xc9/0xd0 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > start_kernel+0x483/0x490 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? 
> early_idt_handler_array+0x120/0x120 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > x86_64_start_reservations+0x2a/0x2c > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > x86_64_start_kernel+0x13c/0x14b > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: > secondary_startup_64+0x9f/0x9f > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Code: 04 00 00 89 > 4d cc e8 b8 88 fd ff 8b 4d cc 45 89 e1 4d 89 e8 48 89 c2 4c 89 f6 48 > c7 c7 98 23 d4 81 51 41 57 89 d9 e8 44 48 94 ff <0f>... 63 8e 60 04 > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace > 04264863cdced748 ]--- > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e > 0000:06:00.0 eth2: Reset adapter unexpectedly > Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC > Link is Down > Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC > Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC > Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > > .... > > > Jan 24 10:27:05 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC > Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: > eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295760337, > wd-timeout: 5000 jiffies: 4295767040 tx-queues: 1 > Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: e1000e > 0000:07:00.0 eth3: Reset adapter unexpectedly > Jan 24 10:27:27 lf1003-e3v2-13100124-f20x64 kernel: e1000e > 0000:06:00.0 eth2: Detected Hardware Unit Hang: > TDH > <43> > TDT > <90>... 
> Jan 24 10:27:29 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC > Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: > eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295782403, > wd-timeout: 5000 jiffies: 4295789056 tx-queues: 1 > Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: e1000e > 0000:07:00.0 eth3: Reset adapter unexpectedly > Jan 24 10:27:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC > Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: > eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295802883, > wd-timeout: 5000 jiffies: 4295809024 tx-queues: 1 > Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: e1000e > 0000:06:00.0 eth2: Reset adapter unexpectedly > Jan 24 10:28:10 lf1003-e3v2-13100124-f20x64 kernel: e1000e > 0000:07:00.0 eth3: Detected Hardware Unit Hang: > TDH > <10> > TDT > <5d>... > Jan 24 10:28:11 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC > Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: > eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295827457, > wd-timeout: 5000 jiffies: 4295833088 tx-queues: 1 > Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: e1000e > 0000:07:00.0 eth3: Reset adapter unexpectedly > Jan 24 10:28:35 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC > Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: > eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295841678, > wd-timeout: 5000 jiffies: 4295847424 tx-queues: 1 > Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e > 0000:06:00.0 eth2: Reset adapter unexpectedly > Jan 24 10:28:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e > 0000:07:00.0 eth3: Detected Hardware Unit Hang: > TDH > <8> > TDT > <55>... 
> Jan 24 10:28:49 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC > Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: > eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295874528, > wd-timeout: 5000 jiffies: 4295882240 tx-queues: 1 > Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e > 0000:07:00.0 eth3: Reset adapter unexpectedly > Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC > Link is Down > Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC > Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC > Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx > > ..... > > > [root@lf1003-e3v2-13100124-f20x64 ~]# ethtool -i eth2 > driver: e1000e > version: 3.2.6-k > firmware-version: 2.1-2 > bus-info: 0000:06:00.0 > supports-statistics: yes > supports-test: yes > supports-eeprom-access: yes > supports-register-dump: yes > supports-priv-flags: no > > [root@lf1003-e3v2-13100124-f20x64 ~]# lspci -vvv -s 0000:06:00.0 > 06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network > Connection > Subsystem: Super Micro Computer Inc Device 0000 > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR+ FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0, Cache Line Size: 64 bytes > Interrupt: pin A routed to IRQ 18 > Region 0: Memory at df600000 (32-bit, non-prefetchable) [size=128K] > Region 2: I/O ports at b000 [size=32] > Region 3: Memory at df620000 (32-bit, non-prefetchable) [size=16K] > Capabilities: [c8] Power Management version 2 > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA > PME(D0+,D1-,D2-,D3hot+,D3cold+) > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- > Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ > Address: 
0000000000000000 Data: 0000 > Capabilities: [e0] Express (v1) Endpoint, MSI 00 > DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 > <64us > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- > DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+ > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ > MaxPayload 128 bytes, MaxReadReq 512 bytes > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend- > LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency > L0s <128ns, L1 <64us > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp- > LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- > BWMgmt- ABWMgmt- > Capabilities: [a0] MSI-X: Enable+ Count=5 Masked- > Vector table: BAR=3 offset=00000000 > PBA: BAR=3 offset=00002000 > Capabilities: [100 v1] Advanced Error Reporting > UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- > MalfTLP- ECRC- UnsupReq- ACSViol- > UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- > MalfTLP- ECRC- UnsupReq- ACSViol- > UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ > MalfTLP+ ECRC- UnsupReq- ACSViol- > CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- > CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ > AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- > Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-d2-06-aa > Kernel driver in use: e1000e > Kernel modules: e1000e > > > My test is a (custom) traffic generator that is setting up 30k tcp > connections > between two e1000e ports and sending traffic as fast as possible. > I'd be happy to help you set up this exact tool on your system(s), > but we have seen similar issues with e1000e in other high-speed tests, > so I don't think it > is specific to this particular test. Maybe this test makes it easier > to reproduce > however. 
Silly suggestion: maybe it is worth trying to disable TSO?

ethtool -K eth2 tso off
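Since both eth2 and eth3 show hangs in the logs, the suggestion above would presumably need to be applied to both ports. A minimal sketch of doing that from a script follows; the helper names are hypothetical, the interface names are taken from the logs, and the script only prints the commands unless dry_run is set to False (running them for real needs root and the ethtool binary). `ethtool -k` (lower case) lists the current offload settings, and `ethtool -K` changes them.

```python
import shlex
import subprocess


def tso_query_command(iface):
    """Command that lists the current offload settings of iface."""
    return ["ethtool", "-k", iface]


def tso_off_command(iface):
    """Command that disables TCP segmentation offload on iface."""
    return ["ethtool", "-K", iface, "tso", "off"]


def disable_tso(ifaces, dry_run=True):
    """Print, and optionally run, the TSO-off command for each interface."""
    commands = [tso_off_command(iface) for iface in ifaces]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # Raises CalledProcessError if ethtool reports a failure.
            subprocess.run(cmd, check=True)
    return commands


disable_tso(["eth2", "eth3"])
```

The setting does not persist across reboots or driver reloads, so it has to be reapplied before each test run.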
* Re: e1000e hardware unit hangs 2018-01-24 18:38 ` [Intel-wired-lan] " Denys Fedoryshchenko @ 2018-01-24 18:41 ` Ben Greear -1 siblings, 0 replies; 13+ messages in thread From: Ben Greear @ 2018-01-24 18:41 UTC (permalink / raw) To: Denys Fedoryshchenko Cc: Neftin, Sasha, Alexander Duyck, intel-wired-lan, e1000-devel, netdev, netdev-owner On 01/24/2018 10:38 AM, Denys Fedoryshchenko wrote: > On 2018-01-24 20:31, Ben Greear wrote: >> On 01/24/2018 08:34 AM, Neftin, Sasha wrote: >>> On 1/24/2018 18:11, Alexander Duyck wrote: >>>> On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <greearb@candelatech.com> wrote: >>>>> Hello, >>>>> >>>>> Anyone have any more suggestions for making e1000e work better? This is >>>>> from a 4.9.65+ kernel, >>>>> with these additional e1000e patches applied: >>>>> >>>>> e1000e: Fix error path in link detection >>>>> e1000e: Fix wrong comment related to link detection >>>>> e1000e: Fix return value test >>>>> e1000e: Separate signaling for link check/link up >>>>> e1000e: Avoid receiver overrun interrupt bursts >>>> >>>> Most of these patches shouldn't address anything that would trigger Tx >>>> hangs. They are mostly related to just link detection. >>>> >>>>> Test case is simply to run 30000 tcp connections each trying to send 56Kbps >>>>> of bi-directional >>>>> data between a pair of e1000e interfaces :) >>>>> >>>>> No OOM related issues are seen on this kernel...similar test on 4.13 showed >>>>> some OOM >>>>> issues, but I have not debugged that yet... >>>> >>>> Really a question like this probably belongs on e1000-devel or >>>> intel-wired-lan so I have added those lists and the e1000e maintainer >>>> to the thread. >>>> >>>> It would be useful if you could provide more information about the >>>> device itself such as the ID and the kind of test you are running. >>>> Keep in mind the e1000e driver supports a pretty broad swath of >>>> devices so we need to narrow things down a bit. 
>>>> >>> please, also re-check if your kernel include: >>> e1000e: fix buffer overrun while the I219 is processing DMA transactions >>> e1000e: fix the use of magic numbers for buffer overrun issue >>> where you take fresh version of kernel? >> >> Hello, >> >> I tried adding those two patches, but I still see this splat shortly >> after starting >> my test. The kernel I am using is here: >> >> https://github.com/greearb/linux-ct-4.13 >> >> I've seen similar issues at least back to the 4.0 kernel, including >> stock kernels and my >> own kernels with additional patches. >> >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, >> wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut >> here ]------------ >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 >> PID: 0 at >> /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 >> dev_watchdog+0x228/0x250 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: >> nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c >> cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 >> Comm: swapper/0 Tainted: G O 4.13.16+ #22 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: >> Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: >> ffffffff81e104c0 task.stack: ffffffff81e00000 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: >> 0010:dev_watchdog+0x228/0x250 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: >> 0018:ffff88042fc03e50 EFLAGS: 00010282 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: >> 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: 
>> ffff88042fc15b40 RSI: ffff88042fc0dbf8 RDI: ffff88042fc0dbf8 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: >> ffff88042fc03e98 R08: 0000000000000001 R09: 00000000000003c4 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: >> 0000000000000000 R11: 00000000000003c4 R12: 0000000000001388 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: >> 0000000100050dc3 R14: ffff880417670000 R15: 0000000100052400 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS: >> 0000000000000000(0000) GS:ffff88042fc00000(0000) >> knlGS:0000000000000000 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS: 0010 DS: 0000 >> ES: 0000 CR0: 0000000080050033 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: >> 0000000001d14000 CR3: 0000000001e09000 CR4: 00000000001406f0 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace: >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: <IRQ> >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: call_timer_fn+0x30/0x160 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? qdisc_rcu_free+0x40/0x40 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >> run_timer_softirq+0x1f0/0x450 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? >> lapic_next_deadline+0x21/0x30 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? 
>> clockevents_program_event+0x78/0xf0 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: __do_softirq+0xc1/0x2c0 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: irq_exit+0xb1/0xc0 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >> smp_apic_timer_interrupt+0x38/0x50 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >> apic_timer_interrupt+0x89/0x90 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: >> 0010:cpuidle_enter_state+0x12b/0x310 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: >> 0018:ffffffff81e03de8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: >> 0000000000000000 RBX: 0000000000000003 RCX: 000000000000001f >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: >> 0000000000000000 RSI: 00000000238e2b4c RDI: 0000000000000000 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: >> ffffffff81e03e20 R08: 00000000000000af R09: 0000000000000018 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: >> 00000000000000af R11: 0000000000000f27 R12: 0000000000000003 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: >> ffff88042fc24918 R14: ffffffff81eae658 R15: 00000093fd9af742 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: </IRQ> >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? >> cpuidle_enter_state+0x119/0x310 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: cpuidle_enter+0x12/0x20 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: call_cpuidle+0x1e/0x40 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: do_idle+0x17f/0x1d0 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: cpu_startup_entry+0x5f/0x70 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: rest_init+0xc9/0xd0 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: start_kernel+0x483/0x490 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? 
>> early_idt_handler_array+0x120/0x120 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >> x86_64_start_reservations+0x2a/0x2c >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >> x86_64_start_kernel+0x13c/0x14b >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >> secondary_startup_64+0x9f/0x9f >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Code: 04 00 00 89 >> 4d cc e8 b8 88 fd ff 8b 4d cc 45 89 e1 4d 89 e8 48 89 c2 4c 89 f6 48 >> c7 c7 98 23 d4 81 51 41 57 89 d9 e8 44 48 94 ff <0f>... 63 8e 60 04 >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace >> 04264863cdced748 ]--- >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e >> 0000:06:00.0 eth2: Reset adapter unexpectedly >> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >> Link is Down >> Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> >> .... >> >> >> Jan 24 10:27:05 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295760337, >> wd-timeout: 5000 jiffies: 4295767040 tx-queues: 1 >> Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: e1000e >> 0000:07:00.0 eth3: Reset adapter unexpectedly >> Jan 24 10:27:27 lf1003-e3v2-13100124-f20x64 kernel: e1000e >> 0000:06:00.0 eth2: Detected Hardware Unit Hang: >> TDH <43> >> TDT >> <90>... 
>> Jan 24 10:27:29 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295782403, >> wd-timeout: 5000 jiffies: 4295789056 tx-queues: 1 >> Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: e1000e >> 0000:07:00.0 eth3: Reset adapter unexpectedly >> Jan 24 10:27:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295802883, >> wd-timeout: 5000 jiffies: 4295809024 tx-queues: 1 >> Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: e1000e >> 0000:06:00.0 eth2: Reset adapter unexpectedly >> Jan 24 10:28:10 lf1003-e3v2-13100124-f20x64 kernel: e1000e >> 0000:07:00.0 eth3: Detected Hardware Unit Hang: >> TDH <10> >> TDT >> <5d>... 
>> Jan 24 10:28:11 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295827457, >> wd-timeout: 5000 jiffies: 4295833088 tx-queues: 1 >> Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: e1000e >> 0000:07:00.0 eth3: Reset adapter unexpectedly >> Jan 24 10:28:35 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295841678, >> wd-timeout: 5000 jiffies: 4295847424 tx-queues: 1 >> Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e >> 0000:06:00.0 eth2: Reset adapter unexpectedly >> Jan 24 10:28:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e >> 0000:07:00.0 eth3: Detected Hardware Unit Hang: >> TDH <8> >> TDT >> <55>... >> Jan 24 10:28:49 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295874528, >> wd-timeout: 5000 jiffies: 4295882240 tx-queues: 1 >> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e >> 0000:07:00.0 eth3: Reset adapter unexpectedly >> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >> Link is Down >> Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >> >> ..... 
>> >> >> [root@lf1003-e3v2-13100124-f20x64 ~]# ethtool -i eth2 >> driver: e1000e >> version: 3.2.6-k >> firmware-version: 2.1-2 >> bus-info: 0000:06:00.0 >> supports-statistics: yes >> supports-test: yes >> supports-eeprom-access: yes >> supports-register-dump: yes >> supports-priv-flags: no >> >> [root@lf1003-e3v2-13100124-f20x64 ~]# lspci -vvv -s 0000:06:00.0 >> 06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection >> Subsystem: Super Micro Computer Inc Device 0000 >> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> Stepping- SERR+ FastB2B- DisINTx+ >> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> <TAbort- <MAbort- >SERR- <PERR- INTx- >> Latency: 0, Cache Line Size: 64 bytes >> Interrupt: pin A routed to IRQ 18 >> Region 0: Memory at df600000 (32-bit, non-prefetchable) [size=128K] >> Region 2: I/O ports at b000 [size=32] >> Region 3: Memory at df620000 (32-bit, non-prefetchable) [size=16K] >> Capabilities: [c8] Power Management version 2 >> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) >> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- >> Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ >> Address: 0000000000000000 Data: 0000 >> Capabilities: [e0] Express (v1) Endpoint, MSI 00 >> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us >> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- >> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+ >> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ >> MaxPayload 128 bytes, MaxReadReq 512 bytes >> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend- >> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency >> L0s <128ns, L1 <64us >> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp- >> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ >> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- >> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- 
>> BWMgmt- ABWMgmt- >> Capabilities: [a0] MSI-X: Enable+ Count=5 Masked- >> Vector table: BAR=3 offset=00000000 >> PBA: BAR=3 offset=00002000 >> Capabilities: [100 v1] Advanced Error Reporting >> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- >> MalfTLP- ECRC- UnsupReq- ACSViol- >> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- >> MalfTLP- ECRC- UnsupReq- ACSViol- >> UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ >> MalfTLP+ ECRC- UnsupReq- ACSViol- >> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- >> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ >> AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- >> Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-d2-06-aa >> Kernel driver in use: e1000e >> Kernel modules: e1000e >> >> >> My test is a (custom) traffic generator that is setting up 30k tcp connections >> between two e1000e ports and sending traffic as fast as possible. >> I'd be happy to help you set up this exact tool on your system(s), >> but we have seen similar issues with e1000e in other high-speed tests, >> so I don't think it >> is specific to this particular test. Maybe this test makes it easier >> to reproduce >> however.
>
> Silly suggestion:
> Maybe worth to try disabling TSO?
> ethtool -K eth2 tso off

I tried that just now...and the problem did not change.

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
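For anyone comparing the watchdog and hang lines across runs, here is a rough Python sketch that pulls the numbers out of the log text quoted above. It is not part of any kernel or ethtool tooling. The function names are made up for illustration, and it rests on three assumptions: the "trans_start", "wd-timeout" and "jiffies" fields are all in jiffies (they come from the reporter's kernel message format), TDH/TDT are printed in hex, and the Tx ring has 256 descriptors. The ring size is not stated anywhere in this thread, so pass the real value if it differs.

```python
import re

# Pattern for the NETDEV WATCHDOG lines as they appear in this thread.
# \s+ is used between tokens so re-wrapped log lines still match once
# the "> " quote markers have been stripped.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG:\s+(?P<dev>\S+)\s+\((?P<driver>[^)]+)\):\s+"
    r"transmit queue\s+(?P<queue>\d+)\s+timed out,\s+"
    r"trans_start:\s+(?P<trans_start>\d+),\s+"
    r"wd-timeout:\s+(?P<timeout>\d+)\s+jiffies:\s+(?P<jiffies>\d+)"
)

# Pattern for the "Detected Hardware Unit Hang" head/tail registers.
HANG_RE = re.compile(
    r"TDH\s+<(?P<tdh>[0-9a-fA-F]+)>\s+TDT\s+<(?P<tdt>[0-9a-fA-F]+)>"
)


def watchdog_overrun(line):
    """Return (device, elapsed jiffies, jiffies past the timeout), or None.

    Assumes all three numeric fields are in the same unit (jiffies).
    """
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    elapsed = int(m["jiffies"]) - int(m["trans_start"])
    return m["dev"], elapsed, elapsed - int(m["timeout"])


def pending_descriptors(line, ring_size=256):
    """Return the number of Tx descriptors between head and tail, or None.

    ring_size=256 is an assumption, not a value taken from the thread.
    TDH and TDT are parsed as hex.
    """
    m = HANG_RE.search(line)
    if m is None:
        return None
    return (int(m["tdt"], 16) - int(m["tdh"], 16)) % ring_size
```

On the 10:19:42 eth2 watchdog line this gives 5693 jiffies since trans_start, which is 693 past the 5000 timeout. Under the assumed ring size, each of the three TDH/TDT pairs in the log (43/90, 10/5d, 8/55) works out to 77 outstanding descriptors.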
* Re: e1000e hardware unit hangs 2018-01-24 18:41 ` [Intel-wired-lan] " Ben Greear @ 2018-01-25 8:29 ` Neftin, Sasha -1 siblings, 0 replies; 13+ messages in thread From: Neftin, Sasha @ 2018-01-25 8:29 UTC (permalink / raw) To: Ben Greear, Denys Fedoryshchenko Cc: Alexander Duyck, intel-wired-lan, e1000-devel, netdev, netdev-owner On 1/24/2018 20:41, Ben Greear wrote: > On 01/24/2018 10:38 AM, Denys Fedoryshchenko wrote: >> On 2018-01-24 20:31, Ben Greear wrote: >>> On 01/24/2018 08:34 AM, Neftin, Sasha wrote: >>>> On 1/24/2018 18:11, Alexander Duyck wrote: >>>>> On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear >>>>> <greearb@candelatech.com> wrote: >>>>>> Hello, >>>>>> >>>>>> Anyone have any more suggestions for making e1000e work better? >>>>>> This is >>>>>> from a 4.9.65+ kernel, >>>>>> with these additional e1000e patches applied: >>>>>> >>>>>> e1000e: Fix error path in link detection >>>>>> e1000e: Fix wrong comment related to link detection >>>>>> e1000e: Fix return value test >>>>>> e1000e: Separate signaling for link check/link up >>>>>> e1000e: Avoid receiver overrun interrupt bursts >>>>> >>>>> Most of these patches shouldn't address anything that would trigger Tx >>>>> hangs. They are mostly related to just link detection. >>>>> >>>>>> Test case is simply to run 30000 tcp connections each trying to >>>>>> send 56Kbps >>>>>> of bi-directional >>>>>> data between a pair of e1000e interfaces :) >>>>>> >>>>>> No OOM related issues are seen on this kernel...similar test on >>>>>> 4.13 showed >>>>>> some OOM >>>>>> issues, but I have not debugged that yet... >>>>> >>>>> Really a question like this probably belongs on e1000-devel or >>>>> intel-wired-lan so I have added those lists and the e1000e maintainer >>>>> to the thread. >>>>> >>>>> It would be useful if you could provide more information about the >>>>> device itself such as the ID and the kind of test you are running. 
>>>>> Keep in mind the e1000e driver supports a pretty broad swath of >>>>> devices so we need to narrow things down a bit. >>>>> >>>> please, also re-check if your kernel include: >>>> e1000e: fix buffer overrun while the I219 is processing DMA >>>> transactions >>>> e1000e: fix the use of magic numbers for buffer overrun issue >>>> where you take fresh version of kernel? >>> >>> Hello, >>> >>> I tried adding those two patches, but I still see this splat shortly >>> after starting >>> my test. The kernel I am using is here: >>> >>> https://github.com/greearb/linux-ct-4.13 >>> >>> I've seen similar issues at least back to the 4.0 kernel, including >>> stock kernels and my >>> own kernels with additional patches. >>> >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, >>> wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut >>> here ]------------ >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 >>> PID: 0 at >>> /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 >>> dev_watchdog+0x228/0x250 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: >>> nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c >>> cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 >>> Comm: swapper/0 Tainted: G O 4.13.16+ #22 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: >>> Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: >>> ffffffff81e104c0 task.stack: ffffffff81e00000 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: >>> 0010:dev_watchdog+0x228/0x250 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: >>> 0018:ffff88042fc03e50 EFLAGS: 00010282 >>> Jan 24 
10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: >>> 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: >>> ffff88042fc15b40 RSI: ffff88042fc0dbf8 RDI: ffff88042fc0dbf8 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: >>> ffff88042fc03e98 R08: 0000000000000001 R09: 00000000000003c4 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: >>> 0000000000000000 R11: 00000000000003c4 R12: 0000000000001388 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: >>> 0000000100050dc3 R14: ffff880417670000 R15: 0000000100052400 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS: >>> 0000000000000000(0000) GS:ffff88042fc00000(0000) >>> knlGS:0000000000000000 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS: 0010 DS: 0000 >>> ES: 0000 CR0: 0000000080050033 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: >>> 0000000001d14000 CR3: 0000000001e09000 CR4: 00000000001406f0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace: >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: <IRQ> >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? >>> qdisc_rcu_free+0x40/0x40 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> call_timer_fn+0x30/0x160 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? >>> qdisc_rcu_free+0x40/0x40 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> run_timer_softirq+0x1f0/0x450 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? >>> lapic_next_deadline+0x21/0x30 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? 
>>> clockevents_program_event+0x78/0xf0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> __do_softirq+0xc1/0x2c0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: irq_exit+0xb1/0xc0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> smp_apic_timer_interrupt+0x38/0x50 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> apic_timer_interrupt+0x89/0x90 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: >>> 0010:cpuidle_enter_state+0x12b/0x310 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: >>> 0018:ffffffff81e03de8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: >>> 0000000000000000 RBX: 0000000000000003 RCX: 000000000000001f >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: >>> 0000000000000000 RSI: 00000000238e2b4c RDI: 0000000000000000 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: >>> ffffffff81e03e20 R08: 00000000000000af R09: 0000000000000018 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: >>> 00000000000000af R11: 0000000000000f27 R12: 0000000000000003 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: >>> ffff88042fc24918 R14: ffffffff81eae658 R15: 00000093fd9af742 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: </IRQ> >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? >>> cpuidle_enter_state+0x119/0x310 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> cpuidle_enter+0x12/0x20 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> call_cpuidle+0x1e/0x40 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: do_idle+0x17f/0x1d0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> cpu_startup_entry+0x5f/0x70 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: rest_init+0xc9/0xd0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> start_kernel+0x483/0x490 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ? 
>>> early_idt_handler_array+0x120/0x120 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> x86_64_start_reservations+0x2a/0x2c >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> x86_64_start_kernel+0x13c/0x14b >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> secondary_startup_64+0x9f/0x9f >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Code: 04 00 00 89 >>> 4d cc e8 b8 88 fd ff 8b 4d cc 45 89 e1 4d 89 e8 48 89 c2 4c 89 f6 48 >>> c7 c7 98 23 d4 81 51 41 57 89 d9 e8 44 48 94 ff <0f>... 63 8e 60 04 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace >>> 04264863cdced748 ]--- >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:06:00.0 eth2: Reset adapter unexpectedly >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Down >>> Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> >>> .... >>> >>> >>> Jan 24 10:27:05 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295760337, >>> wd-timeout: 5000 jiffies: 4295767040 tx-queues: 1 >>> Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Reset adapter unexpectedly >>> Jan 24 10:27:27 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:06:00.0 eth2: Detected Hardware Unit Hang: >>> >>> TDH <43> >>> TDT >>> <90>... 
>>> Jan 24 10:27:29 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295782403, >>> wd-timeout: 5000 jiffies: 4295789056 tx-queues: 1 >>> Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Reset adapter unexpectedly >>> Jan 24 10:27:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295802883, >>> wd-timeout: 5000 jiffies: 4295809024 tx-queues: 1 >>> Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:06:00.0 eth2: Reset adapter unexpectedly >>> Jan 24 10:28:10 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Detected Hardware Unit Hang: >>> >>> TDH <10> >>> TDT >>> <5d>... 
>>> Jan 24 10:28:11 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295827457, >>> wd-timeout: 5000 jiffies: 4295833088 tx-queues: 1 >>> Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Reset adapter unexpectedly >>> Jan 24 10:28:35 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295841678, >>> wd-timeout: 5000 jiffies: 4295847424 tx-queues: 1 >>> Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:06:00.0 eth2: Reset adapter unexpectedly >>> Jan 24 10:28:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Detected Hardware Unit Hang: >>> >>> TDH <8> >>> TDT >>> <55>... >>> Jan 24 10:28:49 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295874528, >>> wd-timeout: 5000 jiffies: 4295882240 tx-queues: 1 >>> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Reset adapter unexpectedly >>> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Down >>> Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> >>> ..... 
>>> >>> >>> [root@lf1003-e3v2-13100124-f20x64 ~]# ethtool -i eth2 >>> driver: e1000e >>> version: 3.2.6-k >>> firmware-version: 2.1-2 >>> bus-info: 0000:06:00.0 >>> supports-statistics: yes >>> supports-test: yes >>> supports-eeprom-access: yes >>> supports-register-dump: yes >>> supports-priv-flags: no >>> >>> [root@lf1003-e3v2-13100124-f20x64 ~]# lspci -vvv -s 0000:06:00.0 >>> 06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network >>> Connection >>> Subsystem: Super Micro Computer Inc Device 0000 >>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >>> Stepping- SERR+ FastB2B- DisINTx+ >>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >>> <TAbort- <MAbort- >SERR- <PERR- INTx- >>> Latency: 0, Cache Line Size: 64 bytes >>> Interrupt: pin A routed to IRQ 18 >>> Region 0: Memory at df600000 (32-bit, non-prefetchable) [size=128K] >>> Region 2: I/O ports at b000 [size=32] >>> Region 3: Memory at df620000 (32-bit, non-prefetchable) [size=16K] >>> Capabilities: [c8] Power Management version 2 >>> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA >>> PME(D0+,D1-,D2-,D3hot+,D3cold+) >>> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- >>> Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ >>> Address: 0000000000000000 Data: 0000 >>> Capabilities: [e0] Express (v1) Endpoint, MSI 00 >>> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s >>> <512ns, L1 <64us >>> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- >>> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ >>> Unsupported+ >>> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ >>> MaxPayload 128 bytes, MaxReadReq 512 bytes >>> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ >>> TransPend- >>> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, >>> Exit Latency >>> L0s <128ns, L1 <64us >>> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp- >>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ >>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- >>> 
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ >>> DLActive- >>> BWMgmt- ABWMgmt- >>> Capabilities: [a0] MSI-X: Enable+ Count=5 Masked- >>> Vector table: BAR=3 offset=00000000 >>> PBA: BAR=3 offset=00002000 >>> Capabilities: [100 v1] Advanced Error Reporting >>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- >>> RxOF- >>> MalfTLP- ECRC- UnsupReq- ACSViol- >>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- >>> RxOF- >>> MalfTLP- ECRC- UnsupReq- ACSViol- >>> UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- >>> RxOF+ >>> MalfTLP+ ECRC- UnsupReq- ACSViol- >>> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- >>> NonFatalErr- >>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- >>> NonFatalErr+ >>> AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- >>> ChkEn- >>> Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-d2-06-aa >>> Kernel driver in use: e1000e >>> Kernel modules: e1000e >>> >>> >>> My test is a (custom) traffic generator that is setting up 30k tcp >>> connections >>> between two e1000e ports and sending traffic as fast as possible. >>> I'd be happy to help you set up this exact tool on your system(s), >>> but we have seen similar issues with e1000e in other high-speed tests, >>> so I don't think it >>> is specific to this particular test. Maybe this test makes it easier >>> to reproduce >>> however. >> >> Silly suggestion: >> Maybe worth to try disabling TSO? >> ethtool -K eth2 tso off > > > I tried that just now...and the problem did not change. > > Thanks, > Ben > > >

The 82574L is pretty old HW; I am not sure we still support it. Do older kernel versions also hit this problem? Can you try the latest kernel from Linus's tree? In any case, I suggest filing a ticket on SourceForge (https://sourceforge.net/projects/e1000/files/?source=navbar) and attaching dmesg, lspci output, and all other relevant information.
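For anyone assembling the information for such a ticket, a small script can summarize the log excerpts quoted in this message. What follows is only a sketch and was not posted in the thread. It assumes the watchdog line format printed by the reporter's kernel (the `trans_start`, `wd-timeout` and `jiffies` fields; other kernels may print a different message), it assumes the TDH/TDT values in the "Detected Hardware Unit Hang" report are hexadecimal, and it takes the Tx ring size as a parameter because the thread does not state it. The names `stall_in_jiffies` and `pending_descriptors` are hypothetical helpers.

```python
import re

# Watchdog line as it appears in the logs quoted in this thread.
# Assumption: the trans_start / wd-timeout / jiffies fields are specific
# to the reporter's kernel and may be absent elsewhere.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<dev>\S+) \((?P<drv>\S+)\): "
    r"transmit queue (?P<queue>\d+) timed out, "
    r"trans_start: (?P<trans_start>\d+), "
    r"wd-timeout: (?P<timeout>\d+) "
    r"jiffies: (?P<jiffies>\d+)"
)


def stall_in_jiffies(line):
    """Return (device, jiffies elapsed since trans_start), or None if no match."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    elapsed = int(m.group("jiffies")) - int(m.group("trans_start"))
    return m.group("dev"), elapsed


def pending_descriptors(tdh_hex, tdt_hex, ring_size):
    """Descriptors between head (TDH) and tail (TDT), allowing for wraparound.

    Assumption: both values are hex strings, and ring_size is supplied by
    the caller because the thread does not give the configured ring size.
    """
    return (int(tdt_hex, 16) - int(tdh_hex, 16)) % ring_size


SAMPLE = (
    "Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: "
    "eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, "
    "wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1"
)

if __name__ == "__main__":
    print(stall_in_jiffies(SAMPLE))
    # Ring size 256 is an assumed value used for illustration only.
    print(pending_descriptors("43", "90", 256))
```

Applied to the first watchdog line quoted in this message, `stall_in_jiffies` gives `("eth2", 5693)`. With an assumed ring size of 256, the TDH <43> / TDT <90> report works out to 77 descriptors between head and tail.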