From: Ben Greear <greearb@candelatech.com>
To: Denys Fedoryshchenko <denys@visp.net.lb>
Cc: "Neftin, Sasha" <sasha.neftin@intel.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	intel-wired-lan <intel-wired-lan@lists.osuosl.org>,
	e1000-devel@lists.sourceforge.net,
	netdev <netdev@vger.kernel.org>,
	netdev-owner@vger.kernel.org
Subject: Re: e1000e hardware unit hangs
Date: Wed, 24 Jan 2018 10:41:32 -0800
Message-ID: <a96f0749-2491-392a-d726-e8957f4859ce@candelatech.com>
In-Reply-To: <8ade4a34b9e6817c3f4afb5126f37871@visp.net.lb>

On 01/24/2018 10:38 AM, Denys Fedoryshchenko wrote:
> On 2018-01-24 20:31, Ben Greear wrote:
>> On 01/24/2018 08:34 AM, Neftin, Sasha wrote:
>>> On 1/24/2018 18:11, Alexander Duyck wrote:
>>>> On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear <greearb@candelatech.com> wrote:
>>>>> Hello,
>>>>>
>>>>> Anyone have any more suggestions for making e1000e work better?  This is
>>>>> from a 4.9.65+ kernel, with these additional e1000e patches applied:
>>>>>
>>>>> e1000e: Fix error path in link detection
>>>>> e1000e: Fix wrong comment related to link detection
>>>>> e1000e: Fix return value test
>>>>> e1000e: Separate signaling for link check/link up
>>>>> e1000e: Avoid receiver overrun interrupt bursts
>>>>
>>>> Most of these patches shouldn't address anything that would trigger Tx
>>>> hangs. They are mostly related to just link detection.
>>>>
>>>>> Test case is simply to run 30000 tcp connections each trying to send
>>>>> 56Kbps of bi-directional data between a pair of e1000e interfaces :)
>>>>>
>>>>> No OOM related issues are seen on this kernel...similar test on 4.13
>>>>> showed some OOM issues, but I have not debugged that yet...
>>>>
>>>> Really a question like this probably belongs on e1000-devel or
>>>> intel-wired-lan so I have added those lists and the e1000e maintainer
>>>> to the thread.
>>>>
>>>> It would be useful if you could provide more information about the
>>>> device itself such as the ID and the kind of test you are running.
>>>> Keep in mind the e1000e driver supports a pretty broad swath of
>>>> devices so we need to narrow things down a bit.
>>>>
>>> Please also re-check whether your kernel includes:
>>> e1000e: fix buffer overrun while the I219 is processing DMA transactions
>>> e1000e: fix the use of magic numbers for buffer overrun issue
>>> Where do you get your fresh version of the kernel from?
>>
>> Hello,
>>
>> I tried adding those two patches, but I still see this splat shortly
>> after starting my test.  The kernel I am using is here:
>>
>> https://github.com/greearb/linux-ct-4.13
>>
>> I've seen similar issues at least back to the 4.0 kernel, including
>> stock kernels and my own kernels with additional patches.
>>
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG:
>> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499,
>> wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut
>> here ]------------
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0
>> PID: 0 at
>> /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322
>> dev_watchdog+0x228/0x250
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in:
>> nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c
>> cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0
>> Comm: swapper/0 Tainted: G           O    4.13.16+ #22
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name:
>> Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task:
>> ffffffff81e104c0 task.stack: ffffffff81e00000
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP:
>> 0010:dev_watchdog+0x228/0x250
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP:
>> 0018:ffff88042fc03e50 EFLAGS: 00010282
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX:
>> 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX:
>> ffff88042fc15b40 RSI: ffff88042fc0dbf8 RDI: ffff88042fc0dbf8
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP:
>> ffff88042fc03e98 R08: 0000000000000001 R09: 00000000000003c4
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10:
>> 0000000000000000 R11: 00000000000003c4 R12: 0000000000001388
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13:
>> 0000000100050dc3 R14: ffff880417670000 R15: 0000000100052400
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS:
>> 0000000000000000(0000) GS:ffff88042fc00000(0000)
>> knlGS:0000000000000000
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS:  0010 DS: 0000
>> ES: 0000 CR0: 0000000080050033
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2:
>> 0000000001d14000 CR3: 0000000001e09000 CR4: 00000000001406f0
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace:
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  <IRQ>
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? qdisc_rcu_free+0x40/0x40
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  call_timer_fn+0x30/0x160
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? qdisc_rcu_free+0x40/0x40
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
>> run_timer_softirq+0x1f0/0x450
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ?
>> lapic_next_deadline+0x21/0x30
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ?
>> clockevents_program_event+0x78/0xf0
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  __do_softirq+0xc1/0x2c0
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  irq_exit+0xb1/0xc0
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
>> smp_apic_timer_interrupt+0x38/0x50
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
>> apic_timer_interrupt+0x89/0x90
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP:
>> 0010:cpuidle_enter_state+0x12b/0x310
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP:
>> 0018:ffffffff81e03de8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX:
>> 0000000000000000 RBX: 0000000000000003 RCX: 000000000000001f
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX:
>> 0000000000000000 RSI: 00000000238e2b4c RDI: 0000000000000000
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP:
>> ffffffff81e03e20 R08: 00000000000000af R09: 0000000000000018
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10:
>> 00000000000000af R11: 0000000000000f27 R12: 0000000000000003
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13:
>> ffff88042fc24918 R14: ffffffff81eae658 R15: 00000093fd9af742
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  </IRQ>
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ?
>> cpuidle_enter_state+0x119/0x310
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  cpuidle_enter+0x12/0x20
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  call_cpuidle+0x1e/0x40
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  do_idle+0x17f/0x1d0
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  cpu_startup_entry+0x5f/0x70
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  rest_init+0xc9/0xd0
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  start_kernel+0x483/0x490
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ?
>> early_idt_handler_array+0x120/0x120
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
>> x86_64_start_reservations+0x2a/0x2c
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
>> x86_64_start_kernel+0x13c/0x14b
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:
>> secondary_startup_64+0x9f/0x9f
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Code: 04 00 00 89
>> 4d cc e8 b8 88 fd ff 8b 4d cc 45 89 e1 4d 89 e8 48 89 c2 4c 89 f6 48
>> c7 c7 98 23 d4 81 51 41 57 89 d9 e8 44 48 94 ff <0f>... 63 8e 60 04
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace
>> 04264863cdced748 ]---
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e
>> 0000:06:00.0 eth2: Reset adapter unexpectedly
>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC
>> Link is Down
>> Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC
>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC
>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>>
>> ....
>>
>>
>> Jan 24 10:27:05 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC
>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG:
>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295760337,
>> wd-timeout: 5000 jiffies: 4295767040 tx-queues: 1
>> Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: e1000e
>> 0000:07:00.0 eth3: Reset adapter unexpectedly
>> Jan 24 10:27:27 lf1003-e3v2-13100124-f20x64 kernel: e1000e
>> 0000:06:00.0 eth2: Detected Hardware Unit Hang:
>>   TDH                  <43>
>>   TDT                  <90>...
>> Jan 24 10:27:29 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC
>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG:
>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295782403,
>> wd-timeout: 5000 jiffies: 4295789056 tx-queues: 1
>> Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: e1000e
>> 0000:07:00.0 eth3: Reset adapter unexpectedly
>> Jan 24 10:27:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC
>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG:
>> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295802883,
>> wd-timeout: 5000 jiffies: 4295809024 tx-queues: 1
>> Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: e1000e
>> 0000:06:00.0 eth2: Reset adapter unexpectedly
>> Jan 24 10:28:10 lf1003-e3v2-13100124-f20x64 kernel: e1000e
>> 0000:07:00.0 eth3: Detected Hardware Unit Hang:
>>   TDH                  <10>
>>   TDT                  <5d>...
>> Jan 24 10:28:11 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC
>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG:
>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295827457,
>> wd-timeout: 5000 jiffies: 4295833088 tx-queues: 1
>> Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: e1000e
>> 0000:07:00.0 eth3: Reset adapter unexpectedly
>> Jan 24 10:28:35 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC
>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG:
>> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295841678,
>> wd-timeout: 5000 jiffies: 4295847424 tx-queues: 1
>> Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e
>> 0000:06:00.0 eth2: Reset adapter unexpectedly
>> Jan 24 10:28:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e
>> 0000:07:00.0 eth3: Detected Hardware Unit Hang:
>>   TDH                  <8>
>>   TDT                  <55>...
>> Jan 24 10:28:49 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC
>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG:
>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295874528,
>> wd-timeout: 5000 jiffies: 4295882240 tx-queues: 1
>> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e
>> 0000:07:00.0 eth3: Reset adapter unexpectedly
>> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC
>> Link is Down
>> Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC
>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>> Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC
>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
>>
>> .....
>>
>>
>> [root@lf1003-e3v2-13100124-f20x64 ~]# ethtool -i eth2
>> driver: e1000e
>> version: 3.2.6-k
>> firmware-version: 2.1-2
>> bus-info: 0000:06:00.0
>> supports-statistics: yes
>> supports-test: yes
>> supports-eeprom-access: yes
>> supports-register-dump: yes
>> supports-priv-flags: no
>>
>> [root@lf1003-e3v2-13100124-f20x64 ~]# lspci -vvv -s 0000:06:00.0
>> 06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
>>     Subsystem: Super Micro Computer Inc Device 0000
>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>> Stepping- SERR+ FastB2B- DisINTx+
>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>     Latency: 0, Cache Line Size: 64 bytes
>>     Interrupt: pin A routed to IRQ 18
>>     Region 0: Memory at df600000 (32-bit, non-prefetchable) [size=128K]
>>     Region 2: I/O ports at b000 [size=32]
>>     Region 3: Memory at df620000 (32-bit, non-prefetchable) [size=16K]
>>     Capabilities: [c8] Power Management version 2
>>         Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>>         Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>>     Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
>>         Address: 0000000000000000  Data: 0000
>>     Capabilities: [e0] Express (v1) Endpoint, MSI 00
>>         DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>>         DevCtl:    Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>             MaxPayload 128 bytes, MaxReadReq 512 bytes
>>         DevSta:    CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>>         LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency
>> L0s <128ns, L1 <64us
>>             ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
>>         LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>>             ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>         LnkSta:    Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive-
>> BWMgmt- ABWMgmt-
>>     Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
>>         Vector table: BAR=3 offset=00000000
>>         PBA: BAR=3 offset=00002000
>>     Capabilities: [100 v1] Advanced Error Reporting
>>         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
>> MalfTLP- ECRC- UnsupReq- ACSViol-
>>         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
>> MalfTLP- ECRC- UnsupReq- ACSViol-
>>         UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+
>> MalfTLP+ ECRC- UnsupReq- ACSViol-
>>         CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>>         CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>>         AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>>     Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-d2-06-aa
>>     Kernel driver in use: e1000e
>>     Kernel modules: e1000e
>>
>>
>> My test is a (custom) traffic generator that is setting up 30k tcp connections
>> between two e1000e ports and sending traffic as fast as possible.
>> I'd be happy to help you set up this exact tool on your system(s),
>> but we have seen similar issues with e1000e in other high-speed tests,
>> so I don't think it is specific to this particular test.  Maybe this test
>> makes it easier to reproduce, however.
>
> Silly suggestion:
> Maybe it is worth trying with TSO disabled?
> ethtool -K eth2 tso off


I tried that just now...and the problem did not change.
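
For anyone comparing the reports quoted above, here is a minimal Python
sketch (not part of the original test setup) that pulls the numbers out of
the log lines.  It assumes the watchdog message format shown in this log,
which includes trans_start, wd-timeout and jiffies, and it assumes the
TDH/TDT values are printed in hex.  The Tx ring size of 256 is also an
assumption, since the ring size is not stated anywhere in this thread, and
the function names are made up for illustration.

```python
import re

# Pattern for the watchdog lines as they appear in this thread. Whitespace is
# matched loosely because the mail client wrapped the log lines.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG:\s+(?P<dev>\S+)\s+\((?P<driver>[^)]+)\):\s+"
    r"transmit queue\s+(?P<queue>\d+)\s+timed out,\s+"
    r"trans_start:\s+(?P<start>\d+),\s+"
    r"wd-timeout:\s+(?P<timeout>\d+)\s+jiffies:\s+(?P<now>\d+)"
)


def stalled_jiffies(line):
    """Return (device, jiffies since trans_start, wd-timeout), or None."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    elapsed = int(m.group("now")) - int(m.group("start"))
    return m.group("dev"), elapsed, int(m.group("timeout"))


def pending_descriptors(tdh_hex, tdt_hex, ring_size=256):
    """Descriptors between head (TDH) and tail (TDT), allowing for wrap.

    ring_size=256 is an assumption, not a value taken from the thread.
    """
    return (int(tdt_hex, 16) - int(tdh_hex, 16)) % ring_size


if __name__ == "__main__":
    sample = (
        "NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out, "
        "trans_start: 4295298499, wd-timeout: 5000 jiffies: 4295304192 "
        "tx-queues: 1"
    )
    print(stalled_jiffies(sample))
    # TDH/TDT pairs copied from the three "Hardware Unit Hang" reports.
    for tdh, tdt in (("43", "90"), ("10", "5d"), ("8", "55")):
        print(tdh, tdt, pending_descriptors(tdh, tdt))
```

On the lines quoted above this gives 5693 jiffies since trans_start for the
first eth2 timeout (against a wd-timeout of 5000), and a TDT-TDH gap of 77
descriptors for each of the three hang reports.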

Thanks,
Ben



-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

