From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Neftin, Sasha"
Subject: Re: e1000e hardware unit hangs
Date: Thu, 25 Jan 2018 10:29:38 +0200
Message-ID: <04631dcd-1d0b-1b06-fae7-3889f148a591@intel.com>
References: <51bbb33a-e7dd-88c0-4fff-bebb6ef75a78@candelatech.com> <8ade4a34b9e6817c3f4afb5126f37871@visp.net.lb>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Cc: Alexander Duyck, intel-wired-lan, e1000-devel@lists.sourceforge.net, netdev, netdev-owner@vger.kernel.org
To: Ben Greear, Denys Fedoryshchenko
Return-path:
Received: from mga05.intel.com ([192.55.52.43]:46865 "EHLO mga05.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751081AbeAYI3l (ORCPT ); Thu, 25 Jan 2018 03:29:41 -0500
In-Reply-To:
Content-Language: en-US
Sender: netdev-owner@vger.kernel.org
List-ID:

On 1/24/2018 20:41, Ben Greear wrote:
> On 01/24/2018 10:38 AM, Denys Fedoryshchenko wrote:
>> On 2018-01-24 20:31, Ben Greear wrote:
>>> On 01/24/2018 08:34 AM, Neftin, Sasha wrote:
>>>> On 1/24/2018 18:11, Alexander Duyck wrote:
>>>>> On Tue, Jan 23, 2018 at 3:46 PM, Ben Greear wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Anyone have any more suggestions for making e1000e work better?
>>>>>> This is from a 4.9.65+ kernel, with these additional e1000e patches applied:
>>>>>>
>>>>>> e1000e: Fix error path in link detection
>>>>>> e1000e: Fix wrong comment related to link detection
>>>>>> e1000e: Fix return value test
>>>>>> e1000e: Separate signaling for link check/link up
>>>>>> e1000e: Avoid receiver overrun interrupt bursts
>>>>>
>>>>> Most of these patches shouldn't address anything that would trigger Tx
>>>>> hangs. They are mostly related to just link detection.
>>>>>
>>>>>> Test case is simply to run 30000 tcp connections each trying to
>>>>>> send 56Kbps of bi-directional data between a pair of e1000e interfaces :)
>>>>>>
>>>>>> No OOM related issues are seen on this kernel...similar test on 4.13
>>>>>> showed some OOM issues, but I have not debugged that yet...
>>>>>
>>>>> Really a question like this probably belongs on e1000-devel or
>>>>> intel-wired-lan so I have added those lists and the e1000e maintainer
>>>>> to the thread.
>>>>>
>>>>> It would be useful if you could provide more information about the
>>>>> device itself such as the ID and the kind of test you are running.
>>>>> Keep in mind the e1000e driver supports a pretty broad swath of
>>>>> devices so we need to narrow things down a bit.
>>>>>
>>>> please, also re-check if your kernel include:
>>>> e1000e: fix buffer overrun while the I219 is processing DMA transactions
>>>> e1000e: fix the use of magic numbers for buffer overrun issue
>>>> where you take fresh version of kernel?
>>>
>>> Hello,
>>>
>>> I tried adding those two patches, but I still see this splat shortly
>>> after starting my test.  The kernel I am using is here:
>>>
>>> https://github.com/greearb/linux-ct-4.13
>>>
>>> I've seen similar issues at least back to the 4.0 kernel, including
>>> stock kernels and my own kernels with additional patches.
>>> >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295298499, >>> wd-timeout: 5000 jiffies: 4295304192 tx-queues: 1 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ------------[ cut >>> here ]------------ >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: WARNING: CPU: 0 >>> PID: 0 at >>> /home/greearb/git/linux-4.13.dev.y/net/sched/sch_generic.c:322 >>> dev_watchdog+0x228/0x250 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Modules linked in: >>> nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c >>> cfg80211 macvlan wanlink(O) pktgen bnep bluetooth f...ss tpm_tis ip >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CPU: 0 PID: 0 >>> Comm: swapper/0 Tainted: G           O    4.13.16+ #22 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Hardware name: >>> Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 2.0b 09/17/2012 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: task: >>> ffffffff81e104c0 task.stack: ffffffff81e00000 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: >>> 0010:dev_watchdog+0x228/0x250 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: >>> 0018:ffff88042fc03e50 EFLAGS: 00010282 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: >>> 0000000000000086 RBX: 0000000000000000 RCX: 0000000000000000 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: >>> ffff88042fc15b40 RSI: ffff88042fc0dbf8 RDI: ffff88042fc0dbf8 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: >>> ffff88042fc03e98 R08: 0000000000000001 R09: 00000000000003c4 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: >>> 0000000000000000 R11: 00000000000003c4 R12: 0000000000001388 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: >>> 0000000100050dc3 R14: ffff880417670000 R15: 0000000100052400 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: FS: >>> 0000000000000000(0000) 
GS:ffff88042fc00000(0000) >>> knlGS:0000000000000000 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CS:  0010 DS: 0000 >>> ES: 0000 CR0: 0000000080050033 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: CR2: >>> 0000000001d14000 CR3: 0000000001e09000 CR4: 00000000001406f0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Call Trace: >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? >>> qdisc_rcu_free+0x40/0x40 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> call_timer_fn+0x30/0x160 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? >>> qdisc_rcu_free+0x40/0x40 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> run_timer_softirq+0x1f0/0x450 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? >>> lapic_next_deadline+0x21/0x30 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? >>> clockevents_program_event+0x78/0xf0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> __do_softirq+0xc1/0x2c0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  irq_exit+0xb1/0xc0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> smp_apic_timer_interrupt+0x38/0x50 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> apic_timer_interrupt+0x89/0x90 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RIP: >>> 0010:cpuidle_enter_state+0x12b/0x310 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RSP: >>> 0018:ffffffff81e03de8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RAX: >>> 0000000000000000 RBX: 0000000000000003 RCX: 000000000000001f >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RDX: >>> 0000000000000000 RSI: 00000000238e2b4c RDI: 0000000000000000 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: RBP: >>> ffffffff81e03e20 R08: 00000000000000af R09: 0000000000000018 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R10: >>> 00000000000000af 
R11: 0000000000000f27 R12: 0000000000000003 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: R13: >>> ffff88042fc24918 R14: ffffffff81eae658 R15: 00000093fd9af742 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? >>> cpuidle_enter_state+0x119/0x310 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> cpuidle_enter+0x12/0x20 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> call_cpuidle+0x1e/0x40 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  do_idle+0x17f/0x1d0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> cpu_startup_entry+0x5f/0x70 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  rest_init+0xc9/0xd0 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> start_kernel+0x483/0x490 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel:  ? >>> early_idt_handler_array+0x120/0x120 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> x86_64_start_reservations+0x2a/0x2c >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> x86_64_start_kernel+0x13c/0x14b >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: >>> secondary_startup_64+0x9f/0x9f >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: Code: 04 00 00 89 >>> 4d cc e8 b8 88 fd ff 8b 4d cc 45 89 e1 4d 89 e8 48 89 c2 4c 89 f6 48 >>> c7 c7 98 23 d4 81 51 41 57 89 d9 e8 44 48 94 ff <0f>... 63 8e 60 04 >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: ---[ end trace >>> 04264863cdced748 ]--- >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:06:00.0 eth2: Reset adapter unexpectedly >>> Jan 24 10:19:42 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Down >>> Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:19:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> >>> .... 
>>> >>> >>> Jan 24 10:27:05 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295760337, >>> wd-timeout: 5000 jiffies: 4295767040 tx-queues: 1 >>> Jan 24 10:27:24 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Reset adapter unexpectedly >>> Jan 24 10:27:27 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:06:00.0 eth2: Detected Hardware Unit Hang: >>> >>> TDH                  <43> >>>                                                       TDT >>>     <90>... >>> Jan 24 10:27:29 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295782403, >>> wd-timeout: 5000 jiffies: 4295789056 tx-queues: 1 >>> Jan 24 10:27:46 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Reset adapter unexpectedly >>> Jan 24 10:27:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295802883, >>> wd-timeout: 5000 jiffies: 4295809024 tx-queues: 1 >>> Jan 24 10:28:06 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:06:00.0 eth2: Reset adapter unexpectedly >>> Jan 24 10:28:10 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Detected Hardware Unit Hang: >>> >>> TDH                  <10> >>>                                                       TDT >>>     <5d>... 
>>> Jan 24 10:28:11 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295827457, >>> wd-timeout: 5000 jiffies: 4295833088 tx-queues: 1 >>> Jan 24 10:28:30 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Reset adapter unexpectedly >>> Jan 24 10:28:35 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth2 (e1000e): transmit queue 0 timed out, trans_start: 4295841678, >>> wd-timeout: 5000 jiffies: 4295847424 tx-queues: 1 >>> Jan 24 10:28:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:06:00.0 eth2: Reset adapter unexpectedly >>> Jan 24 10:28:48 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Detected Hardware Unit Hang: >>> >>> TDH                  <8> >>>                                                       TDT >>>     <55>... >>> Jan 24 10:28:49 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: >>> eth3 (e1000e): transmit queue 0 timed out, trans_start: 4295874528, >>> wd-timeout: 5000 jiffies: 4295882240 tx-queues: 1 >>> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e >>> 0000:07:00.0 eth3: Reset adapter unexpectedly >>> Jan 24 10:29:20 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Down >>> Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth2 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> Jan 24 10:29:26 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC >>> Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx >>> >>> ..... 
>>> >>> >>> [root@lf1003-e3v2-13100124-f20x64 ~]# ethtool -i eth2 >>> driver: e1000e >>> version: 3.2.6-k >>> firmware-version: 2.1-2 >>> bus-info: 0000:06:00.0 >>> supports-statistics: yes >>> supports-test: yes >>> supports-eeprom-access: yes >>> supports-register-dump: yes >>> supports-priv-flags: no >>> >>> [root@lf1003-e3v2-13100124-f20x64 ~]# lspci -vvv -s 0000:06:00.0 >>> 06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network >>> Connection >>>     Subsystem: Super Micro Computer Inc Device 0000 >>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >>> Stepping- SERR+ FastB2B- DisINTx+ >>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >>> SERR- >>     Latency: 0, Cache Line Size: 64 bytes >>>     Interrupt: pin A routed to IRQ 18 >>>     Region 0: Memory at df600000 (32-bit, non-prefetchable) [size=128K] >>>     Region 2: I/O ports at b000 [size=32] >>>     Region 3: Memory at df620000 (32-bit, non-prefetchable) [size=16K] >>>     Capabilities: [c8] Power Management version 2 >>>         Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA >>> PME(D0+,D1-,D2-,D3hot+,D3cold+) >>>         Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- >>>     Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+ >>>         Address: 0000000000000000  Data: 0000 >>>     Capabilities: [e0] Express (v1) Endpoint, MSI 00 >>>         DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s >>> <512ns, L1 <64us >>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- >>>         DevCtl:    Report errors: Correctable+ Non-Fatal+ Fatal+ >>> Unsupported+ >>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ >>>             MaxPayload 128 bytes, MaxReadReq 512 bytes >>>         DevSta:    CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ >>> TransPend- >>>         LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, >>> Exit Latency >>> L0s <128ns, L1 <64us >>>             ClockPM- Surprise- LLActRep- BwNot- 
ASPMOptComp-
>>>         LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>>>             ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>>         LnkSta:    Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>>     Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
>>>         Vector table: BAR=3 offset=00000000
>>>         PBA: BAR=3 offset=00002000
>>>     Capabilities: [100 v1] Advanced Error Reporting
>>>         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>         UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>>         CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>>>         CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>>>         AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>>>     Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-d2-06-aa
>>>     Kernel driver in use: e1000e
>>>     Kernel modules: e1000e
>>>
>>>
>>> My test is a (custom) traffic generator that is setting up 30k tcp connections
>>> between two e1000e ports and sending traffic as fast as possible.
>>> I'd be happy to help you set up this exact tool on your system(s),
>>> but we have seen similar issues with e1000e in other high-speed tests,
>>> so I don't think it is specific to this particular test.  Maybe this test
>>> makes it easier to reproduce however.
>>
>> Silly suggestion:
>> Maybe worth to try disabling TSO?
>> ethtool -K eth2 tso off
>
>
> I tried that just now...and the problem did not change.
>
> Thanks,
> Ben
>
>

The 82574L is pretty old HW; I am not sure we still support it. Do older
kernel versions also hit this problem? Can you try the latest Linus kernel?
Anyway, I suggest filing a ticket on SourceForge (https://sourceforge.net/projects/e1000/files/?source=navbar) and attaching dmesg, lspci output, and all other relevant information.
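
When comparing several of the watchdog messages quoted above, it can help to see how far past the timeout each transmit queue had stalled. The following is a minimal Python sketch, not part of the driver or of any Intel tool. It assumes the extended message format that appears in the logs in this thread (trans_start, wd-timeout, jiffies), which other kernels may not print, and the function name watchdog_stalls is made up for illustration.

```python
import re

# Assumed format: copied from the NETDEV WATCHDOG lines quoted in this thread.
# Kernels that do not print these extra fields will simply produce no matches.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG:\s+(?P<dev>\S+)\s+\([^)]*\):\s+"
    r"transmit queue\s+\d+\s+timed out,\s+"
    r"trans_start:\s+(?P<start>\d+),\s+"
    r"wd-timeout:\s+(?P<timeout>\d+)\s+"
    r"jiffies:\s+(?P<now>\d+)"
)


def watchdog_stalls(lines):
    """Yield (device, stalled, timeout) for every matching log line.

    'stalled' is the logged jiffies value minus trans_start, in whatever
    unit the log uses (assumed here to be the same unit as the timeout).
    """
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            stalled = int(match.group("now")) - int(match.group("start"))
            yield match.group("dev"), stalled, int(match.group("timeout"))
```

Fed the unwrapped log lines (for example from a saved dmesg file), the first message in the log above (trans_start 4295298499, jiffies 4295304192) yields a stall of 5693 against a timeout of 5000 for eth2.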