[bisected regression] e1000e: "Detected Hardware Unit Hang"

* [bisected regression] e1000e: "Detected Hardware Unit Hang"
@ 2015-01-14 15:32 Thomas Jarosch
  2015-01-14 17:20 ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Jarosch @ 2015-01-14 15:32 UTC (permalink / raw)
  To: 'Linux Netdev List'; +Cc: Eric Dumazet, Jeff Kirsher, e1000-devel

Hello,

after updating a good bunch of production level machines
from kernel 3.4.101 to kernel 3.14.25, a few of them started
to show serious trouble when there was a lot of network traffic.

---------------------------------------------------------------
Jan 14 11:14:57 intrartc kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
Jan 14 11:14:57 intrartc kernel:  TDH                  <3b>
Jan 14 11:14:57 intrartc kernel:  TDT                  <76>
Jan 14 11:14:57 intrartc kernel:  next_to_use          <76>
Jan 14 11:14:57 intrartc kernel:  next_to_clean        <31>
Jan 14 11:14:57 intrartc kernel: buffer_info[next_to_clean]:
Jan 14 11:14:57 intrartc kernel:  time_stamp           <ffff328c>
Jan 14 11:14:57 intrartc kernel:  next_to_watch        <3b>
Jan 14 11:14:57 intrartc kernel:  jiffies              <ffff33b9>
Jan 14 11:14:57 intrartc kernel:  next_to_watch.status <0>
Jan 14 11:14:57 intrartc kernel: MAC Status             <40080083>
Jan 14 11:14:57 intrartc kernel: PHY Status             <796d>
Jan 14 11:14:57 intrartc kernel: PHY 1000BASE-T Status  <3800>
Jan 14 11:14:57 intrartc kernel: PHY Extended Status    <3000>
Jan 14 11:14:57 intrartc kernel: PCI Status             <10>
Jan 14 11:14:59 intrartc kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
..
---------------------------------------------------------------

All of those troubled machines use an Intel DH61CR board and
are driven by the e1000e driver. Kernels 3.7.0 to 3.19-rc4 are affected.

The problem vanishes when you disable TSO. This is the
recommended "solution" on serverfault and others.
http://ehc.ac/p/e1000/bugs/378/
http://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang

I have a test setup that can trigger the problem within seconds
and bisected it down to this commit (hi Eric!):
---------------------------------------------------------------
commit 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Sep 26 06:46:57 2012 +0000

    net: use bigger pages in __netdev_alloc_frag

    We currently use percpu order-0 pages in __netdev_alloc_frag
    to deliver fragments used by __netdev_alloc_skb()

    Depending on NIC driver and arch being 32 or 64 bit, it allows a page to
    be split in several fragments (between 1 and 8), assuming PAGE_SIZE=4096

    Switching to bigger pages (32768 bytes for PAGE_SIZE=4096 case) allows :

    - Better filling of space (the ending hole overhead is less an issue)

    - Less calls to page allocator or accesses to page->_count

    - Could allow struct skb_shared_info futures changes without major
    performance impact.

    This patch implements a transparent fallback to smaller
    pages in case of memory pressure.

    It also uses a standard "struct page_frag" instead of a custom one.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Alexander Duyck <alexander.h.duyck@intel.com>
    Cc: Benjamin LaHaise <bcrl@kvack.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
---------------------------------------------------------------

Reverting the commit f.e. in kernel 3.7.0  solves the issue.
I've done some more tests:

    3.18.0 32bit + PAE: broken
    3.6.0 32bit + PAE: works
    3.7.0 32bit + PAE: broken
    3.7.0 32bit + PAE + revert 69b08f62e17439ee3d436faf0b9a7ca6fffb78db -> works

    3.7.0 32bit (without PAE) -> broken
    3.7.0 32bit + "GFP_COMP" flag removed in __netdev_alloc_frag(): broken
    3.7.0 32bit + "GFP_COMP" flag replaced with
                              "GFP_DMA" in __netdev_alloc_frag(): works!
    3.7.0 32bit + "GFP_COMP" flag + "GFP_DMA" flag: broken
    3.19-rc4 32bit: broken

The problem is triggered only when the traffic is forwarded to another client.
(this client is behind NAT). Generating traffic directly
on the system did not trigger the issue.

To me it looks like Eric's change uncovered a memory allocation
issue in the e1000e driver: It probably uses a memory address
unsuitable for DMA or so. This is just a guess though.

Funny fact: I have another Intel DH61CR board that does not show the problem.
I've borrowed (...) the mainboard from one affected box for my bisect test setup.

Please CC: comments. Thanks.

Best regards,
Thomas

^ permalink raw reply	[flat|nested] 26+ messages in thread