linux-kernel.vger.kernel.org archive mirror
* Kernel regression introduced by "e1000e: Do not write lsc to ics in msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt"
@ 2016-11-01 23:56 Jack Suter
  2016-11-02  3:25 ` Benjamin Poirier
  2016-11-02 21:19 ` Brown, Aaron F
  0 siblings, 2 replies; 6+ messages in thread
From: Jack Suter @ 2016-11-01 23:56 UTC
  To: jeffrey.t.kirsher
  Cc: intel-wired-lan, bpoirier, aaron.f.brown, jhodzic, linux-kernel

Hi there,

I have some servers with an 82574L based NIC and recently upgraded from
a 4.4 series kernel to 4.7. Upon doing so, servers with this chipset
have begun frequently reporting "Link is Down" and "Link is Up"
messages. No other related network errors are reported by the kernel or
e1000e driver. I saw some reports about using "ethtool -s $iface msglvl
6" to reveal more information, but nothing extra was reported.

Some testing showed that this was introduced between the 4.4 and 4.5
series. I was able to further narrow it down to two commits that look
related:

 e1000e: Do not write lsc to ics in msi-x mode
 (a61cfe4ffad7864a07e0c74969ca7ceb77ab2f1f)
 e1000e: Do not read ICR in Other interrupt
 (16ecba59bc333d6282ee057fb02339f77a880beb)

Reverting these two commits resolves the Link is Down/Link is Up
messages. This has been tested on about six servers so far and all have
stopped reporting these link flaps.

In total I have about ten servers that are frequently seeing this issue,
and a couple dozen more triggering it sporadically.

This is about the extent of my troubleshooting knowledge so far. I am
happy to test code changes and provide any additional information as
necessary. While I do not understand what specifically causes the link
flaps, they reliably begin occurring on the affected servers within a
couple hours of boot.

A snip of one such instance is below.

Thank you for any assistance troubleshooting this.

Kind regards,

Jack Suter

# ethtool -i enp2s0
driver: e1000e
version: 3.2.6-k
firmware-version: 2.1-2
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

[ 3532.745587] e1000e: enp2s0 NIC Link is Down
[ 3532.771461] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15463.117592] e1000e: enp2s0 NIC Link is Down
[15463.119419] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15469.155922] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15648.196579] e1000e: enp2s0 NIC Link is Down
[15651.405310] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15728.959981] e1000e: enp2s0 NIC Link is Down
[15729.000625] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15835.132034] e1000e: enp2s0 NIC Link is Down
[15835.185222] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15839.104020] e1000e: enp2s0 NIC Link is Down
[15839.142346] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15845.142287] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[16401.940127] e1000e: enp2s0 NIC Link is Down
[16401.945106] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[16408.121843] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[17025.823220] e1000e: enp2s0 NIC Link is Down
[17025.825473] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[17032.100202] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
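
A minimal sketch, not part of the original report, for summarizing a log
like the one above. It counts "Link is Down" messages, measures how long
each reported outage lasted, and counts "Link is Up" messages that have
no preceding "Down". The function name flap_summary and its output
format are illustrative assumptions; it also assumes each message sits
on one line with the default "[seconds.micros]" dmesg timestamp.

```shell
# Hypothetical helper: summarize e1000e link flaps from dmesg text on stdin.
flap_summary() {
    awk '
        # Turn a leading "[ 3532.745587]" timestamp into a number of seconds.
        function ts(line,    s) {
            s = line
            sub(/^\[ */, "", s)
            sub(/\].*$/, "", s)
            return s + 0
        }
        /NIC Link is Down/ { down = ts($0); pending = 1; downs++; next }
        /NIC Link is Up/ {
            if (pending) {
                d = ts($0) - down          # time the link was reported down
                total += d; pairs++
                if (d > max) max = d
                pending = 0
            } else {
                extra_ups++                # "Up" with no preceding "Down"
            }
        }
        END {
            printf "downs=%d extra_ups=%d\n", downs, extra_ups
            if (pairs > 0) { printf "mean_down_s=%.3f max_down_s=%.3f\n", total / pairs, max }
        }
    '
}
```

Possible usage: `dmesg | grep e1000e | flap_summary`. Note that the very
first "Link is Up" after boot would also be counted under extra_ups.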

* Re: Kernel regression introduced by "e1000e: Do not write lsc to ics in msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt"
  2016-11-01 23:56 Kernel regression introduced by "e1000e: Do not write lsc to ics in msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt" Jack Suter
@ 2016-11-02  3:25 ` Benjamin Poirier
  2016-11-02 21:19 ` Brown, Aaron F
  1 sibling, 0 replies; 6+ messages in thread
From: Benjamin Poirier @ 2016-11-02  3:25 UTC
  To: Jack Suter
  Cc: jeffrey.t.kirsher, intel-wired-lan, aaron.f.brown, jhodzic, linux-kernel

On 2016/11/01 19:56, Jack Suter wrote:
> Hi there,
> 
> I have some servers with an 82574L based NIC and recently upgraded from
> a 4.4 series kernel to 4.7. Upon doing so, servers with this chipset
> have begun frequently reporting "Link is Down" and "Link is Up"
> messages. No other related network errors are reported by the kernel or
> e1000e driver. I saw some reports about using "ethtool -s $iface msglvl
> 6" to reveal more information, but nothing extra was reported.
> 
> Some testing showed that this was introduced between the 4.4 and 4.5
> series. I was able to further narrow it down to two commits that look
> related:
> 
>  e1000e: Do not write lsc to ics in msi-x mode
>  (a61cfe4ffad7864a07e0c74969ca7ceb77ab2f1f)
>  e1000e: Do not read ICR in Other interrupt
>  (16ecba59bc333d6282ee057fb02339f77a880beb)

I'm just about to get on a plane, but I'll be able to look at this on
Monday. Two guesses:
1) Something other than LSC triggers the "other" interrupt. Even if
that is so, however, it should not cause e1000e_check_for_copper_link
to report link down.
2) The link down events are real, but some LSC interrupts were not
processed properly prior to this patchset, causing the events to be
lost/ignored.

* RE: Kernel regression introduced by "e1000e: Do not write lsc to ics in msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt"
  2016-11-01 23:56 Kernel regression introduced by "e1000e: Do not write lsc to ics in msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt" Jack Suter
  2016-11-02  3:25 ` Benjamin Poirier
@ 2016-11-02 21:19 ` Brown, Aaron F
  2016-11-03  8:41   ` Neftin, Sasha
                     ` (2 more replies)
  1 sibling, 3 replies; 6+ messages in thread
From: Brown, Aaron F @ 2016-11-02 21:19 UTC
  To: Jack Suter, Kirsher, Jeffrey T
  Cc: intel-wired-lan, bpoirier, jhodzic, linux-kernel

> From: Jack Suter [mailto:jack@suter.io]
> Sent: Tuesday, November 1, 2016 4:57 PM
> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>
> Cc: intel-wired-lan@lists.osuosl.org; bpoirier@suse.com; Brown, Aaron F
> <aaron.f.brown@intel.com>; jhodzic@ucdavis.edu; linux-
> kernel@vger.kernel.org
> Subject: Kernel regression introduced by "e1000e: Do not write lsc to ics in
> msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt"
> 
> Hi there,
> 
> I have some servers with an 82574L based NIC and recently upgraded from
> a 4.4 series kernel to 4.7. Upon doing so, servers with this chipset
> have begun frequently reporting "Link is Down" and "Link is Up"
> messages. No other related network errors are reported by the kernel or
> e1000e driver. I saw some reports about using "ethtool -s $iface msglvl
> 6" to reveal more information, but nothing extra was reported.
> 
> Some testing showed that this was introduced between the 4.4 and 4.5
> series. I was able to further narrow it down to two commits that look
> related:
> 
>  e1000e: Do not write lsc to ics in msi-x mode
>  (a61cfe4ffad7864a07e0c74969ca7ceb77ab2f1f)
>  e1000e: Do not read ICR in Other interrupt
>  (16ecba59bc333d6282ee057fb02339f77a880beb)

I did not notice any link flapping when I tested those patches; I would have rejected them if I had.  I have several systems with 82574L LOMs and as yet am not able to reproduce a link flap with recent upstream kernels/drivers (net-next 4.8.0 on one and 4.9.0-rc3 on another).

One of those systems is dedicated to a kernel regression setup. I checked the test logs from it and am not seeing any evidence of flaps in the 4.4 through 4.6 range either.

> 
> Reverting these two commits resolves the Link is Down/Link is Up
> messages. This has been tested on about six servers so far and all have
> stopped reporting these link flaps.

Are you able to revert either of the patches independently?  I don't recall whether they were standalone or not.

> 
> In total I have about ten servers that are frequently seeing this issue,
> and a couple dozen more triggering it sporadically.

Are they all 82574L or does it affect others?

> 
> This is about the extent of my troubleshooting knowledge so far. I am
> happy to test code changes and provide any additional information as
> necessary. While I do not understand what specifically causes the link
> flaps, they reliably begin occurring on the affected servers within a
> couple hours of boot.

Is there any particular traffic pattern involved?  Sitting idle, moderate use, heavy constant flow? 

> 
> A snip of one such instance is below.
> 
> Thank you for any assistance troubleshooting this.

Which kernel tree are you using?  Linus's upstream kernel from kernel.org, a distribution-provided one, or something else?  I'm generally working off of David Miller's net-next, but can try to repro the issue on my boxes if I know the exact kernel to work from.

Perhaps a power saving state trying to kick in?  Bad cables or speed/duplex mismatches are common causes of link flap, but they seem unlikely given that reverting the patches resolves the issue.

Those patches are interrupt related. What kind of interrupts are in use?  What is interrupt moderation (coalescing) set to?  What is the link partner?  Is it the same type of switch for all problem machines, or a mix?

cat /proc/interrupts
ethtool -c enp2s0

Maybe an `lspci` dump could help shed some more light.

> 
> Kind regards,
> 
> Jack Suter
> 
> # ethtool -i enp2s0
> driver: e1000e
> version: 3.2.6-k
> firmware-version: 2.1-2
> bus-info: 0000:02:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
> 
> [ 3532.745587] e1000e: enp2s0 NIC Link is Down
> [ 3532.771461] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15463.117592] e1000e: enp2s0 NIC Link is Down
> [15463.119419] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15469.155922] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15648.196579] e1000e: enp2s0 NIC Link is Down
> [15651.405310] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15728.959981] e1000e: enp2s0 NIC Link is Down
> [15729.000625] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15835.132034] e1000e: enp2s0 NIC Link is Down
> [15835.185222] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15839.104020] e1000e: enp2s0 NIC Link is Down
> [15839.142346] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15845.142287] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [16401.940127] e1000e: enp2s0 NIC Link is Down
> [16401.945106] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [16408.121843] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [17025.823220] e1000e: enp2s0 NIC Link is Down
> [17025.825473] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [17032.100202] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx

* RE: Kernel regression introduced by "e1000e: Do not write lsc to ics in msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt"
  2016-11-02 21:19 ` Brown, Aaron F
@ 2016-11-03  8:41   ` Neftin, Sasha
  2016-11-03 11:48   ` Jack Suter
  2016-11-08  6:52   ` Benjamin Poirier
  2 siblings, 0 replies; 6+ messages in thread
From: Neftin, Sasha @ 2016-11-03  8:41 UTC
  To: Brown, Aaron F, Jack Suter, Kirsher, Jeffrey T
  Cc: bpoirier, jhodzic, intel-wired-lan, linux-kernel, Avargil,
	Raanan, Ruinskiy, Dima, Neftin, Sasha, Duyck, Alexander H


Hello,
We have not been able to reproduce this problem in our labs either. We tested an X99 server platform with an 82574L NIC and the 4.8.0 kernel.
You wrote that you have several servers with this issue. Which platforms do you use? Is there a specific platform or link partner configuration? It would be interesting to know whether you experience the problem with stable 4.8.4 or mainline 4.9-rc3.
Sasha

* Re: Kernel regression introduced by "e1000e: Do not write lsc to ics in msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt"
  2016-11-02 21:19 ` Brown, Aaron F
  2016-11-03  8:41   ` Neftin, Sasha
@ 2016-11-03 11:48   ` Jack Suter
  2016-11-08  6:52   ` Benjamin Poirier
  2 siblings, 0 replies; 6+ messages in thread
From: Jack Suter @ 2016-11-03 11:48 UTC
  To: Brown, Aaron F, Kirsher, Jeffrey T, sasha.neftin
  Cc: intel-wired-lan, bpoirier, jhodzic, linux-kernel

> > Reverting these two commits resolves the Link is Down/Link is Up
> > messages. This has been tested on about six servers so far and all have
> > stopped reporting these link flaps.
> 
> Are you able to revert either of the patches independently, I don't
> recall if they were stand alone or not.

I can try this shortly.
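
A rough sketch, not from the original thread, of how the two reverts
could be tested one at a time: each commit gets its own branch off the
same base, with only that commit reverted. The function name
revert_plan and the branch names revert-test-N are hypothetical; the
function only prints the git commands so they can be reviewed before
being run. Whether each revert applies cleanly on its own is not known
here, since one patch may build on the other.

```shell
# Hypothetical helper: print (do not run) one branch + revert per commit.
# Usage: revert_plan BASE_REF SHA [SHA...]
revert_plan() {
    base=$1
    shift
    n=0
    for sha in "$@"; do
        n=$((n + 1))
        # Start a fresh branch from the same base for every candidate.
        printf 'git checkout -b revert-test-%d %s\n' "$n" "$base"
        # Revert only this one commit on that branch.
        printf 'git revert --no-edit %s\n' "$sha"
    done
}
```

For example, `revert_plan <base-ref> a61cfe4ffad7864a07e0c74969ca7ceb77ab2f1f
16ecba59bc333d6282ee057fb02339f77a880beb` prints four commands, where
<base-ref> stands for whatever ref the affected kernel is built from.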

> Are they all 82574L or does it affect others?

All are 82574L. Every server with an 82574L has seen a flap at least
once in the past two weeks, since kernel upgrades began, though some
flap much more frequently than others.

Except for one, the affected servers are all HP DL120 G7s. The NIC has
firmware 2.1-2 as reported by `ethtool -i`.

The exception is a Supermicro server with NIC firmware 1.8-0. Link flaps
occur most frequently on this server: 3963 instances, compared to at
most 429 on an HP. It also has more network/disk activity than the
affected HPs.

> Is there any particular traffic pattern involved?  Sitting idle, moderate
> use, heavy constant flow? 

All are being used as file servers, so heavy network traffic and disk
I/O can be expected at times. `vnstat -d` shows daily averages of
100 - 200 Mbit/s on these servers. The Supermicro averages closer to
300 Mbit/s.

> Which kernel tree are you using?  Linus's upstream kernel from
> kernel.org, a distribution provided one or?  I'm generally working off of
> David Miller's net-next, but can try to repro the issue on my boxes if I
> know the exact kernel to work from.

I'm using a Gentoo Hardened kernel; specifically 4.7.9. It follows
grsecurity's patch so a 4.8 / 4.9 kernel is not available yet. 

> Perhaps a power saving state trying to kick in?  Bad cables or
> speed/duplex mismatches are common causes of link flap, but they seem
> unlikely given reverting the patches resolves the issue.

I'm not aware of any power save settings that should be trying to kick
in but I can investigate this angle further if you think it may be
related.

One of the HP servers was upgraded to (Gentoo Hardened) 4.5.7 back in
August and began experiencing these flaps shortly after. At the time it
was one of only a few servers on a 4.5+ series kernel and the first to
experience this issue, so it was treated as a physical layer issue. No
interface errors were seen switch-side[1], but the network cable was
replaced regardless. The link flaps on that server still continued. 

[1] As reported to me. I am not sure if the switch saw the link flaps
occurring.

> Those patches are interrupt related, what kind of interrupts are in use? 
> What is interrupt moderation (coalescing set to)?  What is the link
> partner?  Same type switch for all problem machines or a mix?
> 
> cat /proc/interrupts
> ethtool -c enp2s0

Mostly the same type of switch; either Juniper EX3200 or EX3300. All
single connections to the switch, no LACP or anything fancy.

`cat /proc/interrupts` from two HP servers are below. One server is
still experiencing flaps; the other was rebooted ~30 hours ago into the
patched kernel. I can provide /proc/interrupts for the Supermicro server
too, but there isn't a similar server to compare it to. It also has many
more CPUs so its output is a bit messier.
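
A small sketch, not part of the original message, for pulling just the
NIC rows out of /proc/interrupts and summing them across CPUs, which
makes two servers easier to compare. The function name irq_totals is
hypothetical. It assumes the first input line is the CPU header row and
that the last field of each row is the vector name. The row named after
the bare interface (no -rx-0 or -tx-0 suffix) is assumed here to be the
MSI-X "other" vector that the two commits touch.

```shell
# Hypothetical helper: per-vector interrupt totals for one interface.
# Usage: irq_totals IFACE < /proc/interrupts
irq_totals() {
    awk -v ifc="$1" '
        NR == 1 { ncpu = NF; next }      # header row: CPU0 CPU1 ...
        index($NF, ifc) == 1 {           # vector name starts with IFACE
            sum = 0
            for (i = 2; i <= ncpu + 1; i++) sum += $i
            # %.0f avoids integer clamping of large counts in some awks.
            printf "%s %.0f\n", $NF, sum
        }
    '
}
```

The total for the bare interface row can then be set against the number
of "Link is Down" messages (for example from `dmesg | grep -c 'Link is
Down'`) to see whether that vector fires more often than the link is
reported to change.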

From ethtool -c; all other values are zero and available in full below.
Applies to both HP and Supermicro.
    Adaptive RX: off  TX: off
    rx-usecs: 3
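
A minimal sketch, not from the original message, of how a summary like
the one above could be produced: it keeps the "Adaptive" line and any
"name: number" line whose value is not zero. The function name
nonzero_coalesce is hypothetical, and the sketch assumes the usual
one-setting-per-line layout of `ethtool -c` output.

```shell
# Hypothetical helper: show only non-zero coalescing settings.
# Usage: ethtool -c enp2s0 | nonzero_coalesce
nonzero_coalesce() {
    awk '
        /^Adaptive/ { print; next }      # keep the adaptive RX/TX line
        # Keep "name: value" rows whose numeric value is non-zero.
        NF == 2 && $1 ~ /:$/ && $2 ~ /^[0-9]+$/ && $2 + 0 != 0 { print }
    '
}
```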

> maybe an `lspci` dump could help shed some more light.

From the HPs:
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor
Family DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200/2nd Generation Core
Processor Family PCI Express Root Port (rev 09)
00:06.0 PCI bridge: Intel Corporation Xeon E3-1200/2nd Generation Core
Processor Family PCI Express Root Port (rev 09)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset
Family USB Enhanced Host Controller #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset
Family PCI Express Root Port 1 (rev b5)
00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset
Family PCI Express Root Port 5 (rev b5)
00:1c.5 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset
Family PCI Express Root Port 6 (rev b5)
00:1c.6 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset
Family PCI Express Root Port 7 (rev b5)
00:1c.7 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset
Family PCI Express Root Port 8 (rev b5)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset
Family USB Enhanced Host Controller #1 (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation C204 Chipset Family LPC Controller
(rev 05)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset
Family SATA AHCI Controller (rev 05)
01:00.0 System peripheral: Hewlett-Packard Company Integrated Lights-Out
Standard Slave Instrumentation & System Support (rev 05)
01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA
G200EH
01:00.2 System peripheral: Hewlett-Packard Company Integrated Lights-Out
Standard Management Processor Support and Messaging (rev 05)
01:00.4 USB controller: Hewlett-Packard Company Integrated Lights-Out
Standard Virtual USB Controller (rev 02)
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
Connection
03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
Connection

And the Supermicro: 
00:00.0 Host bridge: Intel Corporation 5500 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express
Root Port 1 (rev 22)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express
Root Port 3 (rev 22)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express
Root Port 7 (rev 22)
00:09.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI
Express Root Port 9 (rev 22)
00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC
Interrupt Controller (rev 22)
00:14.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub System
Management Registers (rev 22)
00:14.1 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and
Scratch Pad Registers (rev 22)
00:14.2 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status
and RAS Registers (rev 22)
00:14.3 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle
Registers (rev 22)
00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #4
00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #5
00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #6
00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2
EHCI Controller #2
00:1c.0 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express
Root Port 1
00:1c.4 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express
Root Port 5
00:1c.5 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express
Root Port 6
00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #1
00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #2
00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #3
00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2
EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface
Controller
00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA
AHCI Controller
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
05:00.0 RAID bus controller: 3ware Inc 9650SE SATA-II RAID PCIe (rev 01)
06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
Connection
07:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
Connection
08:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA
G200eW WPCM450 (rev 0a)
fe:00.0 Host bridge: Intel Corporation Xeon 5600 Series QuickPath
Architecture Generic Non-core Registers (rev 02)
fe:00.1 Host bridge: Intel Corporation Xeon 5600 Series QuickPath
Architecture System Address Decoder (rev 02)
fe:02.0 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 0 (rev
02)
fe:02.1 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 0
(rev 02)
fe:02.2 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link
0 (rev 02)
fe:02.3 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link
1 (rev 02)
fe:02.4 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 1 (rev
02)
fe:02.5 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 1
(rev 02)
fe:03.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Registers (rev 02)
fe:03.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Target Address Decoder (rev 02)
fe:03.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller RAS Registers (rev 02)
fe:03.4 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Test Registers (rev 02)
fe:04.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Control (rev 02)
fe:04.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Address (rev 02)
fe:04.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Rank (rev 02)
fe:04.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Thermal Control (rev 02)
fe:05.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Control (rev 02)
fe:05.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Address (rev 02)
fe:05.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Rank (rev 02)
fe:05.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Thermal Control (rev 02)
fe:06.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Control (rev 02)
fe:06.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Address (rev 02)
fe:06.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Rank (rev 02)
fe:06.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Thermal Control (rev 02)
ff:00.0 Host bridge: Intel Corporation Xeon 5600 Series QuickPath
Architecture Generic Non-core Registers (rev 02)
ff:00.1 Host bridge: Intel Corporation Xeon 5600 Series QuickPath
Architecture System Address Decoder (rev 02)
ff:02.0 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 0 (rev
02)
ff:02.1 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 0
(rev 02)
ff:02.2 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link
0 (rev 02)
ff:02.3 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link
1 (rev 02)
ff:02.4 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 1 (rev
02)
ff:02.5 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 1
(rev 02)
ff:03.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Registers (rev 02)
ff:03.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Target Address Decoder (rev 02)
ff:03.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller RAS Registers (rev 02)
ff:03.4 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Test Registers (rev 02)
ff:04.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Control (rev 02)
ff:04.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Address (rev 02)
ff:04.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Rank (rev 02)
ff:04.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Thermal Control (rev 02)
ff:05.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Control (rev 02)
ff:05.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Address (rev 02)
ff:05.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Rank (rev 02)
ff:05.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Thermal Control (rev 02)
ff:06.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Control (rev 02)
ff:06.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Address (rev 02)
ff:06.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Rank (rev 02)
ff:06.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Thermal Control (rev 02)


From an HP server without the two reverted commits, still experiencing
flaps:
# dmesg | grep 'Link is Down' | wc -l
160
# cat /proc/interrupts 
           CPU0       CPU1       
  0:         71          0   IO-APIC   2-edge      timer
  1:          9          0   IO-APIC   1-edge      i8042
  8:         26          0   IO-APIC   8-edge      rtc0
  9:          0          0   IO-APIC   9-fasteoi   acpi
 12:          5          0   IO-APIC  12-edge      i8042
 16:        101          0   IO-APIC  16-fasteoi   uhci_hcd:usb3
 20:         29          0   IO-APIC  20-fasteoi   ehci_hcd:usb2
 21:         31          0   IO-APIC  21-fasteoi   ehci_hcd:usb1
 26:  466035195          0   PCI-MSI 512000-edge      ahci[0000:00:1f.2]
 27: 4011416578          0   PCI-MSI 1048576-edge      enp2s0-rx-0
 28: 2635120533          0   PCI-MSI 1048577-edge      enp2s0-tx-0
 29:      21247          0   PCI-MSI 1048578-edge      enp2s0
NMI:      32827      13374   Non-maskable interrupts
LOC:  639865868  608834533   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:      32827      13374   Performance monitoring interrupts
IWI:          6          0   IRQ work interrupts
RTR:          0          0   APIC ICR read retries
RES:   53178810  784807944   Rescheduling interrupts
CAL:      47602      16104   Function call interrupts
TLB:   14655054    5994312   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
DFR:          0          0   Deferred Error APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:       3134       3134   Machine check polls
ERR:          0
MIS:          0
PIN:          0          0   Posted-interrupt notification event
PIW:          0          0   Posted-interrupt wakeup event

From an HP server that was previously affected but now has the patched
kernel:
# cat /proc/interrupts 
           CPU0       CPU1       
  0:         27          0   IO-APIC   2-edge      timer
  1:          9          0   IO-APIC   1-edge      i8042
  8:         63          0   IO-APIC   8-edge      rtc0
  9:          0          0   IO-APIC   9-fasteoi   acpi
 12:          5          0   IO-APIC  12-edge      i8042
 16:          0          0   IO-APIC  16-fasteoi   uhci_hcd:usb3
 20:         29          0   IO-APIC  20-fasteoi   ehci_hcd:usb2
 21:         31          0   IO-APIC  21-fasteoi   ehci_hcd:usb1
 26:   10222204          0   PCI-MSI 512000-edge      ahci[0000:00:1f.2]
 27:  260871340          0   PCI-MSI 1048576-edge      enp2s0-rx-0
 28:  320328246          0   PCI-MSI 1048577-edge      enp2s0-tx-0
 29:          2          0   PCI-MSI 1048578-edge      enp2s0
NMI:       1023        520   Non-maskable interrupts
LOC:   55824119   46253516   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:       1023        520   Performance monitoring interrupts
IWI:          4          0   IRQ work interrupts
RTR:          0          0   APIC ICR read retries
RES:     963280   23369703   Rescheduling interrupts
CAL:        711        450   Function call interrupts
TLB:     104153      57497   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
DFR:          0          0   Deferred Error APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:        381        381   Machine check polls
ERR:          0
MIS:          0
PIN:          0          0   Posted-interrupt notification event
PIW:          0          0   Posted-interrupt wakeup event
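
For anyone comparing the two dumps above, here is a minimal sketch (not
part of the original report) that pulls the per-CPU counts for one named
vector out of /proc/interrupts text. The helper name irq_counts and the
abbreviated sample strings are assumptions made for illustration; the
numbers are copied from the dumps above. The two hosts have different
uptimes, so the sketch also scales the "Other" vector (enp2s0) against
the rx-0 vector as a rough, unofficial normalization.

```python
def irq_counts(text, label):
    """Return per-CPU counts for the /proc/interrupts row whose last field is `label`."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[-1] == label:
            # fields[0] is the IRQ number ("29:"); the counts follow it
            return [int(x) for x in fields[1:1 + ncpu]]
    raise KeyError(label)


# Abbreviated copies of the two dumps (only the e1000e vectors are kept).
UNPATCHED = """
           CPU0       CPU1
 27: 4011416578          0   PCI-MSI 1048576-edge      enp2s0-rx-0
 28: 2635120533          0   PCI-MSI 1048577-edge      enp2s0-tx-0
 29:      21247          0   PCI-MSI 1048578-edge      enp2s0
"""

PATCHED = """
           CPU0       CPU1
 27:  260871340          0   PCI-MSI 1048576-edge      enp2s0-rx-0
 28:  320328246          0   PCI-MSI 1048577-edge      enp2s0-tx-0
 29:          2          0   PCI-MSI 1048578-edge      enp2s0
"""

for name, dump in (("unpatched", UNPATCHED), ("patched", PATCHED)):
    other = sum(irq_counts(dump, "enp2s0"))
    rx = sum(irq_counts(dump, "enp2s0-rx-0"))
    # "Other" interrupts per million rx interrupts; a crude uptime-independent ratio
    print(f"{name}: other={other} rx={rx} per_million_rx={other / rx * 1e6:.4f}")
```

On a live system the same function can be fed the contents of
/proc/interrupts read as a string.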

# ethtool -c enp2s0
Coalesce parameters for enp2s0:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 3
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

* Re: Kernel regression introduced by "e1000e: Do not write lsc to ics in msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt"
  2016-11-02 21:19 ` Brown, Aaron F
  2016-11-03  8:41   ` Neftin, Sasha
  2016-11-03 11:48   ` Jack Suter
@ 2016-11-08  6:52   ` Benjamin Poirier
  2 siblings, 0 replies; 6+ messages in thread
From: Benjamin Poirier @ 2016-11-08  6:52 UTC (permalink / raw)
  To: Brown, Aaron F
  Cc: Jack Suter, Kirsher, Jeffrey T, intel-wired-lan, jhodzic, linux-kernel

On 2016/11/02 21:19, Brown, Aaron F wrote:
> > From: Jack Suter [mailto:jack@suter.io]
> > Sent: Tuesday, November 1, 2016 4:57 PM
> > To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>
> > Cc: intel-wired-lan@lists.osuosl.org; bpoirier@suse.com; Brown, Aaron F
> > <aaron.f.brown@intel.com>; jhodzic@ucdavis.edu; linux-
> > kernel@vger.kernel.org
> > Subject: Kernel regression introduced by "e1000e: Do not write lsc to ics in
> > msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt"
> > 
> > Hi there,
> > 
> > I have some servers with an 82574L based NIC and recently upgraded from
> > a 4.4 series kernel to 4.7. Upon doing so, servers with this chipset
> > have begun frequently reporting "Link is Down" and "Link is Up"
> > messages. No other related network errors are reported by the kernel or
> > e1000e driver. I saw some reports about using "ethtool -s $iface msglvl
> > 6" to reveal more information, but nothing extra was reported.
> > 
> > Some testing showed that this was introduced between the 4.4 and 4.5
> > series. I was able to further narrow it down to two commits that look
> > related:
> > 
> >  e1000e: Do not write lsc to ics in msi-x mode
> >  (a61cfe4ffad7864a07e0c74969ca7ceb77ab2f1f)
> >  e1000e: Do not read ICR in Other interrupt
> >  (16ecba59bc333d6282ee057fb02339f77a880beb)
> 
> I did not notice any link flapping when I tested those patches; I would
> have rejected them if I had. I have several systems with 82574L LOMs and
> as yet am not able to reproduce a link flap with recent upstream
> kernels/drivers (net-next 4.8.0 on one and 4.9.0-rc3 on another).
> 
> One of those systems is dedicated to a kernel regression setup. I checked
> the test logs from it and am not seeing any evidence of flaps in the 4.4
> through 4.6 range either.
> 
> > 
> > Reverting these two commits resolves the Link is Down/Link is Up
> > messages. This has been tested on about six servers so far and all have
> > stopped reporting these link flaps.
> 
> Are you able to revert either of the patches independently? I don't
> recall whether they were standalone or not.

From what I recall, the series is entirely bisectable. I tested again
just now and could do a netperf RR test after applying each commit
sequentially.

Thread overview: 6+ messages
2016-11-01 23:56 Kernel regression introduced by "e1000e: Do not write lsc to ics in msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt" Jack Suter
2016-11-02  3:25 ` Benjamin Poirier
2016-11-02 21:19 ` Brown, Aaron F
2016-11-03  8:41   ` Neftin, Sasha
2016-11-03 11:48   ` Jack Suter
2016-11-08  6:52   ` Benjamin Poirier
