* [Intel-wired-lan] [BUG] igb: reconnecting of cable not always detected
@ 2018-04-25 16:01 ` Alexander Duyck
0 siblings, 0 replies; 22+ messages in thread
From: Alexander Duyck @ 2018-04-25 16:01 UTC (permalink / raw)
To: intel-wired-lan
On Wed, Apr 25, 2018 at 2:47 AM, Holger Schurig <holgerschurig@gmail.com> wrote:
> Hi Alex,
>
> (Sent a 2nd time, this time with "Reply to all" and without HTML, so
> that it hits the kernel archives as well. Sorry for the noise.)
>
>> Sounds like the link is failing to re-establish. You might double
>> check a few things. One is to verify if the link partner is
>> recognizing the link as coming up or not.
>
> It turns out differently. Before I removed the cable, the LED on the TP
> LINK "TL SG-108" was green. After I removed the cable, the LED went off.
> After I reinserted the cable, it became orange after a while.
>
> Green LED means 1000 Mb/s, orange LED means 10/100 Mb/s.
Was the orange LED on the igb NIC or on the TL SG-108? Based on the
comment below I am assuming it is the switch.
Based on that I am thinking we probably need to work on the PHY configuration.
> I have a different, even older switch: "Allnet ALL8039". Same here:
> the switch detects a link, but igb does not.
>
>
>
>> If you could also provide an "lspci -vvv"
>
> 02:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network
> Connection (rev 03)
Okay so we are working with an i210.
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 19
> Region 0: Memory at 90600000 (32-bit, non-prefetchable) [size=512K]
> Region 2: I/O ports at d000 [size=32]
> Region 3: Memory at 90680000 (32-bit, non-prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> Address: 0000000000000000 Data: 0000
> Masking: 00000000 Pending: 00000000
> Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> Vector table: BAR=3 offset=00000000
> PBA: BAR=3 offset=00002000
> Capabilities: [a0] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
> <512ns, L1 <64us
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
> SlotPowerLimit 0.000W
> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
> Unsupported+
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> FLReset-
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+
> TransPend-
> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
> Latency L0s <2us, L1 <16us
> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+
> DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-,
> OBFF Not Supported
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+,
> LTR-, OBFF Disabled
> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance-
> SpeedDis-
> Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete-, EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-,
> LinkEqualizationRequest-
> Capabilities: [100 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
> NonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
> NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+
> ChkEn-
> Capabilities: [140 v1] Device Serial Number 00-13-95-ff-ff-1a-54-33
> Capabilities: [1a0 v1] Transaction Processing Hints
> Device specific mode supported
> Steering table in TPH capability structure
> Kernel driver in use: igb
> Kernel modules: igb
>
>> and "ethtool -i" for the
>
> driver: igb
> version: 5.4.0-k
> firmware-version: 3.20, 0x80000553
> expansion-rom-version:
> bus-info: 0000:02:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
>
>
> One thing that is interesting is how igb reacts to ethtool inquiries
> once it goes into the failed state. You inquired for "ethtool -i eth0",
> but in the failed state I only get this:
>
> Cannot restart autonegotiation: No such device
I assume you mean "ethtool -r" since that is what is supposed to be
restarting negotiation. The "ethtool -i" is what you provided above.
The fact that the device disappears is a bit concerning. I'm wondering
if we are somehow triggering the surprise removal code.
> But eth0 is of course still there, "ip -d link show eth0" shows:
>
>
> 2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN
> mode DEFAULT group default qlen 1000
> link/ether 00:13:95:1a:54:33 brd ff:ff:ff:ff:ff:ff promiscuity 0
> numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535
>
>
>
>
>
> Other ethtool commands also don't report any information once the link
> went bogus. Here is one output from "ethtool eth0":
>
> Settings for eth0:
> Supported ports: [ TP ]
> Supported link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Supported pause frame use: Symmetric
> Supports auto-negotiation: Yes
> Advertised link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Advertised pause frame use: Symmetric
> Advertised auto-negotiation: Yes
> Speed: 1000Mb/s
> Duplex: Full
> Port: Twisted Pair
> PHYAD: 1
> Transceiver: internal
> Auto-negotiation: on
> MDI-X: off (auto)
> Supports Wake-on: pumbg
> Wake-on: g
> Current message level: 0x00000007 (7)
> drv probe link
> Link detected: yes
>
> ... and here another:
>
> Settings for eth0:
> Cannot get device settings: No such device
> Cannot get wake-on-lan settings: No such device
> Cannot get message level: No such device
> Cannot get link status: No such device
> Settings for eth0:
> No data available
>
>
>
> I'm willing to pepper the source with printk, if this helps :-)
>
>
> Greetings,
> Holger
Thanks. I'm suspecting we may need to instrument igb_rd32 at this
point. In order to trigger what you are seeing I am assuming the
device has been detached due to a read failure of some sort.
Another thing you could look at doing is narrowing down the possible
factors involved. You could go through and limit phy settings and look
at possibly dropping features such as EEE if it is enabled on the
device.
Thanks.
- Alex
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [BUG] igb: reconnecting of cable not always detected
2018-04-25 16:01 ` [Intel-wired-lan] " Alexander Duyck
@ 2018-04-26 7:54 ` Holger Schurig
-1 siblings, 0 replies; 22+ messages in thread
From: Holger Schurig @ 2018-04-26 7:54 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Jeff Kirsher, intel-wired-lan, LKML
> Was the orange LED on the igb NIC or on the TL SG-108? Based on the
> comment below I am assuming it is the switch.
The LEDs were on the switch.
When everything works, the switch says green == 1000 Mb/s.
When the cable is disconnected, the switch doesn't light any LED.
When the cable is inserted and things fail, the switch says orange LED == 100 Mb/s.
Sometimes the insertion process works; then the switch will go, of
course, to the green LED == 1000 Mb/s.
I must admit that I didn't look at the LEDs of the device.
Now I looked there: on the device, the left (green) LED is on. In the
failed case (that is, when the last thing I see in the dmesg output is
"Link is Down"), the device still has its left (green) LED on.
The right+orange LED on the device seems to indicate traffic, and it is
constantly off in the failed case.
> I assume you mean "ethtool -r" since that is what is supposed to be
> restarting negotiation. The "ethtool -i" is what you provided above.
Maybe I've edited my text too much and moved output along. Anyway, in
the failed case neither "ethtool -r eth0" nor "ethtool -i eth0" nor
"mii-tool eth0" works at all; they all emit an error.
> Thanks. I'm suspecting we may need to instrument igb_rd32 at this
> point. In order to trigger what you are seeing I am assuming the
> device has been detached due to a read failure of some sort.
I'll do that and reply later. I first need to understand this source
part :-)
> Another thing you could look at doing is narrowing down the possible
> factors involved. You could go through and limit phy settings and look
> at possibly dropping features such as EEE if it is enabled on the
> device.
I actually tried a driver patch to remove 1000 Mb/s from the driver, on
the assumption that maybe this specific hardware has a bad layout and
thus trouble (I don't really think that, because I never observed any
data transfer problem).
So, is the following patch (which didn't help) along the lines of what
you suggested?
Index: linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/igb_main.c 2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c 2018-04-24 11:35:17.420760650 +0200
@@ -2080,7 +2080,7 @@
if ((adapter->flags & IGB_FLAG_EEE) &&
(!hw->dev_spec._82575.eee_disable))
- adapter->eee_advert = MDIO_EEE_100TX | MDIO_EEE_1000T;
+ adapter->eee_advert = MDIO_EEE_100TX /* | MDIO_EEE_1000T */;
return 0;
}
@@ -2908,7 +2908,7 @@
/* Initialize link properties that are user-changeable */
adapter->fc_autoneg = true;
hw->mac.autoneg = true;
- hw->phy.autoneg_advertised = 0x2f;
+ hw->phy.autoneg_advertised = 0x0f;
hw->fc.requested_mode = e1000_fc_default;
hw->fc.current_mode = e1000_fc_default;
@@ -3099,7 +3099,7 @@
if ((!err) &&
(!hw->dev_spec._82575.eee_disable)) {
adapter->eee_advert =
- MDIO_EEE_100TX | MDIO_EEE_1000T;
+ MDIO_EEE_100TX /* | MDIO_EEE_1000T */;
adapter->flags |= IGB_FLAG_EEE;
}
break;
@@ -3110,7 +3110,7 @@
if ((!err) &&
(!hw->dev_spec._82575.eee_disable)) {
adapter->eee_advert =
- MDIO_EEE_100TX | MDIO_EEE_1000T;
+ MDIO_EEE_100TX /* | MDIO_EEE_1000T */;
adapter->flags |= IGB_FLAG_EEE;
}
}
Index: linux-4.16/drivers/net/ethernet/intel/igb/igb_ethtool.c
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/igb_ethtool.c 2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/igb_ethtool.c 2018-04-24 11:42:36.737959749 +0200
@@ -170,7 +170,7 @@
SUPPORTED_10baseT_Full |
SUPPORTED_100baseT_Half |
SUPPORTED_100baseT_Full |
- SUPPORTED_1000baseT_Full|
+ /* SUPPORTED_1000baseT_Full| */
SUPPORTED_Autoneg |
SUPPORTED_TP |
SUPPORTED_Pause);
@@ -3003,7 +3003,7 @@
(hw->phy.media_type != e1000_media_type_copper))
return -EOPNOTSUPP;
- edata->supported = (SUPPORTED_1000baseT_Full |
+ edata->supported = (/* SUPPORTED_1000baseT_Full | */
SUPPORTED_100baseT_Full);
if (!hw->dev_spec._82575.eee_disable)
edata->advertised =
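To make the 0x2f -> 0x0f change in the patch above easier to read, here is a minimal userspace sketch that decodes such an advertisement mask. The bit values are the ones I recall from the driver's e1000_defines.h, so treat them as an assumption, and describe_advertised() is a hypothetical helper, not driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit values as recalled from e1000_defines.h (assumption, please verify). */
#define ADVERTISE_10_HALF   0x0001
#define ADVERTISE_10_FULL   0x0002
#define ADVERTISE_100_HALF  0x0004
#define ADVERTISE_100_FULL  0x0008
#define ADVERTISE_1000_HALF 0x0010
#define ADVERTISE_1000_FULL 0x0020

/*
 * Hypothetical helper: writes the names of the modes set in mask into
 * buf, separated by spaces, and returns how many names were written in
 * full. If buf is too small, the partial name is dropped and decoding stops.
 */
int describe_advertised(uint16_t mask, char *buf, size_t len)
{
    static const struct {
        uint16_t bit;
        const char *name;
    } modes[] = {
        { ADVERTISE_10_HALF,   "10/Half"   },
        { ADVERTISE_10_FULL,   "10/Full"   },
        { ADVERTISE_100_HALF,  "100/Half"  },
        { ADVERTISE_100_FULL,  "100/Full"  },
        { ADVERTISE_1000_HALF, "1000/Half" },
        { ADVERTISE_1000_FULL, "1000/Full" },
    };
    size_t used = 0;
    int count = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
        int n;

        if (!(mask & modes[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     count ? " " : "", modes[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* drop the partial name and stop */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

With a 64-byte buffer, 0x2f decodes to the four 10/100 modes plus 1000/Full, and 0x0f to the four 10/100 modes only, which is what the patch intends.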
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [BUG] igb: reconnecting of cable not always detected
2018-04-25 16:01 ` [Intel-wired-lan] " Alexander Duyck
@ 2018-04-26 9:08 ` Holger Schurig
-1 siblings, 0 replies; 22+ messages in thread
From: Holger Schurig @ 2018-04-26 9:08 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Jeff Kirsher, intel-wired-lan, LKML
Hi,
> Thanks. I'm suspecting we may need to instrument igb_rd32 at this
> point. In order to trigger what you are seeing I am assuming the
> device has been detached due to a read failure of some sort.
Okay, I added a printk to igb_rd32. And because no one calls this
function directly (all access goes via the rd32/array_rd32 macros) I also
added the name of the calling function to the output. This should help
greatly in tracing each hardware read back to its caller.
Finally, I noticed that igb_update_stats() produced a lot of output that
is most likely unrelated. So I added a helper variable to make output
from this function go away.
I installed this modified driver, rebooted, and removed / inserted the
LAN cable until the error was present.
As before, "ethtool" and "mii-tool" now said that the device is not
there, while "ip link" showed the device as present.
The full output of "journalctl -fk | grep igb" is 600 kB, so I put the
whole file on Google Drive:
https://drive.google.com/open?id=1p9cCT2d_EHnSHh29oS3AepUgFTKGFSeA
I looked at the output to see patterns, e.g. with
grep -n igb_get_cfg_done_i210 igb.error.txt
grep -n __igb_shutdown igb.error.txt
...
(and almost all other function names). I hoped to see patterns. But to
my untrained eye, nothing looked out of order.
(For reference, here is the debug patch)
Index: linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/igb_main.c 2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c 2018-04-26 10:36:09.625135952 +0200
@@ -759,7 +759,8 @@
}
}
-u32 igb_rd32(struct e1000_hw *hw, u32 reg)
+int igb_rd32_silent = 0;
+u32 igb_rd32(const char *func, struct e1000_hw *hw, u32 reg)
{
struct igb_adapter *igb = container_of(hw, struct igb_adapter, hw);
u8 __iomem *hw_addr = READ_ONCE(hw->hw_addr);
@@ -769,6 +770,8 @@
return ~value;
value = readl(&hw_addr[reg]);
+ if (!igb_rd32_silent)
+ printk("rd32 %s %08x %08x\n", func, reg, value);
/* reads should not return all F's */
if (!(~value) && (!reg || !(~readl(hw_addr)))) {
@@ -5935,6 +5938,7 @@
if (pci_channel_offline(pdev))
return;
+ igb_rd32_silent = 1;
bytes = 0;
packets = 0;
@@ -6100,6 +6104,7 @@
adapter->stats.b2ospc += rd32(E1000_B2OSPC);
adapter->stats.b2ogprc += rd32(E1000_B2OGPRC);
}
+ igb_rd32_silent = 0;
}
static void igb_tsync_interrupt(struct igb_adapter *adapter)
Index: linux-4.16/drivers/net/ethernet/intel/igb/e1000_regs.h
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/e1000_regs.h 2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/e1000_regs.h 2018-04-26 10:34:24.332157000 +0200
@@ -370,7 +370,8 @@
struct e1000_hw;
-u32 igb_rd32(struct e1000_hw *hw, u32 reg);
+extern int igb_rd32_silent;
+u32 igb_rd32(const char *fname, struct e1000_hw *hw, u32 reg);
/* write operations, indexed using DWORDS */
#define wr32(reg, val) \
@@ -380,14 +381,14 @@
writel((val), &hw_addr[(reg)]); \
} while (0)
-#define rd32(reg) (igb_rd32(hw, reg))
+#define rd32(reg) (igb_rd32(__func__, hw, reg))
#define wrfl() ((void)rd32(E1000_STATUS))
#define array_wr32(reg, offset, value) \
wr32((reg) + ((offset) << 2), (value))
-#define array_rd32(reg, offset) (igb_rd32(hw, reg + ((offset) << 2)))
+#define array_rd32(reg, offset) (igb_rd32(__func__, hw, reg + ((offset) << 2)))
/* DMA Coalescing registers */
#define E1000_PCIEMISC 0x05BB8 /* PCIE misc config register */
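The context lines of the igb_rd32() hunk above show the "reads should not return all F's" check. As a rough userspace model of that logic (a sketch only: struct fake_hw and model_rd32() are hypothetical names, and a plain array stands in for the MMIO window), the idea is that an all-ones read is treated as a lost device only if register 0 also reads back as all ones:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the mapped register window of the NIC. */
struct fake_hw {
    const uint32_t *regs;   /* NULL models an already detached device */
    size_t nregs;
};

/*
 * Models the check visible in the igb_rd32() hunk: returns the value
 * read and sets *lost when the read looks like a vanished device.
 */
uint32_t model_rd32(struct fake_hw *hw, size_t idx, bool *lost)
{
    uint32_t value = 0;

    assert(hw != NULL && lost != NULL);
    *lost = false;

    /* Once detached, every read returns all ones without touching hardware. */
    if (hw->regs == NULL) {
        *lost = true;
        return ~value;
    }

    assert(idx < hw->nregs);
    value = hw->regs[idx];

    /* All F's here AND in register 0: treat the device as gone. */
    if (!(~value) && (idx == 0 || !(~hw->regs[0]))) {
        hw->regs = NULL;
        *lost = true;
    }
    return value;
}
```

If I read the driver correctly, the real function also detaches the netdev at this point, which would explain "No such device" from ethtool while "ip link" still lists eth0. The debug output would then show a read of ffffffff shortly before the failure, so grepping the log for that value may be worthwhile.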
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [BUG] igb: reconnecting of cable not always detected
2018-04-26 9:08 ` [Intel-wired-lan] " Holger Schurig
@ 2018-04-26 16:02 ` Alexander Duyck
-1 siblings, 0 replies; 22+ messages in thread
From: Alexander Duyck @ 2018-04-26 16:02 UTC (permalink / raw)
To: Holger Schurig; +Cc: Jeff Kirsher, intel-wired-lan, LKML
On Thu, Apr 26, 2018 at 2:08 AM, Holger Schurig <holgerschurig@gmail.com> wrote:
> Hi,
>
>> Thanks. I'm suspecting we may need to instrument igb_rd32 at this
>> point. In order to trigger what you are seeing I am assuming the
>> device has been detached due to a read failure of some sort.
>
> Okay, I added a printk to igb_rd32. And because no one calls this
> function directly (all access goes via the rd32/array_rd32 macros) I also
> added the name of the calling function to the output. This should help
> greatly in tracing each hardware read back to its caller.
>
> Finally, I noticed that igb_update_stats() produced a lot of output that
> is most likely unrelated. So I added a helper variable to make output
> from this function go away.
>
> I installed this modified driver, rebooted, and removed / inserted the
> LAN cable until the error was present.
>
> As before, "ethtool" and "mii-tool" now said that the device is not
> there, while "ip link" showed the device as present.
>
>
> The full output of "journalctl -fk | grep igb" is 600 kB, so I put the
> whole file on Google Drive:
>
> https://drive.google.com/open?id=1p9cCT2d_EHnSHh29oS3AepUgFTKGFSeA
>
>
>
> I looked at the output to see patterns, e.g. with
>
> grep -n igb_get_cfg_done_i210 igb.error.txt
> grep -n __igb_shutdown igb.error.txt
> ...
>
> (and almost all other function names). I hoped to see patterns. But to
> my untrained eye, nothing looked out of order.
Thanks for the data. It is actually useful. There are a few things
that I see that seem to point to an obvious issue.
The first is the following two lines from your dump:
Apr 26 10:42:49 kernel: igb 0000:02:00.0 eth0: igb: eth0 NIC Link is
Up 1000 Mbps Half Duplex, Flow Control: RX
Apr 26 10:42:49 kernel: igb 0000:02:00.0: EEE Disabled: unsupported at
half duplex. Re-enable using ethtool when at full duplex.
In case you aren't aware, 1000Mbps Half Duplex is not a valid combination.
The other bit that catches my attention is:
Apr 26 10:42:51 kernel: igb 0000:02:00.0: exceed max 2 second
That appears to be a timeout error triggered in response to the error
above, which I believe is that the link didn't actually come up at
1000Mbps.
As I get time I will try to look into this further. I will have to go
through the MDIC reads to figure out if there is something in there
that is providing us with bad information from the PHY or if we are
misinterpreting something.
Thanks.
- Alex
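For anyone who wants to scan the 600 kB log for the odd speed/duplex combination mentioned above, here is a minimal sketch. It parses lines in the "rd32 %s %08x %08x" format printed by the debug patch and flags STATUS register reads that report link up at 1000 Mb/s without the full-duplex bit. The register offset and bit values are the ones I recall from the driver headers, so they are an assumption; parse_rd32() and is_gigabit_half() are hypothetical helper names. It only looks at the MAC STATUS register and does not decode the MDIC (PHY) reads.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Offset and bits as recalled from e1000_regs.h / e1000_defines.h (assumption). */
#define E1000_STATUS            0x00008
#define E1000_STATUS_FD         0x00000001  /* full duplex */
#define E1000_STATUS_LU         0x00000002  /* link up */
#define E1000_STATUS_SPEED_100  0x00000040
#define E1000_STATUS_SPEED_1000 0x00000080

struct rd32_sample {
    char func[64];      /* calling function, as printed via __func__ */
    uint32_t reg;       /* register offset */
    uint32_t value;     /* value read */
};

/* Finds "rd32 <func> <reg> <value>" anywhere in a log line and parses it. */
bool parse_rd32(const char *line, struct rd32_sample *out)
{
    const char *p;

    assert(line != NULL && out != NULL);
    p = strstr(line, "rd32 ");
    if (p == NULL)
        return false;
    return sscanf(p, "rd32 %63s %" SCNx32 " %" SCNx32,
                  out->func, &out->reg, &out->value) == 3;
}

/* True for a STATUS read with link up, 1000 Mb/s set, full duplex clear. */
bool is_gigabit_half(const struct rd32_sample *s)
{
    assert(s != NULL);
    if (s->reg != E1000_STATUS)
        return false;
    return (s->value & E1000_STATUS_LU) &&
           (s->value & E1000_STATUS_SPEED_1000) &&
           !(s->value & E1000_STATUS_FD);
}
```

Feeding each line of igb.error.txt through parse_rd32() and printing those for which is_gigabit_half() is true should show which caller saw the suspicious STATUS value, and when.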
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [BUG] igb: reconnecting of cable not always detected
2018-04-26 16:02 ` [Intel-wired-lan] " Alexander Duyck
@ 2018-04-27 10:39 ` Holger Schurig
-1 siblings, 0 replies; 22+ messages in thread
From: Holger Schurig @ 2018-04-27 10:39 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Jeff Kirsher, intel-wired-lan, LKML
Hi Alex,
> The first are the following 2 lines from your dump:
> Apr 26 10:42:49 kernel: igb 0000:02:00.0 eth0: igb: eth0 NIC Link is
> Up 1000 Mbps Half Duplex, Flow Control: RX
> Apr 26 10:42:49 kernel: igb 0000:02:00.0: EEE Disabled: unsupported at
> half duplex. Re-enable using ethtool when at full duplex.
Can it be the case that this is just a follow-up error?
In one of the mails from yesterday I showed you my patch to disable 1000
Mb/s ... and I still had the link-always-down problem.
The same happened when I used a 10/100 Mb/s-only switch.
Both scenarios disabled 1000 Mb/s, one more strictly than the other :-)
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [BUG] igb: reconnecting of cable not always detected
2018-04-26 16:02 ` [Intel-wired-lan] " Alexander Duyck
@ 2018-05-18 7:35 ` Holger Schurig
-1 siblings, 0 replies; 22+ messages in thread
From: Holger Schurig @ 2018-05-18 7:35 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Jeff Kirsher, intel-wired-lan, LKML
Alexander Duyck <alexander.duyck@gmail.com> writes:
> Thanks for the data. It is actually useful. There are a few things
> that I see that seem to point to an obvious issue.
Any news on this?
A colleague of mine states (I have not checked this) that a kernel
4.9.0-6-686 from a Debian Live ISO (debian-live-9.4.0-i386-kde.iso)
didn't show this behavior, so we have some kind of regression perhaps?
Greetings,
Holger
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-wired-lan] [BUG] igb: reconnecting of cable not always detected
2018-05-18 7:35 ` [Intel-wired-lan] " Holger Schurig
@ 2019-01-17 21:55 ` Jeff Kirsher
-1 siblings, 0 replies; 22+ messages in thread
From: Jeff Kirsher @ 2019-01-17 21:55 UTC (permalink / raw)
To: Holger Schurig; +Cc: Alexander Duyck, intel-wired-lan, LKML
On Fri, May 18, 2018 at 12:36 AM Holger Schurig <holgerschurig@gmail.com> wrote:
>
> Alexander Duyck <alexander.duyck@gmail.com> writes:
> > Thanks for the data. It is actually useful. There are a few things
> > that I see that seem to point to an obvious issue.
>
> Any news on this?
>
> A colleague of mine states (I have not checked this) that a kernel
> 4.9.0-6-686 from a Debian Live ISO (debian-live-9.4.0-i386-kde.iso)
> didn't show this behavior, so we have some kind of regression perhaps?
Our validation team was able to reproduce this only once, and has not
been able to reproduce it again, let alone consistently enough to
debug it adequately.
Are you still seeing the issue with the latest upstream kernel from
either David Miller's net-next tree or Linus's tree?
--
Cheers,
Jeff
^ permalink raw reply [flat|nested] 22+ messages in thread