* [BUG] igb: reconnecting of cable not always detected
@ 2018-04-24 15:14 ` Holger Schurig
  0 siblings, 0 replies; 23+ messages in thread
From: Holger Schurig @ 2018-04-24 15:14 UTC (permalink / raw)
  To: jeffrey.t.kirsher, intel-wired-lan, linux-kernel

Hi all,

I'm on kernel 4.16.4 and have an issue with eth0, driver is igb. When I
remove the ethernet cable, this is always detected:

[    2.772360] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[    2.772363] igb: Copyright (c) 2007-2014 Intel Corporation.
[    3.023707] igb 0000:02:00.0: added PHC on eth0
[    3.023710] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
[    3.023713] igb 0000:02:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:13:95:1a:54:33
[    3.023758] igb 0000:02:00.0: eth0: PBA No: 000300-000
[    3.023762] igb 0000:02:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[    7.984921] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[   11.184593] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Down

Sometimes, plugging the cable back in is detected ...

[   43.736922] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

... but sometimes this is *NOT* detected. I can put the cable in and
even after two minutes nothing has been detected.

But when I run "rmmod igb" followed by "modprobe igb", the link is
detected again:

[  100.528609] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Down
[ 2336.583244] igb 0000:02:00.0: removed PHC on eth0
[ 2339.693521] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[ 2339.693524] igb: Copyright (c) 2007-2014 Intel Corporation.
[ 2339.990553] pps pps0: new PPS source ptp0
[ 2339.990561] igb 0000:02:00.0: added PHC on eth0
[ 2339.990565] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
[ 2339.990569] igb 0000:02:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:13:95:1a:54:33
[ 2339.990611] igb 0000:02:00.0: eth0: PBA No: 000300-000
[ 2339.990615] igb 0000:02:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[ 2343.001114] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

(In the above dmesg snippet, the Ethernet cable was inserted the whole time.)


Any tips on how I can debug this further?

PS: I already tried a different switch and also a direct connection from
device-to-device, without a switch.
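
A minimal sketch, assuming the igb link messages keep the format shown in
the dmesg excerpts above, of how the link events in a saved dmesg capture
could be listed so that a missing "Link is Up" after a "Link is Down" stands
out. The names link_events and last_state are hypothetical helpers, not part
of any existing tool.

```python
import re

# Matches igb link messages as they appear in the excerpts above, e.g.
# "[   11.184593] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Down"
LINK_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+igb\s+\S+\s+(?P<ifname>\S+):"
    r".*NIC Link is (?P<state>Up|Down)"
)


def link_events(dmesg_text):
    """Return a list of (timestamp_seconds, interface, state) tuples."""
    events = []
    for line in dmesg_text.splitlines():
        m = LINK_RE.match(line.strip())
        if m:
            events.append((float(m.group("ts")), m.group("ifname"), m.group("state")))
    return events


def last_state(events, ifname):
    """Return the most recently logged state for ifname, or None if never logged."""
    for _ts, name, state in reversed(events):
        if name == ifname:
            return state
    return None


SAMPLE = """\
[    3.023707] igb 0000:02:00.0: added PHC on eth0
[    7.984921] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[   11.184593] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Down
"""

if __name__ == "__main__":
    for ts, ifname, state in link_events(SAMPLE):
        print(f"{ts:12.6f} {ifname} {state}")
```

If the last logged state is still "Down" well after the cable was plugged
back in, the capture shows the failure case described above.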

* Re: [BUG] igb: reconnecting of cable not always detected
  2018-04-24 15:14 ` [Intel-wired-lan] " Holger Schurig
@ 2018-04-24 18:09   ` Alexander Duyck
  -1 siblings, 0 replies; 23+ messages in thread
From: Alexander Duyck @ 2018-04-24 18:09 UTC (permalink / raw)
  To: Holger Schurig; +Cc: Jeff Kirsher, intel-wired-lan, LKML

On Tue, Apr 24, 2018 at 8:14 AM, Holger Schurig <holgerschurig@gmail.com> wrote:
> Hi all,
>
> I'm on kernel 4.16.4 and have an issue with eth0, driver is igb. When I
> remove the ethernet cable, this is always detected:
>
> [    2.772360] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
> [    2.772363] igb: Copyright (c) 2007-2014 Intel Corporation.
> [    3.023707] igb 0000:02:00.0: added PHC on eth0
> [    3.023710] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
> [    3.023713] igb 0000:02:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:13:95:1a:54:33
> [    3.023758] igb 0000:02:00.0: eth0: PBA No: 000300-000
> [    3.023762] igb 0000:02:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
> [    7.984921] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> [   11.184593] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Down
>
> Sometimes, plugging the cable back in is detected ...
>
> [   43.736922] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>
> ... but sometimes this is *NOT* detected. I can put the cable in and
> even after two minutes nothing has been detected.
>
> But when I run "rmmod igb" followed by "modprobe igb", the link is
> detected again:
>
> [  100.528609] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Down
> [ 2336.583244] igb 0000:02:00.0: removed PHC on eth0
> [ 2339.693521] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
> [ 2339.693524] igb: Copyright (c) 2007-2014 Intel Corporation.
> [ 2339.990553] pps pps0: new PPS source ptp0
> [ 2339.990561] igb 0000:02:00.0: added PHC on eth0
> [ 2339.990565] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
> [ 2339.990569] igb 0000:02:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:13:95:1a:54:33
> [ 2339.990611] igb 0000:02:00.0: eth0: PBA No: 000300-000
> [ 2339.990615] igb 0000:02:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
> [ 2343.001114] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>
> (In the above dmesg snippet, the Ethernet cable was inserted the whole time.)
>
>
> Any tips on how I can debug this further?
>
> PS: I already tried a different switch and also a direct connection from
> device-to-device, without a switch.

Sounds like the link is failing to re-establish. You might double
check a few things. One is to verify if the link partner is
recognizing the link as coming up or not. That would help to tell us
if this is a problem of the driver detecting the link, or if the link
itself is not being re-established.

Another thing you could look at doing is running "ethtool -r eth0"
after plugging the cable in to see if that re-establishes the link or
not. It should be easier anyway than having to unload and reload the
driver.
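
A rough sketch of how that check could be scripted, under the assumption
that the driver's view of the link can be read from the sysfs file
/sys/class/net/<ifname>/carrier (1 for link up, 0 for link down). The
function names, timeouts, and polling interval are illustrative choices
only, and "ethtool -r" needs root privileges.

```python
import subprocess
import time
from pathlib import Path


def read_carrier(ifname, sysfs_root="/sys/class/net"):
    """Return 1 or 0 from the interface's sysfs carrier file, or None if unreadable.

    Reading the file fails with an OSError when the interface is
    administratively down or missing, so those cases are reported as None.
    """
    try:
        return int((Path(sysfs_root) / ifname / "carrier").read_text().strip())
    except (OSError, ValueError):
        return None


def wait_for_carrier(ifname, timeout_s=10.0, poll_s=0.5, sysfs_root="/sys/class/net"):
    """Poll until carrier is 1 or the timeout expires; return True if the link came up."""
    deadline = time.monotonic() + timeout_s
    while True:
        if read_carrier(ifname, sysfs_root) == 1:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def renegotiate_if_stuck(ifname, timeout_s=10.0):
    """If no carrier shows up in time, run "ethtool -r" and wait once more."""
    if wait_for_carrier(ifname, timeout_s):
        return True
    # check=False: a failing ethtool call is itself useful information here,
    # so it should not raise and hide the second carrier check.
    subprocess.run(["ethtool", "-r", ifname], check=False)
    return wait_for_carrier(ifname, timeout_s)
```

Calling renegotiate_if_stuck("eth0") right after plugging the cable in
returns True if the link came up, either by itself or after the restarted
autonegotiation, and False if it stayed down in both attempts.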

If you could also provide an "lspci -vvv" and "ethtool -i" for the
device it would help us in the debugging process as it would provide
us with information on what NIC it is you are using and what firmware
is in use on it.

Thanks.

- Alex

* Re: [BUG] igb: reconnecting of cable not always detected
  2018-04-24 18:09   ` [Intel-wired-lan] " Alexander Duyck
@ 2018-04-25  3:30     ` Richard Cochran
  -1 siblings, 0 replies; 23+ messages in thread
From: Richard Cochran @ 2018-04-25  3:30 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Holger Schurig, Jeff Kirsher, intel-wired-lan, LKML

On Tue, Apr 24, 2018 at 11:09:02AM -0700, Alexander Duyck wrote:
> On Tue, Apr 24, 2018 at 8:14 AM, Holger Schurig <holgerschurig@gmail.com> wrote:
> > Sometimes, plugging the cable back in is detected ...
> >
> > [   43.736922] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> >
> > ... but sometimes this is *NOT* detected. I can put the cable in and
> > even after two minutes nothing has been detected.
> >
> > But when I run "rmmod igb" followed by "modprobe igb", the link is
> > detected again:

FWIW, I have noticed over the past months (or even years?) that my
i210 cards (or the igb driver) also fail to detect link changes after
a few physical link interruptions.  I never bothered to try and debug
this, but it is super annoying.

Thanks,
Richard

* Re: [BUG] igb: reconnecting of cable not always detected
  2018-04-24 18:09   ` [Intel-wired-lan] " Alexander Duyck
@ 2018-04-25  9:47     ` Holger Schurig
  -1 siblings, 0 replies; 23+ messages in thread
From: Holger Schurig @ 2018-04-25  9:47 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Jeff Kirsher, intel-wired-lan, LKML

Hi Alex,

(Sent a 2nd time, this time with "Reply to all" and without HTML, so
that it hits the kernel archives as well. Sorry for the noise.)




> Sounds like the link is failing to re-establish. You might double
> check a few things. One is to verify if the link partner is
> recognizing the link as coming up or not.

It turns on differently. Before I removed the cable, the LED on the TP
LINK "TL SG-108" was green. After removing the cable, the LED went off.
After reinserting the cable, it became orange after a while.

Green LED means 1000 Mb/s, orange LED means 10/100 Mb/s.


I have a different, even older switch: "Allnet ALL8039". The same happens
here: the switch detects a link, but igb does not.



> If you could also provide an "lspci -vvv"

02:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at 90600000 (32-bit, non-prefetchable) [size=512K]
        Region 2: I/O ports at d000 [size=32]
        Region 3: Memory at 90680000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <16us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-13-95-ff-ff-1a-54-33
        Capabilities: [1a0 v1] Transaction Processing Hints
                Device specific mode supported
                Steering table in TPH capability structure
        Kernel driver in use: igb
        Kernel modules: igb

> and "ethtool -i" for the

driver: igb
version: 5.4.0-k
firmware-version: 3.20, 0x80000553
expansion-rom-version:
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes



One thing that is interesting is how igb reacts to ethtool inquiries
once it goes into the failed state. You inquired for "ethtool -i eth0",
but in the failed state I only get this:

Cannot restart autonegotiation: No such device

But eth0 is of course still there, "ip -d link show eth0" shows:


2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:13:95:1a:54:33 brd ff:ff:ff:ff:ff:ff promiscuity 0
    numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535





Other ethtool commands also don't report any information once the link
has gone bogus. Here is one output from "ethtool eth0":

Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: off (auto)
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes

... and here another:

Settings for eth0:
Cannot get device settings: No such device
Cannot get wake-on-lan settings: No such device
Cannot get message level: No such device
Cannot get link status: No such device
Settings for eth0:
No data available
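
The two kinds of output above could be told apart mechanically. The
following is only a sketch: parse_ethtool is a hypothetical helper, and it
assumes the plain "Field: value" layout seen in the outputs quoted in this
mail, not any documented machine-readable format.

```python
def parse_ethtool(output):
    """Split "ethtool <ifname>" output into (settings, errors).

    settings maps field names such as "Link detected" to the value on the
    same line; errors collects lines such as
    "Cannot get link status: No such device".
    """
    settings, errors = {}, []
    for line in output.splitlines():
        text = line.strip()
        if not text or text.startswith("Settings for"):
            continue
        if text.startswith("Cannot ") or text == "No data available":
            errors.append(text)
            continue
        key, sep, value = text.partition(":")
        if sep:  # continuation lines without a colon are skipped
            settings[key.strip()] = value.strip()
    return settings, errors


HEALTHY = """\
Settings for eth0:
        Supported ports: [ TP ]
        Speed: 1000Mb/s
        Duplex: Full
        Link detected: yes
"""

FAILED = """\
Settings for eth0:
Cannot get device settings: No such device
Cannot get wake-on-lan settings: No such device
Cannot get message level: No such device
Cannot get link status: No such device
Settings for eth0:
No data available
"""

if __name__ == "__main__":
    for name, sample in (("healthy", HEALTHY), ("failed", FAILED)):
        settings, errors = parse_ethtool(sample)
        print(name, settings.get("Link detected"), len(errors))
```

Running such a check in a loop while plugging and unplugging the cable
would show at which point the "No such device" answers start.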



I'm willing to pepper the source with printk, if this helps :-)


Greetings,
Holger

* Re: [BUG] igb: reconnecting of cable not always detected
  2018-04-25  9:47     ` [Intel-wired-lan] " Holger Schurig
@ 2018-04-25 16:01       ` Alexander Duyck
  -1 siblings, 0 replies; 23+ messages in thread
From: Alexander Duyck @ 2018-04-25 16:01 UTC (permalink / raw)
  To: Holger Schurig; +Cc: Jeff Kirsher, intel-wired-lan, LKML

On Wed, Apr 25, 2018 at 2:47 AM, Holger Schurig <holgerschurig@gmail.com> wrote:
> Hi Alex,
>
> (Sent a 2nd time, this time with "Reply to all" and without HTML, so
> that it hits the kernel archives as well. Sorry for the noise.)
>
>
>
>
>> Sounds like the link is failing to re-establish. You might double
>> check a few things. One is to verify if the link partner is
>> recognizing the link as coming up or not.
>
> It turns on differently. Before I removed the cable, the LED on the TP
> LINK "TL SG-108" was green. After removing the cable, the LED went off.
> After reinserting the cable, it became orange after a while.
>
> Green LED means 1000 Mb/s, orange LED means 10/100 Mb/s.

Was the orange LED on the igb NIC or on the TL SG-108? Based on the
comment below I am assuming it is the switch.

Based on that I am thinking we probably need to work on the PHY configuration.

> I have a different, even older switch: "Allnet ALL8039". Here the same:
> the switch detects a link, but igb not.
>
>
>
>> If you could also provide an "lspci -vvv"
>
> 02:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network
> Connection (rev 03)

Okay so we are working with an i210.

>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 64 bytes
>         Interrupt: pin A routed to IRQ 19
>         Region 0: Memory at 90600000 (32-bit, non-prefetchable) [size=512K]
>         Region 2: I/O ports at d000 [size=32]
>         Region 3: Memory at 90680000 (32-bit, non-prefetchable) [size=16K]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
>                 Address: 0000000000000000  Data: 0000
>                 Masking: 00000000  Pending: 00000000
>         Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
>                 Vector table: BAR=3 offset=00000000
>                 PBA: BAR=3 offset=00002000
>         Capabilities: [a0] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
> <512ns, L1 <64us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
> SlotPowerLimit 0.000W
>                 DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
> Unsupported+
>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> FLReset-
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+
> TransPend-
>                 LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
> Latency L0s <2us, L1 <16us
>                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+
> DLActive- BWMgmt- ABWMgmt-
>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-,
> OBFF Not Supported
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+,
> LTR-, OBFF Disabled
>                 LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance-
> SpeedDis-
>                          Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-,
> LinkEqualizationRequest-
>         Capabilities: [100 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
> NonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
> NonFatalErr+
>                 AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+
> ChkEn-
>         Capabilities: [140 v1] Device Serial Number 00-13-95-ff-ff-1a-54-33
>         Capabilities: [1a0 v1] Transaction Processing Hints
>                 Device specific mode supported
>                 Steering table in TPH capability structure
>         Kernel driver in use: igb
>         Kernel modules: igb
>
>> and "ethtool -i" for the
>
> driver: igb
> version: 5.4.0-k
> firmware-version: 3.20, 0x80000553
> expansion-rom-version:
> bus-info: 0000:02:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
>
>
> One thing that is interesting is how igb reacts to ethtool inquiries
> once it goes into the failed state. You inquired for "ethtool -i eth0",
> but in the failed state I only get this:
>
> Cannot restart autonegotiation: No such device

I assume you mean "ethtool -r" since that is what is supposed to be
restarting negotiation. The "ethtool -i" is what you provided above.

The fact that the device disappears is a bit concerning. I'm wondering
if we are somehow triggering the surprise removal code.

> But eth0 is of course still there, "ip -d link show eth0" shows:
>
>
> 2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN
> mode DEFAULT group default qlen 1000
>     link/ether 00:13:95:1a:54:33 brd ff:ff:ff:ff:ff:ff promiscuity 0
>     numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535
>
>
>
>
>
> Other ethtool commands also don't report any information once the link
> went bogus. Here one output from "ethtool eth0":
>
> Settings for eth0:
>         Supported ports: [ TP ]
>         Supported link modes:   10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
>                                 1000baseT/Full
>         Supported pause frame use: Symmetric
>         Supports auto-negotiation: Yes
>         Advertised link modes:  10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
>                                 1000baseT/Full
>         Advertised pause frame use: Symmetric
>         Advertised auto-negotiation: Yes
>         Speed: 1000Mb/s
>         Duplex: Full
>         Port: Twisted Pair
>         PHYAD: 1
>         Transceiver: internal
>         Auto-negotiation: on
>         MDI-X: off (auto)
>         Supports Wake-on: pumbg
>         Wake-on: g
>         Current message level: 0x00000007 (7)
>                                drv probe link
>         Link detected: yes
>
> ... and here another:
>
> Settings for eth0:
> Cannot get device settings: No such device
> Cannot get wake-on-lan settings: No such device
> Cannot get message level: No such device
> Cannot get link status: No such device
> Settings for eth0:
> No data available
>
>
>
> I'm willing to pepper the source with printk, if this helps :-)
>
>
> Greetings,
> Holger

Thanks. I'm suspecting we may need to instrument igb_rd32 at this
point. In order to trigger what you are seeing I am assuming the
device has been detached due to a read failure of some sort.
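
For readers following along, a detach triggered inside the register read helper would be one way to end up with "No such device". The sketch below is a user-space model of that check, not driver code: `struct fake_hw` and `fake_rd32()` are hypothetical names, and a plain array stands in for the MMIO window. It only mirrors the shape of the all-F's test in igb_rd32, where an all-ones value counts as a lost device only if register 0 also reads back as all ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the adapter: a fake register window plus a
 * flag recording that the device was declared gone. */
struct fake_hw {
	const uint32_t *regs;	/* NULL once the device is considered lost */
	size_t nregs;
	bool detached;
};

/* Model of the read path: return all ones for a lost device, and declare
 * the device lost when both this register and register 0 read all ones. */
static uint32_t fake_rd32(struct fake_hw *hw, size_t idx)
{
	uint32_t value;

	assert(hw != NULL);

	if (hw->regs == NULL || idx >= hw->nregs)
		return UINT32_MAX;

	value = hw->regs[idx];

	/* reads should not return all F's */
	if (value == UINT32_MAX &&
	    (idx == 0 || hw->regs[0] == UINT32_MAX)) {
		hw->regs = NULL;
		hw->detached = true;
	}

	return value;
}
```

A register that legitimately holds 0xffffffff does not trip the check as long as register 0 still returns something else. Once the check does trip, every later read returns all ones. In the driver the corresponding branch clears hw->hw_addr and detaches the net device, and the ethtool ioctl path returns -ENODEV for a device that is not present, which would be consistent with ethtool failing while "ip link" still lists eth0. Nothing in this thread confirms yet that this branch is the one being hit.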

Another thing you could look at doing is narrowing down the possible
factors involved. You could go through and limit phy settings and look
at possibly dropping features such as EEE if it is enabled on the
device.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] igb: reconnecting of cable not always detected
  2018-04-25 16:01       ` [Intel-wired-lan] " Alexander Duyck
@ 2018-04-26  7:54         ` Holger Schurig
  -1 siblings, 0 replies; 23+ messages in thread
From: Holger Schurig @ 2018-04-26  7:54 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Jeff Kirsher, intel-wired-lan, LKML

> Was the orange LED on the igb NIC or on the TL SG-108? Based on the
> comment below I am assuming it is the switch.

The LEDs were on the switch.

When everything works, the switch says green == 1000 Mb/s.

When the cable is disconnected, the switch doesn't light any LED.

When the cable is inserted and things fail, the switch says orange LED == 100 Mb/s.
Sometimes the insertion process works; then the switch will go, of
course, to the green LED == 1000 Mb/s.

I must admit that I didn't look at the LEDs of the device.


Now I looked there, and on the device the left+green LED is on. In the
failed case (where the last thing I see in the dmesg output is "Link is
Down"), the device still has the left+green LED on.

The right+orange LED on the device seems to indicate traffic, and it is
constantly off in the failed case.



> I assume you mean "ethtool -r" since that is what is supposed to be
> restarting negotiation. The "ethtool -i" is what you provided above.

Maybe I've edited my text too much and moved output around. Anyway, in
the failed case neither "ethtool -r eth0" nor "ethtool -i eth0" nor
"mii-tool eth0" works at all; they all emit an error.

> Thanks. I'm suspecting we may need to instrument igb_rd32 at this
> point. In order to trigger what you are seeing I am assuming the
> device has been detached due to a read failure of some sort.

I'll do that and reply later. I first need to understand this source
part :-)


> Another thing you could look at doing is narrowing down the possible
> factors involved. You could go through and limit phy settings and look
> at possibly dropping features such as EEE if it is enabled on the
> device.

I actually tried a driver patch to remove 1000 Mb/s from the driver, on
the assumption that maybe this specific hardware has a bad layout and
thus trouble (I don't really think that, because I never observed any
data transfer problem).

So, is the following patch (that didn't help) along the lines of what
you suggested?


Index: linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/igb_main.c	2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c	2018-04-24 11:35:17.420760650 +0200
@@ -2080,7 +2080,7 @@
 
 	if ((adapter->flags & IGB_FLAG_EEE) &&
 	    (!hw->dev_spec._82575.eee_disable))
-		adapter->eee_advert = MDIO_EEE_100TX | MDIO_EEE_1000T;
+		adapter->eee_advert = MDIO_EEE_100TX /* | MDIO_EEE_1000T */;
 
 	return 0;
 }
@@ -2908,7 +2908,7 @@
 	/* Initialize link properties that are user-changeable */
 	adapter->fc_autoneg = true;
 	hw->mac.autoneg = true;
-	hw->phy.autoneg_advertised = 0x2f;
+	hw->phy.autoneg_advertised = 0x0f;
 
 	hw->fc.requested_mode = e1000_fc_default;
 	hw->fc.current_mode = e1000_fc_default;
@@ -3099,7 +3099,7 @@
 			if ((!err) &&
 			    (!hw->dev_spec._82575.eee_disable)) {
 				adapter->eee_advert =
-					MDIO_EEE_100TX | MDIO_EEE_1000T;
+					MDIO_EEE_100TX /* | MDIO_EEE_1000T */;
 				adapter->flags |= IGB_FLAG_EEE;
 			}
 			break;
@@ -3110,7 +3110,7 @@
 				if ((!err) &&
 					(!hw->dev_spec._82575.eee_disable)) {
 					adapter->eee_advert =
-					   MDIO_EEE_100TX | MDIO_EEE_1000T;
+					   MDIO_EEE_100TX /* | MDIO_EEE_1000T */;
 					adapter->flags |= IGB_FLAG_EEE;
 				}
 			}
Index: linux-4.16/drivers/net/ethernet/intel/igb/igb_ethtool.c
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/igb_ethtool.c	2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/igb_ethtool.c	2018-04-24 11:42:36.737959749 +0200
@@ -170,7 +170,7 @@
 			     SUPPORTED_10baseT_Full |
 			     SUPPORTED_100baseT_Half |
 			     SUPPORTED_100baseT_Full |
-			     SUPPORTED_1000baseT_Full|
+			     /* SUPPORTED_1000baseT_Full| */ 
 			     SUPPORTED_Autoneg |
 			     SUPPORTED_TP |
 			     SUPPORTED_Pause);
@@ -3003,7 +3003,7 @@
 	    (hw->phy.media_type != e1000_media_type_copper))
 		return -EOPNOTSUPP;
 
-	edata->supported = (SUPPORTED_1000baseT_Full |
+	edata->supported = (/* SUPPORTED_1000baseT_Full | */
 			    SUPPORTED_100baseT_Full);
 	if (!hw->dev_spec._82575.eee_disable)
 		edata->advertised =
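
To make the 0x2f -> 0x0f change in the patch above easier to read, here is a small stand-alone decoder for the autoneg_advertised mask. It is only a sketch: the ADVERTISE_* values are believed to match the driver's e1000_defines.h and should be checked against the tree, and `describe_advert()` is a hypothetical helper that exists only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed to match e1000_defines.h; verify against your source tree. */
#define ADVERTISE_10_HALF	0x0001
#define ADVERTISE_10_FULL	0x0002
#define ADVERTISE_100_HALF	0x0004
#define ADVERTISE_100_FULL	0x0008
#define ADVERTISE_1000_HALF	0x0010
#define ADVERTISE_1000_FULL	0x0020

static const struct {
	uint16_t bit;
	const char *name;
} advert_modes[] = {
	{ ADVERTISE_10_HALF,	"10/Half" },
	{ ADVERTISE_10_FULL,	"10/Full" },
	{ ADVERTISE_100_HALF,	"100/Half" },
	{ ADVERTISE_100_FULL,	"100/Full" },
	{ ADVERTISE_1000_HALF,	"1000/Half" },
	{ ADVERTISE_1000_FULL,	"1000/Full" },
};

/* Hypothetical helper: write the space-separated names of all modes set
 * in mask into buf (truncating silently if buf is too small). */
static void describe_advert(uint16_t mask, char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(advert_modes) / sizeof(advert_modes[0]); i++) {
		if (!(mask & advert_modes[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, advert_modes[i].name, len - strlen(buf) - 1);
	}
}
```

With these values, 0x2f decodes to "10/Half 10/Full 100/Half 100/Full 1000/Full" and 0x0f to the same list without 1000/Full, so the patched driver should no longer offer gigabit during autonegotiation. Neither mask contains the 1000/Half bit (0x0010).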

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] igb: reconnecting of cable not always detected
  2018-04-25 16:01       ` [Intel-wired-lan] " Alexander Duyck
@ 2018-04-26  9:08         ` Holger Schurig
  -1 siblings, 0 replies; 23+ messages in thread
From: Holger Schurig @ 2018-04-26  9:08 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Jeff Kirsher, intel-wired-lan, LKML

Hi,

> Thanks. I'm suspecting we may need to instrument igb_rd32 at this
> point. In order to trigger what you are seeing I am assuming the
> device has been detached due to a read failure of some sort.

Okay, I added a printk to igb_rd32. And because no one calls this
function directly (all access goes via the rd32/array_rd32 macros) I also
added the name of the calling function to the output. This should help
greatly in tracing each hardware read back to its consumer.

Finally, I noticed that igb_update_stats() produced a lot of churn that
is most likely unrelated. So I added a helper variable to make output
from this function go away.

I installed this modified driver, rebooted, and removed / inserted the
LAN cable until the error was present.

As before, "ethtool" and "mii-tool" now said that the device is not
there, while "ip link" showed the device as present.


The full output of "journalctl -fk | grep igb" is 600 kB, so I put the
whole file on Google Drive:

https://drive.google.com/open?id=1p9cCT2d_EHnSHh29oS3AepUgFTKGFSeA



I looked at the output to see patterns, e.g. with

grep -n igb_get_cfg_done_i210 igb.error.txt
grep -n __igb_shutdown igb.error.txt
...

(and almost all other function names). I hoped to see patterns. But to
my untrained eye, nothing looked out of order.





(For reference, here is the debug patch)

Index: linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/igb_main.c	2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c	2018-04-26 10:36:09.625135952 +0200
@@ -759,7 +759,8 @@
 	}
 }
 
-u32 igb_rd32(struct e1000_hw *hw, u32 reg)
+int igb_rd32_silent = 0;
+u32 igb_rd32(const char *func, struct e1000_hw *hw, u32 reg)
 {
 	struct igb_adapter *igb = container_of(hw, struct igb_adapter, hw);
 	u8 __iomem *hw_addr = READ_ONCE(hw->hw_addr);
@@ -769,6 +770,8 @@
 		return ~value;
 
 	value = readl(&hw_addr[reg]);
+	if (!igb_rd32_silent)
+		printk("rd32 %s %08x %08x\n", func, reg, value);
 
 	/* reads should not return all F's */
 	if (!(~value) && (!reg || !(~readl(hw_addr)))) {
@@ -5935,6 +5938,7 @@
 	if (pci_channel_offline(pdev))
 		return;
 
+	igb_rd32_silent = 1;
 	bytes = 0;
 	packets = 0;
 
@@ -6100,6 +6104,7 @@
 		adapter->stats.b2ospc += rd32(E1000_B2OSPC);
 		adapter->stats.b2ogprc += rd32(E1000_B2OGPRC);
 	}
+	igb_rd32_silent = 0;
 }
 
 static void igb_tsync_interrupt(struct igb_adapter *adapter)
Index: linux-4.16/drivers/net/ethernet/intel/igb/e1000_regs.h
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/e1000_regs.h	2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/e1000_regs.h	2018-04-26 10:34:24.332157000 +0200
@@ -370,7 +370,8 @@
 
 struct e1000_hw;
 
-u32 igb_rd32(struct e1000_hw *hw, u32 reg);
+extern int igb_rd32_silent;
+u32 igb_rd32(const char *fname, struct e1000_hw *hw, u32 reg);
 
 /* write operations, indexed using DWORDS */
 #define wr32(reg, val) \
@@ -380,14 +381,14 @@
 		writel((val), &hw_addr[(reg)]); \
 } while (0)
 
-#define rd32(reg) (igb_rd32(hw, reg))
+#define rd32(reg) (igb_rd32(__func__, hw, reg))
 
 #define wrfl() ((void)rd32(E1000_STATUS))
 
 #define array_wr32(reg, offset, value) \
 	wr32((reg) + ((offset) << 2), (value))
 
-#define array_rd32(reg, offset) (igb_rd32(hw, reg + ((offset) << 2)))
+#define array_rd32(reg, offset) (igb_rd32(__func__, hw, reg + ((offset) << 2)))
 
 /* DMA Coalescing registers */
 #define E1000_PCIEMISC	0x05BB8 /* PCIE misc config register */
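
The technique in the debug patch above (a wrapper macro that passes __func__ to the read helper, plus a flag that silences one noisy caller) can be tried outside the kernel. The following is a minimal user-space sketch under that assumption; `traced_read()`, `RD()`, `read_status()`, `update_stats()` and the counters are hypothetical names, and an array stands in for the device registers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static int trace_silent;	/* non-zero: suppress trace output */
static unsigned int trace_count;	/* number of trace lines emitted */

/* Read one fake register and log "rd32 <caller> <reg> <value>". */
static uint32_t traced_read(const char *func, const uint32_t *regs,
			    uint32_t reg)
{
	uint32_t value;

	assert(func != NULL && regs != NULL);
	value = regs[reg];

	if (!trace_silent) {
		fprintf(stderr, "rd32 %s %08x %08x\n", func,
			(unsigned int)reg, (unsigned int)value);
		trace_count++;
	}

	return value;
}

/* __func__ expands in the caller, so the log names the consumer. */
#define RD(reg) traced_read(__func__, regs, (reg))

static uint32_t read_status(const uint32_t *regs)
{
	return RD(2);
}

/* Noisy periodic reader: silence tracing for the duration of the call. */
static uint32_t update_stats(const uint32_t *regs)
{
	uint32_t v;

	trace_silent = 1;
	v = RD(1);
	trace_silent = 0;

	return v;
}
```

One caveat with this design: the flag is global, so any other reader running concurrently while update_stats() is active is silenced as well. For a quick debug patch that is probably acceptable, but reads from other contexts may be missing from the log.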

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] igb: reconnecting of cable not always detected
  2018-04-26  9:08         ` [Intel-wired-lan] " Holger Schurig
@ 2018-04-26 16:02           ` Alexander Duyck
  -1 siblings, 0 replies; 23+ messages in thread
From: Alexander Duyck @ 2018-04-26 16:02 UTC (permalink / raw)
  To: Holger Schurig; +Cc: Jeff Kirsher, intel-wired-lan, LKML

On Thu, Apr 26, 2018 at 2:08 AM, Holger Schurig <holgerschurig@gmail.com> wrote:
> Hi,
>
>> Thanks. I'm suspecting we may need to instrument igb_rd32 at this
>> point. In order to trigger what you are seeing I am assuming the
>> device has been detached due to a read failure of some sort.
>
> Okay, I added a printk to igb_rd32. And because no one calls this
> function directly (all access goes via the rd32/rd32_array macro) I also
> added the output of the calling function. This should help greatly in
> identifying the read from the hardware to the consumer.
>
> Finally, I noticed that igb_update_stats() produced a lot of churn that
> most likely are unrelated. So I helper variable to make output from this
> function go away.
>
> I installed this modified driver, rebooted, and removed / inserted the
> LAN cable until the error was present.
>
> As before, "ethtool" and "mii-tool" now said that the device is not
> there, while "ip link" showed the device as present.
>
>
> The full output of "journalctl -fk | grep igb" is 600 kB. So put the
> whole file at Google Drive:
>
> https://drive.google.com/open?id=1p9cCT2d_EHnSHh29oS3AepUgFTKGFSeA
>
>
>
> I looked at the output to see patterns, e.g with
>
> grep -n igb_get_cfg_done_i210 igb.error.txt
> grep -n __igb_shutdown igb.error.txt
> ...
>
> (and almost all other function names). I hoped to see patterns. But for
> my untrained eye, things looked not out of the order.


Thanks for the data. It is actually useful. There are a few things
that I see that seem to point to an obvious issue.

The first are the following 2 lines from your dump:
Apr 26 10:42:49 kernel: igb 0000:02:00.0 eth0: igb: eth0 NIC Link is
Up 1000 Mbps Half Duplex, Flow Control: RX
Apr 26 10:42:49 kernel: igb 0000:02:00.0: EEE Disabled: unsupported at
half duplex. Re-enable using ethtool when at full duplex.

In case you aren't aware, 1000Mbps Half Duplex is not a valid combination.

The other bit that catches my attention is:
Apr 26 10:42:51 kernel: igb 0000:02:00.0: exceed max 2 second

That appears to be a timeout error triggered in response to the error
above, which I believe comes from the link not actually coming up at
1000Mbps.

As I get time I will try to look into this further. I will have to go
through the MDIC reads to figure out if there is something in there
that is providing us with bad information from the PHY or if we are
misinterpreting something.
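
For anyone who wants to check their own capture for the same symptom, here is a minimal sketch that scans log text for link-up reports claiming 1000 Mbps at half duplex. It assumes the message wording quoted above; the function name and the regular expression are illustrative and not part of igb.

```python
import re

# Matches the igb link-up message; \s+ tolerates lines that were re-wrapped by a mailer.
LINK_UP = re.compile(r"NIC Link is\s+Up\s+(\d+)\s+Mbps\s+(Full|Half)\s+Duplex")


def suspicious_link_ups(log_text):
    """Return (speed, duplex) for every link-up report at 1000 Mbps half duplex."""
    hits = []
    for match in LINK_UP.finditer(log_text):
        speed, duplex = int(match.group(1)), match.group(2)
        if speed == 1000 and duplex == "Half":
            hits.append((speed, duplex))
    return hits


sample = (
    "igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX\n"
    "igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Half Duplex, Flow Control: RX\n"
)
print(suspicious_link_ups(sample))
```

Only the second sample line is reported; the full-duplex line is ignored.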

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] igb: reconnecting of cable not always detected
  2018-04-26 16:02           ` [Intel-wired-lan] " Alexander Duyck
@ 2018-04-27 10:39             ` Holger Schurig
  -1 siblings, 0 replies; 23+ messages in thread
From: Holger Schurig @ 2018-04-27 10:39 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Jeff Kirsher, intel-wired-lan, LKML

Hi Alex,

> The first are the following 2 lines from your dump:
> Apr 26 10:42:49 kernel: igb 0000:02:00.0 eth0: igb: eth0 NIC Link is
> Up 1000 Mbps Half Duplex, Flow Control: RX
> Apr 26 10:42:49 kernel: igb 0000:02:00.0: EEE Disabled: unsupported at
> half duplex. Re-enable using ethtool when at full duplex.

Can it be the case that this is just a follow-up error?

In one of the mails from yesterday I showed you my patch to disable 1000
Mb/s ... and I still had the link-always-down problem.

The same happened when I used a 10/100 Mb/s-only switch.

Both scenarios disabled 1000 Mb/s, one more strictly than the other :-)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BUG] igb: reconnecting of cable not always detected
  2018-04-26 16:02           ` [Intel-wired-lan] " Alexander Duyck
@ 2018-05-18  7:35             ` Holger Schurig
  -1 siblings, 0 replies; 23+ messages in thread
From: Holger Schurig @ 2018-05-18  7:35 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Jeff Kirsher, intel-wired-lan, LKML

Alexander Duyck <alexander.duyck@gmail.com> writes:
> Thanks for the data. It is actually useful. There are a few things
> that I see that seem to point to an obvious issue.

Any news on this?

A colleague of mine states (I have not checked this) that a kernel
4.9.0-6-686 from a Debian Live ISO (debian-live-9.4.0-i386-kde.iso)
didn't show this behavior, so we have some kind of regression perhaps?


Greetings,
Holger

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Intel-wired-lan] [BUG] igb: reconnecting of cable not always detected
  2018-05-18  7:35             ` [Intel-wired-lan] " Holger Schurig
@ 2019-01-17 21:55               ` Jeff Kirsher
  -1 siblings, 0 replies; 23+ messages in thread
From: Jeff Kirsher @ 2019-01-17 21:55 UTC (permalink / raw)
  To: Holger Schurig; +Cc: Alexander Duyck, intel-wired-lan, LKML

On Fri, May 18, 2018 at 12:36 AM Holger Schurig <holgerschurig@gmail.com> wrote:
>
> Alexander Duyck <alexander.duyck@gmail.com> writes:
> > Thanks for the data. It is actually useful. There are a few things
> > that I see that seem to point to an obvious issue.
>
> Any news on this?
>
> A colleague of mine states (I have not checked this) that a kernel
> 4.9.0-6-686 from a Debian Live ISO (debian-live-9.4.0-i386-kde.iso)
> didn't show this behavior, so we have some kind of regression perhaps?

Our validation team was able to reproduce this only once and has not
been able to reproduce it again, let alone consistently enough to
adequately debug the issue.

Are you still seeing the issue with the latest upstream kernel from
either David Miller's net-next tree or Linus's tree?

-- 
Cheers,
Jeff

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [BUG] igb: reconnecting of cable not always detected
@ 2018-06-09 17:15 Thomas Netousek
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Netousek @ 2018-06-09 17:15 UTC (permalink / raw)
  To: linux-kernel

I have a similar problem.
If I disconnect and reconnect the ethernet cable on an Intel Ethernet
card then the device does not come up again.

For me this problem happens on the first pull of the LAN cable all the time.

It is reproducible on Supermicro X8, X9 and X10 dual CPU mainboards with
onboard networking providing two PHY interfaces using Intel 82576 and
I350 chips.
It is not reproducible on a Supermicro X10SLL single CPU mainboard with
onboard I210 chip providing one PHY for eth0 (tested) and one
I217-LM powered by the e1000e driver (not connected, not tested).

It is reproducible using kernel 4.9.107 and 4.17.0.
It is not reproducible using kernels 4.1.48 and 4.4.136.
So it might be related to the changes in the igb versions from 5.3.0-k
(good) to 5.4.0-k (bad).
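
As a rough illustration of how the tested versions narrow things down, the sketch below orders the four kernel versions listed above and picks the newest good and oldest bad one. The helper names are made up for this example. The listed kernels are stable point releases, so for an actual bisect one would presumably start from the corresponding mainline tags rather than from these exact versions.

```python
def parse_version(text):
    """Turn '4.9.107' into (4, 9, 107) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def regression_window(good, bad):
    """Return (newest good, oldest bad) from two lists of version strings."""
    newest_good = max(good, key=parse_version)
    oldest_bad = min(bad, key=parse_version)
    if parse_version(newest_good) >= parse_version(oldest_bad):
        raise ValueError("good and bad version ranges overlap")
    return newest_good, oldest_bad


window = regression_window(["4.1.48", "4.4.136"], ["4.9.107", "4.17.0"])
print(window)
```

With the versions from this report, the window runs from 4.4.136 to 4.9.107.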

After pulling and re-plugging the cable, with the bad driver I get:

# ip -d link show eth0
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state
DOWN mode DEFAULT group default qlen 1000
    link/ether 0c:c4:7a:69:9d:3e brd ff:ff:ff:ff:ff:ff promiscuity 0
numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535

# ethtool -i eth0    
Cannot get driver information: No such device
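
The combination above (interface still listed by "ip link", but "ethtool" reporting "No such device") can be checked mechanically. This is only a sketch working on captured command output, assuming the output formats shown here; the function names are illustrative.

```python
import re


def link_flags(ip_link_output):
    """Extract the flag set from the <...> part of 'ip link show' output."""
    match = re.search(r"<([A-Z0-9_,-]+)>", ip_link_output)
    return set(match.group(1).split(",")) if match else set()


def looks_detached(ip_link_output, ethtool_output):
    """True if the netdev is still listed as UP without carrier while ethtool cannot reach it."""
    flags = link_flags(ip_link_output)
    still_listed = "UP" in flags and "NO-CARRIER" in flags
    driver_gone = "No such device" in ethtool_output
    return still_listed and driver_gone


ip_out = (
    "2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state "
    "DOWN mode DEFAULT group default qlen 1000"
)
ethtool_out = "Cannot get driver information: No such device"
print(looks_detached(ip_out, ethtool_out))
```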

The last lines in the dmesg output are:

[   13.127730] igb 0000:01:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps
Full Duplex, Flow Control: RX/TX
[   13.747735] igb 0000:01:00.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps
Full Duplex, Flow Control: RX/TX
[  147.760943] igb 0000:01:00.0 eth0: igb: eth0 NIC Link is Down
[  608.211864] igb 0000:01:00.0 eth0: PCIe link lost, device now detached

Please note that the "PCIe link lost" message arrives 8 minutes after
re-plugging the LAN cable.
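
The delay can be read straight from the dmesg timestamps. The sketch below measures the gap between the "Link is Down" and "PCIe link lost" messages; it measures from the link-down message because the moment of re-plugging is not logged. The function name is illustrative, and the sample lines are the ones above with the mail wrapping undone.

```python
import re

# dmesg prefix: "[  147.760943] message"
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def seconds_between(dmesg_lines, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the next one containing second_marker."""
    first = second = None
    for line in dmesg_lines:
        match = STAMP.match(line)
        if not match:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if first is None and first_marker in message:
            first = stamp
        elif first is not None and second_marker in message:
            second = stamp
            break
    if first is None or second is None:
        raise ValueError("both markers must appear, in order")
    return second - first


lines = [
    "[  147.760943] igb 0000:01:00.0 eth0: igb: eth0 NIC Link is Down",
    "[  608.211864] igb 0000:01:00.0 eth0: PCIe link lost, device now detached",
]
gap = seconds_between(lines, "NIC Link is Down", "PCIe link lost")
print(f"{gap:.1f} s ({gap / 60:.1f} min)")
```

For the two lines above the gap is about 460 seconds, a little under 8 minutes.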

I hope that information helps pin down this bug and fix it.

Kind regards
Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

Thread overview: 23+ messages
2018-04-24 15:14 [BUG] igb: reconnecting of cable not always detected Holger Schurig
2018-04-24 15:14 ` [Intel-wired-lan] " Holger Schurig
2018-04-24 18:09 ` Alexander Duyck
2018-04-24 18:09   ` [Intel-wired-lan] " Alexander Duyck
2018-04-25  3:30   ` Richard Cochran
2018-04-25  3:30     ` [Intel-wired-lan] " Richard Cochran
2018-04-25  9:47   ` Holger Schurig
2018-04-25  9:47     ` [Intel-wired-lan] " Holger Schurig
2018-04-25 16:01     ` Alexander Duyck
2018-04-25 16:01       ` [Intel-wired-lan] " Alexander Duyck
2018-04-26  7:54       ` Holger Schurig
2018-04-26  7:54         ` [Intel-wired-lan] " Holger Schurig
2018-04-26  9:08       ` Holger Schurig
2018-04-26  9:08         ` [Intel-wired-lan] " Holger Schurig
2018-04-26 16:02         ` Alexander Duyck
2018-04-26 16:02           ` [Intel-wired-lan] " Alexander Duyck
2018-04-27 10:39           ` Holger Schurig
2018-04-27 10:39             ` [Intel-wired-lan] " Holger Schurig
2018-05-18  7:35           ` Holger Schurig
2018-05-18  7:35             ` [Intel-wired-lan] " Holger Schurig
2019-01-17 21:55             ` Jeff Kirsher
2019-01-17 21:55               ` Jeff Kirsher
2018-06-09 17:15 Thomas Netousek
