netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* lan78xx: /sys/class/net/eth0/carrier stuck at 1
@ 2020-10-21 15:00 Juerg Haefliger
  2020-10-21 19:35 ` Andrew Lunn
  0 siblings, 1 reply; 7+ messages in thread
From: Juerg Haefliger @ 2020-10-21 15:00 UTC (permalink / raw)
  To: netdev, woojung.huh

[-- Attachment #1: Type: text/plain, Size: 637 bytes --]

Hi,

If the lan78xx driver is compiled into the kernel and the network cable is
plugged in at boot, /sys/class/net/eth0/carrier is stuck at 1 and doesn't
toggle if the cable is unplugged and replugged.

If the network cable is *not* plugged in at boot, all seems to work fine.
I.e., post-boot cable plugs and unplugs toggle the carrier flag.

Also, everything seems to work fine if the driver is compiled as a module.

There's an older ticket for the raspi kernel [1] but I've just tested this
with a 5.8 kernel on a Pi 3B+ and still see that behavior.

...Juerg

[1] https://github.com/raspberrypi/firmware/issues/1100

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: lan78xx: /sys/class/net/eth0/carrier stuck at 1
  2020-10-21 15:00 lan78xx: /sys/class/net/eth0/carrier stuck at 1 Juerg Haefliger
@ 2020-10-21 19:35 ` Andrew Lunn
  2020-10-23  6:29   ` Juerg Haefliger
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Lunn @ 2020-10-21 19:35 UTC (permalink / raw)
  To: Juerg Haefliger; +Cc: netdev, woojung.huh

On Wed, Oct 21, 2020 at 05:00:53PM +0200, Juerg Haefliger wrote:
> Hi,
> 
> If the lan78xx driver is compiled into the kernel and the network cable is
> plugged in at boot, /sys/class/net/eth0/carrier is stuck at 1 and doesn't
> toggle if the cable is unplugged and replugged.
> 
> If the network cable is *not* plugged in at boot, all seems to work fine.
> I.e., post-boot cable plugs and unplugs toggle the carrier flag.
> 
> Also, everything seems to work fine if the driver is compiled as a module.
> 
> There's an older ticket for the raspi kernel [1] but I've just tested this
> with a 5.8 kernel on a Pi 3B+ and still see that behavior.

Hi Jürg

Could you check if a different PHY driver is being used when it is
built and broken vs module or built in and working.

Look at /sys/class/net/eth0/phydev/driver

I'm wondering if in the builtin case, it is using genphy, but with
modules it uses a more specific vendor driver.

	Andrew

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: lan78xx: /sys/class/net/eth0/carrier stuck at 1
  2020-10-21 19:35 ` Andrew Lunn
@ 2020-10-23  6:29   ` Juerg Haefliger
  2020-10-23 13:05     ` Andrew Lunn
  0 siblings, 1 reply; 7+ messages in thread
From: Juerg Haefliger @ 2020-10-23  6:29 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Juerg Haefliger, netdev, woojung.huh

[-- Attachment #1: Type: text/plain, Size: 1309 bytes --]

On Wed, 21 Oct 2020 21:35:48 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> On Wed, Oct 21, 2020 at 05:00:53PM +0200, Juerg Haefliger wrote:
> > Hi,
> > 
> > If the lan78xx driver is compiled into the kernel and the network cable is
> > plugged in at boot, /sys/class/net/eth0/carrier is stuck at 1 and doesn't
> > toggle if the cable is unplugged and replugged.
> > 
> > If the network cable is *not* plugged in at boot, all seems to work fine.
> > I.e., post-boot cable plugs and unplugs toggle the carrier flag.
> > 
> > Also, everything seems to work fine if the driver is compiled as a module.
> > 
> > There's an older ticket for the raspi kernel [1] but I've just tested this
> > with a 5.8 kernel on a Pi 3B+ and still see that behavior.  
> 
> Hi Jürg

Hi Andrew,


> Could you check if a different PHY driver is being used when it is
> built and broken vs module or built in and working.
> 
> Look at /sys/class/net/eth0/phydev/driver

There's no such file.

 
> I'm wondering if in the builtin case, it is using genphy, but with
> modules it uses a more specific vendor driver.

Given that all works fine as long as the cable is unplugged at boot points
more towards a race at boot or incorrect initialization sequence or something.


...Juerg
 
> 	Andrew


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: lan78xx: /sys/class/net/eth0/carrier stuck at 1
  2020-10-23  6:29   ` Juerg Haefliger
@ 2020-10-23 13:05     ` Andrew Lunn
  2020-11-03 12:47       ` Juerg Haefliger
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Lunn @ 2020-10-23 13:05 UTC (permalink / raw)
  To: Juerg Haefliger; +Cc: netdev, woojung.huh

On Fri, Oct 23, 2020 at 08:29:59AM +0200, Juerg Haefliger wrote:
> On Wed, 21 Oct 2020 21:35:48 +0200
> Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > On Wed, Oct 21, 2020 at 05:00:53PM +0200, Juerg Haefliger wrote:
> > > Hi,
> > > 
> > > If the lan78xx driver is compiled into the kernel and the network cable is
> > > plugged in at boot, /sys/class/net/eth0/carrier is stuck at 1 and doesn't
> > > toggle if the cable is unplugged and replugged.
> > > 
> > > If the network cable is *not* plugged in at boot, all seems to work fine.
> > > I.e., post-boot cable plugs and unplugs toggle the carrier flag.
> > > 
> > > Also, everything seems to work fine if the driver is compiled as a module.
> > > 
> > > There's an older ticket for the raspi kernel [1] but I've just tested this
> > > with a 5.8 kernel on a Pi 3B+ and still see that behavior.  
> > 
> > Hi Jürg
> 
> Hi Andrew,
> 
> 
> > Could you check if a different PHY driver is being used when it is
> > built and broken vs module or built in and working.
> > 
> > Look at /sys/class/net/eth0/phydev/driver
> 
> There's no such file.

I _think_ that means it is using genphy, the generic PHY driver, not a
specific vendor PHY driver? What does

/sys/class/net/eth0/phydev/phy_id contain.
  
> Given that all works fine as long as the cable is unplugged at boot points
> more towards a race at boot or incorrect initialization sequence or something.

Could be. Could you run

mii-tool -vv eth0

in the good and bad case.

   Andrew


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: lan78xx: /sys/class/net/eth0/carrier stuck at 1
  2020-10-23 13:05     ` Andrew Lunn
@ 2020-11-03 12:47       ` Juerg Haefliger
  2020-11-03 15:27         ` Andrew Lunn
  0 siblings, 1 reply; 7+ messages in thread
From: Juerg Haefliger @ 2020-11-03 12:47 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Juerg Haefliger, netdev, woojung.huh

[-- Attachment #1: Type: text/plain, Size: 4199 bytes --]

On Fri, 23 Oct 2020 15:05:19 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> On Fri, Oct 23, 2020 at 08:29:59AM +0200, Juerg Haefliger wrote:
> > On Wed, 21 Oct 2020 21:35:48 +0200
> > Andrew Lunn <andrew@lunn.ch> wrote:
> >   
> > > On Wed, Oct 21, 2020 at 05:00:53PM +0200, Juerg Haefliger wrote:  
> > > > Hi,
> > > > 
> > > > If the lan78xx driver is compiled into the kernel and the network cable is
> > > > plugged in at boot, /sys/class/net/eth0/carrier is stuck at 1 and doesn't
> > > > toggle if the cable is unplugged and replugged.
> > > > 
> > > > If the network cable is *not* plugged in at boot, all seems to work fine.
> > > > I.e., post-boot cable plugs and unplugs toggle the carrier flag.
> > > > 
> > > > Also, everything seems to work fine if the driver is compiled as a module.
> > > > 
> > > > There's an older ticket for the raspi kernel [1] but I've just tested this
> > > > with a 5.8 kernel on a Pi 3B+ and still see that behavior.    
> > > 
> > > Hi Jürg  
> > 
> > Hi Andrew,
> > 
> >   
> > > Could you check if a different PHY driver is being used when it is
> > > built and broken vs module or built in and working.
> > > 
> > > Look at /sys/class/net/eth0/phydev/driver  
> > 
> > There's no such file.  
> 
> I _think_ that means it is using genphy, the generic PHY driver, not a
> specific vendor PHY driver? What does
> 
> /sys/class/net/eth0/phydev/phy_id contain.

There is no directory /sys/class/net/eth0/phydev.

$ ls /sys/class/net/eth0/
addr_assign_type  broadcast        carrier_down_count  dev_port  duplex             ifalias  link_mode         netdev_group  phys_port_name  proto_down  statistics    type
addr_len          carrier          carrier_up_count    device    flags              ifindex  mtu               operstate     phys_switch_id  queues      subsystem     uevent
address           carrier_changes  dev_id              dormant   gro_flush_timeout  iflink   name_assign_type  phys_port_id  power           speed       tx_queue_len


> > Given that all works fine as long as the cable is unplugged at boot points
> > more towards a race at boot or incorrect initialization sequence or something.  
> 
> Could be. Could you run
> 
> mii-tool -vv eth0

Hrm. Running that command unlocks the carrier flag and it starts toggling on
cable unplug/plug. First invocation:

$ sudo mii-tool -vv eth0
Using SIOCGMIIPHY=0x8947
eth0: negotiated 1000baseT-FD flow-control, link ok
  registers for MII PHY 1: 
    1040 79ed 0007 c132 05e1 cde1 000f 0000
    0000 0200 0800 0000 0000 0000 0000 3000
    0000 0000 0088 0000 0000 0000 3200 0004
    0040 a000 a000 0000 a035 0000 0000 0000
  product info: vendor 00:01:f0, model 19 rev 2
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

Subsequent invocation:

$ sudo mii-tool -vv eth0
Using SIOCGMIIPHY=0x8947
eth0: negotiated 1000baseT-FD flow-control, link ok
  registers for MII PHY 1: 
    1040 79ed 0007 c132 05e1 cde1 000d 0000
    0000 0200 0800 0000 0000 0000 0000 3000
    0000 0000 0088 0000 0000 0000 3200 0004
    0040 a000 0000 0000 a035 0000 0000 0000
  product info: vendor 00:01:f0, model 19 rev 2
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

In the first invocation, register 0x1a shows a pending link-change interrupt
(0xa000) which wasn't serviced (and cleared) for some reason. Dumping the
registers cleared that interrupt bit and things start working correctly
afterwards. Nor sure yet why that first interrupt is ignored.

...Juerg

 
> in the good and bad case.
> 
>    Andrew
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: lan78xx: /sys/class/net/eth0/carrier stuck at 1
  2020-11-03 12:47       ` Juerg Haefliger
@ 2020-11-03 15:27         ` Andrew Lunn
  2020-11-06  8:56           ` Juerg Haefliger
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Lunn @ 2020-11-03 15:27 UTC (permalink / raw)
  To: Juerg Haefliger; +Cc: netdev, woojung.huh

On Tue, Nov 03, 2020 at 01:47:12PM +0100, Juerg Haefliger wrote:
> On Fri, 23 Oct 2020 15:05:19 +0200
> Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > On Fri, Oct 23, 2020 at 08:29:59AM +0200, Juerg Haefliger wrote:
> > > On Wed, 21 Oct 2020 21:35:48 +0200
> > > Andrew Lunn <andrew@lunn.ch> wrote:
> > >   
> > > > On Wed, Oct 21, 2020 at 05:00:53PM +0200, Juerg Haefliger wrote:  
> > > > > Hi,
> > > > > 
> > > > > If the lan78xx driver is compiled into the kernel and the network cable is
> > > > > plugged in at boot, /sys/class/net/eth0/carrier is stuck at 1 and doesn't
> > > > > toggle if the cable is unplugged and replugged.
> > > > > 
> > > > > If the network cable is *not* plugged in at boot, all seems to work fine.
> > > > > I.e., post-boot cable plugs and unplugs toggle the carrier flag.
> > > > > 
> > > > > Also, everything seems to work fine if the driver is compiled as a module.
> > > > > 
> > > > > There's an older ticket for the raspi kernel [1] but I've just tested this
> > > > > with a 5.8 kernel on a Pi 3B+ and still see that behavior.    
> > > > 
> > > > Hi Jürg  
> > > 
> > > Hi Andrew,
> > > 
> > >   
> > > > Could you check if a different PHY driver is being used when it is
> > > > built and broken vs module or built in and working.
> > > > 
> > > > Look at /sys/class/net/eth0/phydev/driver  
> > > 
> > > There's no such file.  
> > 
> > I _think_ that means it is using genphy, the generic PHY driver, not a
> > specific vendor PHY driver? What does
> > 
> > /sys/class/net/eth0/phydev/phy_id contain.
> 
> There is no directory /sys/class/net/eth0/phydev.

[Goes and looks at the code]

The symbolic link is only created if the PHY is connected to the MAC
if the MAC has been registered with the core first. lan78xx does it
the other way around:

        ret = lan78xx_phy_init(dev);
        if (ret < 0)
                goto out4;

        ret = register_netdev(netdev);
        if (ret != 0) {
                netif_err(dev, probe, netdev, "couldn't register the device\n");
                goto out5;
        }

The register dump you show below indicates an ID of 007c132, which
fits the drivers drivers/net/phy/microchip.c : "Microchip
LAN88xx". Any mention of that in dmesg, do you see the module loaded?

> 
> > > Given that all works fine as long as the cable is unplugged at boot points
> > > more towards a race at boot or incorrect initialization sequence or something.  
> > 
> > Could be. Could you run
> > 
> > mii-tool -vv eth0
> 
> Hrm. Running that command unlocks the carrier flag and it starts toggling on
> cable unplug/plug. First invocation:
> 
> $ sudo mii-tool -vv eth0
> Using SIOCGMIIPHY=0x8947
> eth0: negotiated 1000baseT-FD flow-control, link ok
>   registers for MII PHY 1: 
>     1040 79ed 0007 c132 05e1 cde1 000f 0000
>     0000 0200 0800 0000 0000 0000 0000 3000
>     0000 0000 0088 0000 0000 0000 3200 0004
>     0040 a000 a000 0000 a035 0000 0000 0000
>   product info: vendor 00:01:f0, model 19 rev 2
>   basic mode:   autonegotiation enabled
>   basic status: autonegotiation complete, link ok
>   capabilities: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
>   advertising:  1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
>   link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
> 
> Subsequent invocation:
> 
> $ sudo mii-tool -vv eth0
> Using SIOCGMIIPHY=0x8947
> eth0: negotiated 1000baseT-FD flow-control, link ok
>   registers for MII PHY 1: 
>     1040 79ed 0007 c132 05e1 cde1 000d 0000
>     0000 0200 0800 0000 0000 0000 0000 3000
>     0000 0000 0088 0000 0000 0000 3200 0004
>     0040 a000 0000 0000 a035 0000 0000 0000
>   product info: vendor 00:01:f0, model 19 rev 2
>   basic mode:   autonegotiation enabled
>   basic status: autonegotiation complete, link ok
>   capabilities: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
>   advertising:  1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
>   link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
> 
> In the first invocation, register 0x1a shows a pending link-change interrupt
> (0xa000) which wasn't serviced (and cleared) for some reason. Dumping the
> registers cleared that interrupt bit and things start working correctly
> afterwards. Nor sure yet why that first interrupt is ignored.

So, 0x1a is the interrupt status, and 0x19 is the interrupt mask.

This should really be interpreted as a level interrupt. But it appears
the hardware the interrupt is connected to is actually doing edge. And
the edge has been missed, and so the interrupt is never serviced.

I think the call sequence goes something like this, if i'm reading the
code correct:

lan78xx_probe() calls lan78xx_bind:

    lan78xx_bind() registers an interrupt domain. This allows USB
    status messages indicating an interrupt to be dispatched using the
    normal interrupt mechanism, despite there not being a proper
    interrupt as such. The masking of interrupts seems to be part of
    INT_EP_CTL. This register is read during
    lan78xx_setup_irq_domain(), but it is not written to disable all
    interrupts. So it could be, PHY interrupts in the interrupt
    controller are already enabled at this point.

    lan78xx_bind() also registers an MDIO bus. This will cause the bus
    to be probed and the PHY should be found. 

lan78xx_probe() calls lan78xx_phy_init:

    lan78xx_phy_init gets the phydev, and fills in the interrupt
    number. It then connects the PHY to the MAC using
    phy_connect_direct(). phy_connect_direct() will request the
    interrupt from the kernel, meaning it can then service interrupts,
    and as part of that, it calls into the interrupt domain and
    enables interrupts in the interrupt controller.  It clears the
    interrupts in the PHY, which means the PHY interrupt status
    register is read, clearing it. It then enables interrupts in the
    PHY by again reading the status register to clear any pending
    interrupts and then sets the mask to enable interrupts.
    
So at this point, the PHY is ready to generate interrupts, the
interrupt controller in the USB device is ready to accept them. The
kernel itself is read for them, and they should be passed to the PHY
subsystem. What is not clear to me is if the USB endpoint is correctly
setup to report interrupts.

    Interestingly, the last thing lan78xx_phy_init() does is call
    genphy_config_aneg(). That configures the PHY to start an
    auto-neg. That is wrong, on a number of levels. The PHY drivers
    implementation, lan88xx_config_aneg() sets the mdix before
    starting autoneg. That gets skipped by directly calling
    genphy_config_aneg(). However, the interface is not even
    registered yet, let alone up. It has no business starting an
    auto-neg. But if the cable is connected and the peer is ready,
    about 1.5 seconds later, we expect auto-neg to complete and the
    interrupt to fire.

Sometime later, lan78xx_open() is called when the interface is
configured up. It calls phy_start() which is the correct way to get
the PHY going. That will call into the PHY driver to start auto-neg.
I've no idea what happens in this PHY when two auto-negs are going.

What lan78xx_open() also does is trigger a delayed work for
EVENT_LINK_RESET. lan78xx_link_reset() gets called as a result, and
that writes to the interrupt status register to clear it. Why clear
interrupts when we have just started auto-neg and we expect it to
cause an interrupt? The whole of lan78xx_link_reset() looks odd, and
some of it should be moved to lan78xx_link_status_change() which
phylib will call when the PHY changes status.

O.K. As a start, remove the genphy_config_aneg() from
lan78xx_phy_init(). I don't know if that is enough, but it is clearly
wrong.

    Andrew

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: lan78xx: /sys/class/net/eth0/carrier stuck at 1
  2020-11-03 15:27         ` Andrew Lunn
@ 2020-11-06  8:56           ` Juerg Haefliger
  0 siblings, 0 replies; 7+ messages in thread
From: Juerg Haefliger @ 2020-11-06  8:56 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Juerg Haefliger, netdev, woojung.huh

[-- Attachment #1: Type: text/plain, Size: 9380 bytes --]

On Tue, 3 Nov 2020 16:27:23 +0100
Andrew Lunn <andrew@lunn.ch> wrote:

> On Tue, Nov 03, 2020 at 01:47:12PM +0100, Juerg Haefliger wrote:
> > On Fri, 23 Oct 2020 15:05:19 +0200
> > Andrew Lunn <andrew@lunn.ch> wrote:
> >   
> > > On Fri, Oct 23, 2020 at 08:29:59AM +0200, Juerg Haefliger wrote:  
> > > > On Wed, 21 Oct 2020 21:35:48 +0200
> > > > Andrew Lunn <andrew@lunn.ch> wrote:
> > > >     
> > > > > On Wed, Oct 21, 2020 at 05:00:53PM +0200, Juerg Haefliger wrote:    
> > > > > > Hi,
> > > > > > 
> > > > > > If the lan78xx driver is compiled into the kernel and the network cable is
> > > > > > plugged in at boot, /sys/class/net/eth0/carrier is stuck at 1 and doesn't
> > > > > > toggle if the cable is unplugged and replugged.
> > > > > > 
> > > > > > If the network cable is *not* plugged in at boot, all seems to work fine.
> > > > > > I.e., post-boot cable plugs and unplugs toggle the carrier flag.
> > > > > > 
> > > > > > Also, everything seems to work fine if the driver is compiled as a module.
> > > > > > 
> > > > > > There's an older ticket for the raspi kernel [1] but I've just tested this
> > > > > > with a 5.8 kernel on a Pi 3B+ and still see that behavior.      
> > > > > 
> > > > > Hi Jürg    
> > > > 
> > > > Hi Andrew,
> > > > 
> > > >     
> > > > > Could you check if a different PHY driver is being used when it is
> > > > > built and broken vs module or built in and working.
> > > > > 
> > > > > Look at /sys/class/net/eth0/phydev/driver    
> > > > 
> > > > There's no such file.    
> > > 
> > > I _think_ that means it is using genphy, the generic PHY driver, not a
> > > specific vendor PHY driver? What does
> > > 
> > > /sys/class/net/eth0/phydev/phy_id contain.  
> > 
> > There is no directory /sys/class/net/eth0/phydev.  
> 
> [Goes and looks at the code]
> 
> The symbolic link is only created if the PHY is connected to the MAC
> if the MAC has been registered with the core first. lan78xx does it
> the other way around:
> 
>         ret = lan78xx_phy_init(dev);
>         if (ret < 0)
>                 goto out4;
> 
>         ret = register_netdev(netdev);
>         if (ret != 0) {
>                 netif_err(dev, probe, netdev, "couldn't register the device\n");
>                 goto out5;
>         }
> 
> The register dump you show below indicates an ID of 007c132, which
> fits the drivers drivers/net/phy/microchip.c : "Microchip
> LAN88xx". Any mention of that in dmesg, do you see the module loaded?

It's built-in but is being used (as tracing shows).

 
> >   
> > > > Given that all works fine as long as the cable is unplugged at boot points
> > > > more towards a race at boot or incorrect initialization sequence or something.    
> > > 
> > > Could be. Could you run
> > > 
> > > mii-tool -vv eth0  
> > 
> > Hrm. Running that command unlocks the carrier flag and it starts toggling on
> > cable unplug/plug. First invocation:
> > 
> > $ sudo mii-tool -vv eth0
> > Using SIOCGMIIPHY=0x8947
> > eth0: negotiated 1000baseT-FD flow-control, link ok
> >   registers for MII PHY 1: 
> >     1040 79ed 0007 c132 05e1 cde1 000f 0000
> >     0000 0200 0800 0000 0000 0000 0000 3000
> >     0000 0000 0088 0000 0000 0000 3200 0004
> >     0040 a000 a000 0000 a035 0000 0000 0000
> >   product info: vendor 00:01:f0, model 19 rev 2
> >   basic mode:   autonegotiation enabled
> >   basic status: autonegotiation complete, link ok
> >   capabilities: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
> >   advertising:  1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
> >   link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
> > 
> > Subsequent invocation:
> > 
> > $ sudo mii-tool -vv eth0
> > Using SIOCGMIIPHY=0x8947
> > eth0: negotiated 1000baseT-FD flow-control, link ok
> >   registers for MII PHY 1: 
> >     1040 79ed 0007 c132 05e1 cde1 000d 0000
> >     0000 0200 0800 0000 0000 0000 0000 3000
> >     0000 0000 0088 0000 0000 0000 3200 0004
> >     0040 a000 0000 0000 a035 0000 0000 0000
> >   product info: vendor 00:01:f0, model 19 rev 2
> >   basic mode:   autonegotiation enabled
> >   basic status: autonegotiation complete, link ok
> >   capabilities: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
> >   advertising:  1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
> >   link partner: 1000baseT-FD 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
> > 
> > In the first invocation, register 0x1a shows a pending link-change interrupt
> > (0xa000) which wasn't serviced (and cleared) for some reason. Dumping the
> > registers cleared that interrupt bit and things start working correctly
> > afterwards. Nor sure yet why that first interrupt is ignored.  
> 
> So, 0x1a is the interrupt status, and 0x19 is the interrupt mask.

Yes.


> This should really be interpreted as a level interrupt. But it appears
> the hardware the interrupt is connected to is actually doing edge. And
> the edge has been missed, and so the interrupt is never serviced.
> 
> I think the call sequence goes something like this, if i'm reading the
> code correct:
> 
> lan78xx_probe() calls lan78xx_bind:
> 
>     lan78xx_bind() registers an interrupt domain. This allows USB
>     status messages indicating an interrupt to be dispatched using the
>     normal interrupt mechanism, despite there not being a proper
>     interrupt as such. The masking of interrupts seems to be part of
>     INT_EP_CTL. This register is read during
>     lan78xx_setup_irq_domain(), but it is not written to disable all
>     interrupts. So it could be, PHY interrupts in the interrupt
>     controller are already enabled at this point.
> 
>     lan78xx_bind() also registers an MDIO bus. This will cause the bus
>     to be probed and the PHY should be found. 
> 
> lan78xx_probe() calls lan78xx_phy_init:
> 
>     lan78xx_phy_init gets the phydev, and fills in the interrupt
>     number. It then connects the PHY to the MAC using
>     phy_connect_direct(). phy_connect_direct() will request the
>     interrupt from the kernel, meaning it can then service interrupts,
>     and as part of that, it calls into the interrupt domain and
>     enables interrupts in the interrupt controller.  It clears the
>     interrupts in the PHY, which means the PHY interrupt status
>     register is read, clearing it. It then enables interrupts in the
>     PHY by again reading the status register to clear any pending
>     interrupts and then sets the mask to enable interrupts.
>     
> So at this point, the PHY is ready to generate interrupts, the
> interrupt controller in the USB device is ready to accept them. The
> kernel itself is read for them, and they should be passed to the PHY
> subsystem. What is not clear to me is if the USB endpoint is correctly
> setup to report interrupts.
> 
>     Interestingly, the last thing lan78xx_phy_init() does is call
>     genphy_config_aneg(). That configures the PHY to start an
>     auto-neg. That is wrong, on a number of levels. The PHY drivers
>     implementation, lan88xx_config_aneg() sets the mdix before
>     starting autoneg. That gets skipped by directly calling
>     genphy_config_aneg(). However, the interface is not even
>     registered yet, let alone up. It has no business starting an
>     auto-neg. But if the cable is connected and the peer is ready,
>     about 1.5 seconds later, we expect auto-neg to complete and the
>     interrupt to fire.
> 
> Sometime later, lan78xx_open() is called when the interface is
> configured up. It calls phy_start() which is the correct way to get
> the PHY going. That will call into the PHY driver to start auto-neg.
> I've no idea what happens in this PHY when two auto-negs are going.
> 
> What lan78xx_open() also does is trigger a delayed work for
> EVENT_LINK_RESET. lan78xx_link_reset() gets called as a result, and
> that writes to the interrupt status register to clear it. Why clear
> interrupts when we have just started auto-neg and we expect it to
> cause an interrupt? The whole of lan78xx_link_reset() looks odd, and
> some of it should be moved to lan78xx_link_status_change() which
> phylib will call when the PHY changes status.
> 
> O.K. As a start, remove the  in  from
> lan78xx_phy_init(). I don't know if that is enough, but it is clearly
> wrong.

The interface won't come up with that change. FWIW I recorded some traces
(functions phy_*, lan88xx_*, lan78xx_*):
[1] lan78xx built-in.
[2] lan78xx built-in without genphy_config_aneg() in lan78xx_phy_init().
[3] lan78xx as a module.

I've also added some trace_printks that dump the content of the status and
mask registers at the end of lan78xx_probe() (no interrupt pending) and in 
lan78xx_get_drvinfo() which seems to be the next lan78xx function being
called after probing (per the built-in trace). At that time an interrupt
seems to be pending (lan78xx_get_drvinfo LAN88XX_INT_STS = 0xa000).

...Juerg

[1] https://kernel.ubuntu.com/~juergh/lp1890487/trace.builtin
[2] https://kernel.ubuntu.com/~juergh/lp1890487/trace.builtin-no-aneg
[3] https://kernel.ubuntu.com/~juergh/lp1890487/trace.module

 
>     Andrew


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-11-06  8:56 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-21 15:00 lan78xx: /sys/class/net/eth0/carrier stuck at 1 Juerg Haefliger
2020-10-21 19:35 ` Andrew Lunn
2020-10-23  6:29   ` Juerg Haefliger
2020-10-23 13:05     ` Andrew Lunn
2020-11-03 12:47       ` Juerg Haefliger
2020-11-03 15:27         ` Andrew Lunn
2020-11-06  8:56           ` Juerg Haefliger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).