From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
To: ѽ҉ᶬḳ℠ <vtol@gmx.net>
Cc: Andrew Lunn <andrew@lunn.ch>, netdev@vger.kernel.org
Subject: Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
Date: Fri, 10 Jan 2020 19:01:34 +0000
Message-ID: <20200110190134.GL25745@shell.armlinux.org.uk>
In-Reply-To: <e18b0fb9-0c6d-ed5e-3a20-dc29e9cc048e@gmx.net>

On Fri, Jan 10, 2020 at 06:44:18PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 10/01/2020 17:38, Russell King - ARM Linux admin wrote:
> > 
> > > > On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > > > It seems the debug avenue has been exhausted, short of running sfp.c in
> > > > > debug mode.
> > > > You're saying you never see TX_FAULT asserted other than when the
> > > > interface is down?
> > > Yes, it never shows up once the iif is up; it is rock-stable in that
> > > state and only ever occurs when the iif transitions from down to up.
> > > Pardon me if that has not been made explicitly clear previously.
> > I think if we were to have SFP debug enabled, you'll find that
> > TX_FAULT is being reported to SFP as being asserted.
> 
> If really necessary, I could ask the TOS developers to assist, but I'm not
> sure whether they would oblige. Their master branch build bot compiles twice
> a day.
> Would it just involve setting a kernel debug flag or something more
> elaborate?
> 
> > 
> > You probably aren't running that while loop, as it will exit when
> > it sees TX_FAULT asserted.  So, here's another bit of shell code
> > for you to run:
> > 
> > ip li set dev eth2 down; \
> > ip li set dev eth2 up; \
> > date
> > while :; do
> >    cat /proc/uptime
> >    while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
> >    cat /proc/uptime
> >    while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
> > done
> > 
> > This will give you output such as:
> > 
> > Fri 10 Jan 17:31:06 GMT 2020
> > 774869.13 1535859.48
> >   gpio-509 (                    |tx-fault            ) in  hi ...
> > 774869.14 1535859.49
> >   gpio-509 (                    |tx-fault            ) in  lo ...
> > 774869.15 1535859.50
> > 
> > The first date and "uptime" output is the timestamp when the interface
> > was brought up.  Each subsequent "uptime" output, minus the first one,
> > gives the time in seconds at which the state printed immediately
> > before it was observed.
> > 
> > So in the above example, the tx-fault signal was hi at 10ms, and then
> > went low 20ms after the up.
> 
> Awfully nice of you to provide the code. This is the output when the iif is
> brought down and up again and exhibits the transmit fault.
> 
> ip li set dev eth2 down; \
> > ip li set dev eth2 up; \
> > date
> Fri Jan 10 18:34:52 GMT 2020
> root@to:~# while :; do
> >   cat /proc/uptime
> >   while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
> >   cat /proc/uptime
> >   while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
> > done

Hmm, I missed a ; \ at the end of "date", so this isn't quite what
I wanted, but it'll do.  What that means is that:

> 1865.20 3224.67

doesn't bear the relationship that I wanted to the interface coming
up.
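One way to avoid the missing "; \" problem is to wrap the whole sequence in a shell function, so that the first /proc/uptime reading is taken immediately after "date" as part of the same command rather than whenever the loop is typed or pasted. This is a minimal sketch, assuming the interface is eth2 and debugfs is mounted at /sys/kernel/debug; the function name sfp_txfault_trace is hypothetical and not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): bring the interface down and
# up, print the wall-clock time, then log every TX_FAULT transition together
# with /proc/uptime. Because it all runs as one command, the first uptime
# line is taken immediately after the interface is brought up.
# Assumes debugfs is mounted at /sys/kernel/debug. Runs until interrupted.
sfp_txfault_trace() {
    iface=${1:-eth2}
    gpio=${2:-/sys/kernel/debug/gpio}

    ip link set dev "$iface" down
    ip link set dev "$iface" up
    date
    while :; do
        cat /proc/uptime
        # busy-wait until the tx-fault line reads "in  hi", then show it
        while ! grep -A5 'tx-fault.*in  hi' "$gpio"; do :; done
        cat /proc/uptime
        # busy-wait until it reads "in  lo" again, then show it
        while ! grep -A5 'tx-fault.*in  lo' "$gpio"; do :; done
    done
}

# Usage (as root): sfp_txfault_trace eth2 | tee /tmp/txfault.log
```

Like the original loop, this only samples the GPIO state, so very short pulses on TX_FAULT can still be missed.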

>  gpio-504 (                    |tx-fault            ) in  hi IRQ
>  gpio-505 (                    |tx-disable          ) out hi
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1871.77 3230.71

TX_FAULT is high at 1871.77 and TX_DISABLE is high, so the interface
is down.

>  gpio-504 (                    |tx-fault            ) in  lo IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1919.06 3309.55

Almost 47.3s later, TX_FAULT has gone low.

>  gpio-504 (                    |tx-fault            ) in  hi IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1919.07 3309.57

After 10ms, it goes high again - this will cause the first report of
a transmit fault.

>  gpio-504 (                    |tx-fault            ) in  lo IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1920.68 3312.28

About 1.6s later, it goes low, maybe as a result of the first attempt
to clear the fault by a brief pulse on TX_DISABLE.

So, we wait 1s before asserting TX_DISABLE for 10us, which would have
happened around 1920.07.  We then have 300ms for initialisation, which
would've taken us to 1920.37.  Since TX_FAULT did not go low until
1920.68, this may well have been interpreted as the fault still being
present.  The next clearance attempt would have been scheduled for about
1921.37.
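The timeline arithmetic above can be checked with a small awk calculation. This is only a sketch: the function name is hypothetical, and the 1s wait, 300ms initialisation window and 1s retry interval are simply the figures used in the paragraph above, not values read from the driver source.

```shell
#!/bin/sh
# Hypothetical helper: given the uptime (seconds) at which TX_FAULT was
# first seen asserted, print when the first clearance attempt, the end of
# the initialisation window and the next attempt would fall.
# Timings are taken from the analysis above (1s wait, 300ms init, 1s retry);
# the 10us TX_DISABLE pulse is too short to show at 10ms uptime resolution.
sfp_fault_timeline() {
    awk -v t="$1" 'BEGIN {
        clear = t + 1.0          # TX_DISABLE pulsed 1s after the fault
        init_done = clear + 0.3  # 300ms allowed for initialisation
        retry = init_done + 1.0  # next clearance attempt about 1s later
        printf "clear=%.2f init_done=%.2f retry=%.2f\n", clear, init_done, retry
    }'
}

sfp_fault_timeline 1919.07
# prints: clear=1920.07 init_done=1920.37 retry=1921.37
```

Comparing init_done (1920.37) with the time TX_FAULT actually went low (1920.68) shows why the first clearance attempt would have looked unsuccessful.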

>  gpio-504 (                    |tx-fault            ) in  hi IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1921.86 3314.21

1.2s later, it re-asserts.

>  gpio-504 (                    |tx-fault            ) in  lo IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1921.86 3314.21

and deasserts within the same 10ms.
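Rather than working out the intervals by hand, a captured trace can be post-processed. A minimal sketch, assuming the raw terminal output (without the mail's "> " quoting) was saved to a file; the function name is hypothetical. For each /proc/uptime line after the first, it prints the tx-fault state seen just before it, the uptime, the offset from the first reading and the interval since the previous reading.

```shell
#!/bin/sh
# Hypothetical helper: read a trace produced by the loop above on stdin
# and print one line per transition:
#   <state> <uptime> +<seconds since first reading> +<seconds since previous>
sfp_txfault_deltas() {
    awk '
        # remember the most recent tx-fault level ("in  hi" or "in  lo")
        /tx-fault/ { state = ($0 ~ /in  hi/) ? "hi" : "lo"; next }
        # /proc/uptime lines have exactly two fields: uptime and idle time
        NF == 2 && $1 ~ /^[0-9]+\.[0-9]+$/ {
            if (seen)
                printf "%s %.2f +%.2f +%.2f\n", state, $1, $1 - first, $1 - prev
            else
                first = $1
            prev = $1
            seen = 1
        }
    '
}

# Usage: sfp_txfault_deltas < /tmp/txfault.log
```

Fed the trace quoted above, the last column reproduces the intervals discussed here (47.29s, 0.01s, 1.61s, 1.18s, 0.00s).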

> > However, bear in mind that even this will not be good enough to spot
> > transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt
> > capable, watching /proc/interrupts may tell you more.
> > 
> > If the TX_FAULT signal is as stable as you claim it is, you should see
> > the interrupt count for it remaining the same.
> 
> Once the iif is up those values remain stable indeed.
> 
> cat /proc/interrupts | grep sfp
>  52:          0          0   pca953x   4 Edge      sfp
>  53:          0          0   pca953x   3 Edge      sfp
>  54:          6          0   pca953x   0 Edge      sfp
> 
> and only incrementing with ifupdown action (which would be logical)
> 
> cat /proc/interrupts | grep sfp
>  52:          0          0   pca953x   4 Edge      sfp
>  53:          0          0   pca953x   3 Edge      sfp
>  54:         11          0   pca953x   0 Edge      sfp

According to this, TX_FAULT has toggled five times: IRQ 54 (pca953x
line 0, which is tx-fault on gpio-504) went from 6 to 11.

This would seem to negate your previous comment about TX_FAULT being
stable.
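The same comparison can be made mechanically by snapshotting /proc/interrupts before and after the ifdown/ifup and subtracting the counts. A minimal sketch; the function name and temporary file paths are hypothetical, and it assumes the IRQ action name is "sfp" as in the output quoted above.

```shell
#!/bin/sh
# Hypothetical helper: compare two snapshots of /proc/interrupts and print,
# for each IRQ whose last column is "sfp", how many interrupts (summed over
# all CPU columns) were taken between the two snapshots.
sfp_irq_delta() {
    awk '
        # sum the per-CPU count columns, which follow the "NN:" field
        function total(   i, s) {
            s = 0
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                s += $i
            return s
        }
        FILENAME == ARGV[1] { if ($NF == "sfp") before[$1] = total(); next }
        $NF == "sfp" && ($1 in before) { print $1, total() - before[$1] }
    ' "$1" "$2"
}

# Usage (as root):
#   cat /proc/interrupts > /tmp/irq.before
#   ip link set dev eth2 down; ip link set dev eth2 up; sleep 5
#   cat /proc/interrupts > /tmp/irq.after
#   sfp_irq_delta /tmp/irq.before /tmp/irq.after
```

With the counts quoted above (6 before and 11 after for IRQ 54), it would report "54: 5" and "0" for IRQs 52 and 53.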

Therefore, I'd say that the SFP state machines are operating as
designed and as per the SFP MSA.  What we have is a module that likes
to assert TX_FAULT for unknown reasons, which confirms the hypothesis
I've been putting forward.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
