* [drivers/net/phy/sfp] intermittent failure in state machine checks
@ 2020-01-09 13:47 ѽ҉ᶬḳ℠
  2020-01-09 14:41 ` Andrew Lunn
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-09 13:47 UTC (permalink / raw)
  To: netdev

On a node running 4.19.93 with an SFP module (specs at the bottom), the 
following is intermittently observed:

libphy: SFP I2C Bus: probed
sfp sfp: module ALLNET ALL4781 rev V3.4 sn 0000000FC9157640 dc 29-03-18
sfp sfp: unknown connector, encoding 8b10b, nominal bitrate 1.3Gbps +0% -0%
sfp sfp: 1000BaseSX+ 1000BaseLX- 1000BaseCX- 1000BaseT- 100BaseTLX- 1000BaseFX- BaseBX10- BasePX-
sfp sfp: 10GBaseSR- 10GBaseLR- 10GBaseLRM- 10GBaseER-
sfp sfp: Wavelength 0nm, fiber lengths:
sfp sfp: 9µm SM : unsupported
sfp sfp: 62.5µm MM OM1: unsupported/unspecified
sfp sfp: 50µm MM OM2: unsupported/unspecified
sfp sfp: 50µm MM OM3: unsupported/unspecified
sfp sfp: 50µm MM OM4: 2.540km
sfp sfp: Options: retimer
sfp sfp: Diagnostics:
sfp sfp: module transmit fault indicated
sfp sfp: module transmit fault recovered
sfp sfp: module transmit fault indicated
sfp sfp: module persistently indicates fault, disabling

As far as I understand, this relates to the checks in the following state
machine states:

- SFP_S_WAIT_LOS
- SFP_S_LINK_UP

These seem to be done via the I2C/SMBus, but it is not clear to me what 
causes the check to fail or how to remedy it.

____
SFP module specs

Identifier                                : 0x03 (SFP)
Extended identifier                       : 0x04 (GBIC/SFP defined by 2-wire interface ID)
Connector                                 : 0x22 (RJ45)
Transceiver codes                         : 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00
Transceiver type                          : Ethernet: 1000BASE-SX
Encoding                                  : 0x01 (8B/10B)
BR, Nominal                               : 1300MBd
Rate identifier                           : 0x00 (unspecified)
Length (SMF,km)                           : 0km
Length (SMF)                              : 0m
Length (50um)                             : 0m
Length (62.5um)                           : 0m
Length (Copper)                           : 255m
Length (OM3)                              : 0m
Laser wavelength                          : 0nm
Vendor name                               : ALLNET
Vendor OUI                                : 00:0f:c9
Vendor PN                                 : ALL4781
Vendor rev                                : V3.4
Option values                             : 0x08 0x00
Option                                    : Retimer or CDR implemented
BR margin, max                            : 0%
BR margin, min                            : 0%
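
A note on reading the dump above: the "Transceiver type" line is derived from the fourth transceiver-code value, which is EEPROM byte 6. To the best of my understanding of SFF-8472, each bit in that byte advertises one Ethernet compliance code. The following is a minimal Python sketch of that decoding; the function and table names are made up for illustration, and it is not the code ethtool or the kernel uses.

```python
# Ethernet compliance bits in SFP EEPROM byte 6, bit 0 first
# (bit assignment as understood from SFF-8472; names are illustrative).
BYTE6_CODES = [
    "1000BASE-SX", "1000BASE-LX", "1000BASE-CX", "1000BASE-T",
    "100BASE-LX/LX10", "100BASE-FX", "BASE-BX10", "BASE-PX",
]


def decode_byte6(value):
    """Return the names of all compliance bits set in EEPROM byte 6."""
    return [name for bit, name in enumerate(BYTE6_CODES) if value & (1 << bit)]


# "Transceiver codes" in the dump starts at EEPROM byte 3, so byte 6 is
# the fourth value listed.
codes = [0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00]
print(decode_byte6(codes[3]))
```

With the values from the dump, only the 1000BASE-SX bit is set, which is why the module is reported as 1000BASE-SX regardless of what it physically is.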

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 13:47 [drivers/net/phy/sfp] intermittent failure in state machine checks ѽ҉ᶬḳ℠
@ 2020-01-09 14:41 ` Andrew Lunn
  2020-01-09 15:03   ` ѽ҉ᶬḳ℠
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Lunn @ 2020-01-09 14:41 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠, Russell King; +Cc: netdev

On Thu, Jan 09, 2020 at 01:47:31PM +0000, ѽ҉ᶬḳ℠ wrote:
> On node with 4.19.93 and a SFP module (specs at the bottom) the following is
> intermittently observed:

Please make sure Russell King is in Cc: for SFP issues.

The state machine has been reworked recently. Please could you try
net-next, or 5.5-rc5.

Thanks
	Andrew


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 14:41 ` Andrew Lunn
@ 2020-01-09 15:03   ` ѽ҉ᶬḳ℠
  2020-01-09 15:58     ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-09 15:03 UTC (permalink / raw)
  To: Andrew Lunn, Russell King; +Cc: netdev

On 09/01/2020 14:41, Andrew Lunn wrote:
> On Thu, Jan 09, 2020 at 01:47:31PM +0000, ѽ҉ᶬḳ℠ wrote:
>> On node with 4.19.93 and a SFP module (specs at the bottom) the following is
>> intermittently observed:
> Please make sure Russell King is in Cc: for SFP issues.
>
> The state machine has been reworked recently. Please could you try
> net-next, or 5.5-rc5.
>
> Thanks
> 	Andrew
Unfortunately, testing those branches is not feasible, since the router 
(see architecture below) that hosts the SFP module runs the OpenWrt 
downstream distro with LTS kernels; in their master development branch, 
4.19.93 is the most recent on offer.

Could the reworked state machine commits be applied as patches on top of 
the 4.19 kernel, and if so, which commits would those be?
If not, will those commits eventually make their way to the LTS 
branches, and what would be the expected time frame?

The problem with those failing state machine checks is an inconvenient 
disruption of the node's WAN connectivity, often requiring a reboot of 
the node to get connectivity reinstated.

Not sure whether it is pertinent at all (I may be clueless here), but I 
noticed that for big endian systems a check for an inverted LOS signal is 
implemented, but not for little endian systems.

___
Architecture:        armv7l
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           1
Vendor ID:           ARM
Model:               1
Model name:          Cortex-A9
Stepping:            r4p1
BogoMIPS:            1600.00
Flags:               half thumb fastmult vfp edsp neon vfpv3 tls vfpd32


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 15:03   ` ѽ҉ᶬḳ℠
@ 2020-01-09 15:58     ` Russell King - ARM Linux admin
  2020-01-09 17:35       ` ѽ҉ᶬḳ℠
  0 siblings, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-09 15:58 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Thu, Jan 09, 2020 at 03:03:24PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 09/01/2020 14:41, Andrew Lunn wrote:
> > On Thu, Jan 09, 2020 at 01:47:31PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > On node with 4.19.93 and a SFP module (specs at the bottom) the following is
> > > intermittently observed:
> > Please make sure Russell King is in Cc: for SFP issues.
> > 
> > The state machine has been reworked recently. Please could you try
> > net-next, or 5.5-rc5.
> > 
> > Thanks
> > 	Andrew
> Unfortunately testing those branches is not feasible since the router (see
> architecture below) that host the SFP module deploys the OpenWrt downstream
> distro with LTS kernels - in their Master development branch 4.19.93 being
> the most recent on offer.

I don't think the rework will make any difference in this case, and
I don't think there's anything failing in the software here.  The
reported problem seems to be this:

 sfp sfp: module transmit fault indicated
 sfp sfp: module transmit fault recovered
 sfp sfp: module transmit fault indicated
 sfp sfp: module persistently indicates fault, disabling

which occurs if the module asserts the TX_FAULT signal.  The SFP MSA
defines that this indicates a problem with the laser safety circuitry,
and defines a way to reset the fault (by pulsing TX_DISABLE and going
through another initialisation).

When TX_FAULT is asserted for the first time, "module transmit fault
indicated" is printed, and we start the process of recovery.  If we
successfully recover, then "module transmit fault recovered" will be
printed.

We try several times to recover the fault, and once we're out of
retries, "module persistently indicates fault, disabling" will be
printed; at that point, we've declared the module to be dead, and
we won't do anything further with it.

This is by design; if the module is saying that the laser safety
circuitry is faulty, then endlessly resetting the module to recover
from that fault is not sane.
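
The recovery behaviour described above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions, not the kernel's implementation: the function name, the retry budget of 3, and the idea that the budget is not replenished after a recovery are all hypothetical choices made for the example.

```python
def simulate_tx_fault(samples, max_retries=3):
    """Replay TX_FAULT readings and return the messages a host might log.

    samples: TX_FAULT levels seen after each (re)initialisation.
    max_retries: hypothetical recovery budget, not the kernel's value.
    """
    log = []
    retries_left = max_retries
    faulted = False
    for asserted in samples:
        if asserted:
            if retries_left == 0:
                log.append("module persistently indicates fault, disabling")
                break  # module is declared dead; nothing further is done
            if not faulted:
                log.append("module transmit fault indicated")
                faulted = True
            retries_left -= 1  # pulse TX_DISABLE and re-initialise
        elif faulted:
            log.append("module transmit fault recovered")
            faulted = False
    return log


for message in simulate_tx_fault([True, False, True, True, True]):
    print("sfp sfp:", message)
```

Fed a fault, a recovery, and then a fault that never clears, the sketch produces the same four-message sequence as in the report.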

However, there are some modules (particularly GPON modules) that do
things quite differently from what the SFP MSA says, which is
extremely annoying and frustrating for those of us who are trying to
implement the host support.  There are some which seem to assert
TX_FAULT for unknown reasons.

In your original post (which you need to have sent to me, I don't
read netdev) you've provided "SFP module specs" - not really, you
provided the ethtool output, which is not the same as the module
specs.  Many modules have misleading EEPROM information, sometimes
to work around what people call "vendor lockin" or maybe to get
their module to work in some specific equipment.  In any case,
EEPROM information is not a specification.

For example, your module claims to be a 1000BASE-SX module.  If
I look up "allnet ALL4781", I find that it's a VDSL2 module.  That
isn't a 1000BASE-SX module - 1000BASE-SX is an IEEE 802.3 defined
term to mean 1000BASE-X over fiber using a short-wavelength laser.

So, given that it doesn't have a laser, why is it raising TX_FAULT?
No idea; these modules are a law unto themselves.

I think the only thing we could do is to implement a workaround to
ignore TX_FAULT for this module... great, more quirks. :(
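
As a rough illustration of what such a workaround could look like, here is a Python sketch of a quirk table that masks TX_FAULT for one specific module. Everything in it is hypothetical (the table, the `ignore_tx_fault` flag, the function name); it is not an existing kernel interface, and it matches on EEPROM vendor and part-number strings, which, as noted above, cannot always be trusted.

```python
# Hypothetical quirk table keyed by (vendor name, part number) as read
# from the module EEPROM; the entry mirrors the module in this thread.
QUIRKS = {
    ("ALLNET", "ALL4781"): {"ignore_tx_fault": True},
}


def effective_tx_fault(vendor, part, tx_fault_pin):
    """Return the TX_FAULT value the state machine should act on."""
    # EEPROM string fields are space padded, so strip before matching.
    quirk = QUIRKS.get((vendor.strip(), part.strip()), {})
    if quirk.get("ignore_tx_fault"):
        return False  # treat the signal as never asserted for this module
    return tx_fault_pin


print(effective_tx_fault("ALLNET", "ALL4781", True))
print(effective_tx_fault("OTHER", "MODULE", True))
```

Modules without a quirk entry keep the normal behaviour, so the laser safety handling stays intact for everything else.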

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 15:58     ` Russell King - ARM Linux admin
@ 2020-01-09 17:35       ` ѽ҉ᶬḳ℠
  2020-01-09 17:43         ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-09 17:35 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev


Kai Meitzner
On 09/01/2020 15:58, Russell King - ARM Linux admin wrote:
> On Thu, Jan 09, 2020 at 03:03:24PM +0000, ѽ҉ᶬḳ℠ wrote:
>> On 09/01/2020 14:41, Andrew Lunn wrote:
>>> On Thu, Jan 09, 2020 at 01:47:31PM +0000, ѽ҉ᶬḳ℠ wrote:
>>>> On node with 4.19.93 and a SFP module (specs at the bottom) the following is
>>>> intermittently observed:
>>> Please make sure Russell King is in Cc: for SFP issues.
>>>
>>> The state machine has been reworked recently. Please could you try
>>> net-next, or 5.5-rc5.
>>>
>>> Thanks
>>> 	Andrew
>> Unfortunately testing those branches is not feasible since the router (see
>> architecture below) that host the SFP module deploys the OpenWrt downstream
>> distro with LTS kernels - in their Master development branch 4.19.93 being
>> the most recent on offer.
> I don't think the rework will make any difference in this case, and
> I don't think there's anything failing in the software here.  The
> reported problem seems to be this:
>
>   sfp sfp: module transmit fault indicated
>   sfp sfp: module transmit fault recovered
>   sfp sfp: module transmit fault indicated
>   sfp sfp: module persistently indicates fault, disabling
>
> which occurs if the module asserts the TX_FAULT signal.  The SFP MSA
> defines that this indicates a problem with the laser safety circuitry,
> and defines a way to reset the fault (by pulsing TX_DISABLE and going
> through another initialisation).
>
> When TX_FAULT is asserted for the first time, "module transmit fault
> indicated" is printed, and we start the process of recovery.  If we
> successfully recover, then "module transmit fault recovered" will be
> printed.
>
> We try several times to recover the fault, and once we're out of
> retries, "module persistently indicates fault, disabling" will be
> printed; at that point, we've declared the module to be dead, and
> we won't do anything further with it.
>
> This is by design; if the module is saying that the laser safety
> circuitry is faulty, then endlessly resetting the module to recover
> from that fault is not sane.
>
> However, there's some modules (particularly GPON modules) that do
> things quite differently from what the SFP MSA says, which is
> extremely annoying and frustrating for those of us who are trying to
> implement the host support.  There are some which seem to assert
> TX_FAULT for unknown reasons.
>
> In your original post (which you need to have sent to me, I don't
> read netdev) you've provided "SFP module specs" - not really, you
> provided the ethtool output, which is not the same as the module
> specs.  Many modules have misleading EEPROM information, sometimes
> to work around what people call "vendor lockin" or maybe to get
> their module to work in some specific equipment.  In any case,
> EEPROM information is not a specification.
>
> For example, your module claims to be a 1000BASE-SX module.  If
> I lookup "allnet ALL4781", I find that it's a VDSL2 module.  That
> isn't a 1000BASE-SX module - 1000BASE-SX is an IEEE 802.3 defined
> term to mean 1000BASE-X over fiber using a short-wavelength laser.
>
> So, given that it doesn't have a laser, why is it raising TX_FAULT.
> No idea; these modules are a law to themselves.
>
> I think the only thing we could do is to implement a workaround to
> ignore TX_FAULT for this module... great, more quirks. :(
>

Thank you for the extensive feedback and explanation.

Pardon for having mixed up the semantics on module specifications vs. 
EEPROM dump...

The module (chipset) was designed by Metanoia; I am not sure who the 
actual manufacturer is, and it is probably just branded Allnet.
The designer provides some proprietary management software (called EBM) 
to their wholesale buyers only.

Through EBM (I got hold of something called DSLmonitor Lite) it reports as:

- board type                   AURORA
- Mode / Operation mode        RT / VDSL

If it would be of any help, I could provide what the EBM software calls a 
"SoC dump".

I opened a support ticket with Allnet a few days back; their response is 
yet to arrive.



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 17:35       ` ѽ҉ᶬḳ℠
@ 2020-01-09 17:43         ` Russell King - ARM Linux admin
  2020-01-09 19:01           ` ѽ҉ᶬḳ℠
  0 siblings, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-09 17:43 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote:
> Thank you for the extensive feedback and explanation.
> 
> Pardon for having mixed up the semantics on module specifications vs. EEPROM
> dump...
> 
> The module (chipset) been designed by Metanoia, not sure who is the actual
> manufacturer, and probably just been branded Allnet.
> The designer provides some proprietary management software (called EBM) to
> their wholesale buyers only

I have one of their early MT-V5311 modules, but it has no accessible
EEPROM, and even if it did, it would be of no use to me being
unapproved for connection to the BT Openreach network.  (BT SIN 498
specifies non-standard power profile to avoid crosstalk issues with
existing ADSL infrastructure, and I believe they regularly check the
connected modem type and firmware versions against an approved list.)

I haven't noticed the module I have asserting its TX_FAULT signal,
but then its RJ45 has never been connected to anything.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 17:43         ` Russell King - ARM Linux admin
@ 2020-01-09 19:01           ` ѽ҉ᶬḳ℠
  2020-01-09 19:42             ` ѽ҉ᶬḳ℠
  2020-01-09 21:34             ` Russell King - ARM Linux admin
  0 siblings, 2 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-09 19:01 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 09/01/2020 17:43, Russell King - ARM Linux admin wrote:
> On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote:
>> Thank you for the extensive feedback and explanation.
>>
>> Pardon for having mixed up the semantics on module specifications vs. EEPROM
>> dump...
>>
>> The module (chipset) been designed by Metanoia, not sure who is the actual
>> manufacturer, and probably just been branded Allnet.
>> The designer provides some proprietary management software (called EBM) to
>> their wholesale buyers only
> I have one of their early MT-V5311 modules, but it has no accessible
> EEPROM, and even if it did, it would be of no use to me being
> unapproved for connection to the BT Openreach network.  (BT SIN 498
> specifies non-standard power profile to avoid crosstalk issues with
> existing ADSL infrastructure, and I believe they regularly check the
> connected modem type and firmware versions against an approved list.)
>
> I haven't noticed the module I have asserting its TX_FAULT signal,
> but then its RJ45 has never been connected to anything.
>

The curious (and sort of inexplicable) thing is that the module 
generally works, i.e. at some point it must pass the sm checks, or 
connectivity would be failing constantly and the module would be 
unusable.

The reported issues, however, are intermittent, and usually reliably 
reproducible with

ifdown <iface> && ifup <iface>

or by rebooting the router that hosts the module.

If some time passes between ifdown and ifup (not sure how long, but it 
seems to be in excess of 3 minutes), the sm checks mostly do not fail.
It somehow "feels" as if the module is storing some link signal 
information in a register which does not suit the sm check routine, and 
only when that register clears does the sm check routine pass and 
connectivity get restored.
____

Since there are probably other such SFP modules, xDSL and g.fast, out 
there that do not provide laser safety circuitry by design (since they 
do not provide connectivity over fibre), would it perhaps make sense to 
check for the existence of laser safety circuitry first, prior to 
getting to the sm checks?
____

Some time in the past sfp.c was not available in the distro and the issue 
never showed up. Back then the module's operation mode was set through 
a py script (see bottom), but it would appear that it did not implement 
any sm checks.

---py script

# Excerpt only: d(), l() and EEPROM are used here but are not defined
# in this part of the script.
class SFP:
    def __init__(self, i2cbus):
        self.i2cbus = i2cbus

    @staticmethod
    def detect_metanoia_xdsl(eeprom):
        # Compare the 11 characters at EEPROM offsets 40..50 with a known string.
        return ['X', 'C', 'V', 'R', '-', '0', '8', '0', 'Y', '5',
                '5'] == eeprom[40:51]

    @staticmethod
    def detect_zisa_gpon(eeprom):
        # Compare the 7 characters at EEPROM offsets 40..46 with a known string.
        return ['T', 'W', '2', '3', '6', '2', 'H'] == eeprom[40:47]

    @staticmethod
    def detect_sgmii(eeprom):
        # Bit 3 of EEPROM byte 6 selects SGMII; otherwise 1000BASE-X is assumed.
        if ord(eeprom[6]) & 0x08:
            d("Mode selected: generic SGMII")
            return True
        else:
            d("Mode selected: generic 1000BASE-X")

        return False

    def decide_sfpmode(self):
        ec = []
        try:
            ec = list(EEPROM(self.i2cbus).read_eeprom())
            d("SFP EEPROM: %s" % str(ec))
        except Exception as e:
            # Fall back to the default mode if the EEPROM cannot be read.
            l("EEPROM read error: " + str(e))
            return 'phy-sfp'

        # special case: Metanoia xDSL SFP, 1000BASE-X, no link autonegotiation
        if self.detect_metanoia_xdsl(ec):
            l("Metanoia DSL SFP detected. Switching to phy-sfp-noneg mode.")
            return 'phy-sfp-noneg'

        # special case: Zisa GPON SFP, SGMII
        if self.detect_zisa_gpon(ec):
            l("Zisa GPON SFP detected. Switching to phy-sfp-sgmii mode.")
            return 'phy-sfp-sgmii'

        # SGMII detection
        if self.detect_sgmii(ec):
            return 'phy-sfp-sgmii'

        # default 1000BASE-X
        return 'phy-sfp'





* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 19:01           ` ѽ҉ᶬḳ℠
@ 2020-01-09 19:42             ` ѽ҉ᶬḳ℠
  2020-01-09 21:38               ` Russell King - ARM Linux admin
  2020-01-09 21:59               ` Russell King - ARM Linux admin
  2020-01-09 21:34             ` Russell King - ARM Linux admin
  1 sibling, 2 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-09 19:42 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 09/01/2020 19:01, ѽ҉ᶬḳ℠ wrote:
> On 09/01/2020 17:43, Russell King - ARM Linux admin wrote:
>> On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote:
>>> Thank you for the extensive feedback and explanation.
>>>
>>> Pardon for having mixed up the semantics on module specifications 
>>> vs. EEPROM
>>> dump...
>>>
>>> The module (chipset) been designed by Metanoia, not sure who is the 
>>> actual
>>> manufacturer, and probably just been branded Allnet.
>>> The designer provides some proprietary management software (called 
>>> EBM) to
>>> their wholesale buyers only
>> I have one of their early MT-V5311 modules, but it has no accessible
>> EEPROM, and even if it did, it would be of no use to me being
>> unapproved for connection to the BT Openreach network.  (BT SIN 498
>> specifies non-standard power profile to avoid crosstalk issues with
>> existing ADSL infrastructure, and I believe they regularly check the
>> connected modem type and firmware versions against an approved list.)
>>
>> I haven't noticed the module I have asserting its TX_FAULT signal,
>> but then its RJ45 has never been connected to anything.
>>
>
> The curious (and sort of inexplicable) thing is that the module in 
> general works, i.e. at some point it must pass the sm checks or 
> connectivity would be failing constantly and thus the module being 
> generally unusable.
>
> The reported issues however are intermittent, usually reliably 
> reproducible with
>
> ifdown <iface> && ifup <iface>
>
> or rebooting the router that hosts the module.
>
> If some times passes, not sure but seems in excess of 3 minutes, 
> between ifdown and ifup the sm checks mostly are not failing.
> It somehow "feels" that the module is storing some link signal 
> information in a register which does not suit the sm check routine and 
> only when that register clears the sm check routine passes and 
> connectivity is restored.
> ____
>
> Since there are probably other such SFP modules, xDSL and g.fast, out 
> there that do not provide laser safety circuitry by design (since not 
> providing connectivity over fibre) would it perhaps not make sense to 
> try checking for the existence of laser safety circuitry first prior 
> getting to the sm checks?
> ____
>

I am wondering whether what is mentioned in 
https://gitlab.labs.nic.cz/turris/turris-build/issues/89 is perhaps the 
cause of the issue:

Even when/after the SFP module is recognized and the link mode is set 
for the NIC to the proper value, there can still be the link-up signal 
mismatch that we have seen on many non-ethernet SFPs. The thing is that 
one of the SFP pins is called LOS (loss of signal) and when the pin is 
in active state it is being interpreted by the Linux kernel as "link is 
down", turn off the NIC. Unfortunately we have seen a chicken-and-egg 
problem with some GPON and DSL SFPs - the SFP does not come up and 
deassert LOS unless there is SGMII link from NIC and NIC is not coming 
up unless LOS is deasserted.






* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 19:01           ` ѽ҉ᶬḳ℠
  2020-01-09 19:42             ` ѽ҉ᶬḳ℠
@ 2020-01-09 21:34             ` Russell King - ARM Linux admin
  1 sibling, 0 replies; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-09 21:34 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Thu, Jan 09, 2020 at 07:01:10PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 09/01/2020 17:43, Russell King - ARM Linux admin wrote:
> > On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > Thank you for the extensive feedback and explanation.
> > > 
> > > Pardon for having mixed up the semantics on module specifications vs. EEPROM
> > > dump...
> > > 
> > > The module (chipset) been designed by Metanoia, not sure who is the actual
> > > manufacturer, and probably just been branded Allnet.
> > > The designer provides some proprietary management software (called EBM) to
> > > their wholesale buyers only
> > I have one of their early MT-V5311 modules, but it has no accessible
> > EEPROM, and even if it did, it would be of no use to me being
> > unapproved for connection to the BT Openreach network.  (BT SIN 498
> > specifies non-standard power profile to avoid crosstalk issues with
> > existing ADSL infrastructure, and I believe they regularly check the
> > connected modem type and firmware versions against an approved list.)
> > 
> > I haven't noticed the module I have asserting its TX_FAULT signal,
> > but then its RJ45 has never been connected to anything.
> > 
> 
> The curious (and sort of inexplicable) thing is that the module in general
> works, i.e. at some point it must pass the sm checks or connectivity would
> be failing constantly and thus the module being generally unusable.

It all depends what the module does with the TX_FAULT signal.  The state
machine just follows what is laid down in the SFP MSA for dealing with
a transmit fault, although the attempts to clear it and the delay from
TX_FAULT being asserted to attempting to clear are decisions of my own.

It isn't a race in the state machine.

You can check the state of the GPIOs by looking at
/sys/kernel/debug/gpio, and you will probably see that TX_FAULT is
being asserted by the module.
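
For convenience, the GPIO listing can be filtered with a few lines of Python. This is only a sketch: the keyword "fault" is an assumption about how the TX_FAULT line is labelled, which depends on the kernel version and the board's device tree, and reading the file requires debugfs to be mounted and usually root privileges.

```python
def gpio_lines(text, keyword):
    """Return the lines of a debugfs GPIO listing that mention keyword."""
    return [line for line in text.splitlines() if keyword.lower() in line.lower()]


if __name__ == "__main__":
    # Needs debugfs mounted and usually root privileges.
    try:
        with open("/sys/kernel/debug/gpio") as listing:
            for line in gpio_lines(listing.read(), "fault"):
                print(line)
    except OSError as err:
        print("cannot read GPIO listing:", err)
```

Running it a few times around an ifdown/ifup cycle should show whether the line changes state when the messages appear.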

I'm aware of something similar with a certain GPON module, but we
haven't been able to properly work out what is going on there either -
again, it seems pretty random what the module does with the TX_FAULT
signal.

> It somehow "feels" that the module is storing some link signal information
> in a register which does not suit the sm check routine and only when that
> register clears the sm check routine passes and connectivity is restored.

You're reading /way/ too much into the state machine.  The state
machine is only concerned with two signals from the module.  One
of them is the RX_LOS signal which indicates whether the module is
receiving valid signal.  The other is TX_FAULT which is as I've
already described.  Both of these are digital signals - either they
are asserted or deasserted, and the state machine will act
accordingly.  It's rather simple.

> Since there are probably other such SFP modules, xDSL and g.fast, out there
> that do not provide laser safety circuitry by design (since not providing
> connectivity over fibre) would it perhaps not make sense to try checking for
> the existence of laser safety circuitry first prior getting to the sm
> checks?

There is no reliable way to do that; as I've already said, the EEPROM
contents is very hit and miss.  Essentially, SFPs suck, almost nothing
can be really trusted with them.

This, I believe, is why commercial grade routers have this apparent
"vendor lockin" because no one can trust anyone else's EEPROM contents
to actually come close to the SFP MSA requirements - and then you have
modules that blatantly violate the SFP MSA in respect of timings.

I would not be surprised if this module's behaviour with TX_FAULT is
along those same lines; the manufacturer has decided to use TX_FAULT
for some other purpose against the SFP MSA, which will cause problems
in any SFP MSA compliant host.

> Sometime in the past sfp.c was not available in the distro and the issue
> never exhibited. Back then the module's operations mode been set through a
> py script - see bottom - but it would appear that it did not implement any
> sm checks.

That python script is very simple.  It reads the EEPROM, and attempts
to work out what kind of link to use.  It doesn't care about any of
the SFP control and status signals.  It doesn't care if you yank the
SFP out of the cage.

BTW, I notice in your original kernel that you have at least one of my
"experimental" patches on your stable kernel taken from my "phy" branch
which has never been in mainline, so I guess you're using the OpenWRT
kernel?  I have submitted patches to bring the SFP state machine up to
what was in v5.4 (and a few extra bits) to the OpenWRT maintainers as
part of some commercial work.  As I say, I'm not expecting much to
change as a result of those given what you've reported thus far.

As I've said, I think it may need a quirk so we ignore the TX_FAULT
signal.  Sorting out a patch to do that for a 4.19.xx kernel is not
going to happen soon, as the hardware I was building the OpenWRT
kernel on isn't in a functional state at the moment - and given the
unknown status of the previously submitted patches as well, I'm not
inclined to produce any further patches for OpenWRT at the moment.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 19:42             ` ѽ҉ᶬḳ℠
@ 2020-01-09 21:38               ` Russell King - ARM Linux admin
  2020-01-09 21:59               ` Russell King - ARM Linux admin
  1 sibling, 0 replies; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-09 21:38 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Thu, Jan 09, 2020 at 07:42:27PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 09/01/2020 19:01, ѽ҉ᶬḳ℠ wrote:
> > On 09/01/2020 17:43, Russell King - ARM Linux admin wrote:
> > > On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > > Thank you for the extensive feedback and explanation.
> > > > 
> > > > Pardon for having mixed up the semantics on module
> > > > specifications vs. EEPROM
> > > > dump...
> > > > 
> > > > The module (chipset) been designed by Metanoia, not sure who is
> > > > the actual
> > > > manufacturer, and probably just been branded Allnet.
> > > > The designer provides some proprietary management software
> > > > (called EBM) to
> > > > their wholesale buyers only
> > > I have one of their early MT-V5311 modules, but it has no accessible
> > > EEPROM, and even if it did, it would be of no use to me being
> > > unapproved for connection to the BT Openreach network.  (BT SIN 498
> > > specifies non-standard power profile to avoid crosstalk issues with
> > > existing ADSL infrastructure, and I believe they regularly check the
> > > connected modem type and firmware versions against an approved list.)
> > > 
> > > I haven't noticed the module I have asserting its TX_FAULT signal,
> > > but then its RJ45 has never been connected to anything.
> > > 
> > 
> > The curious (and sort of inexplicable) thing is that the module in
> > general works, i.e. at some point it must pass the sm checks or
> > connectivity would be failing constantly and thus the module being
> > generally unusable.
> > 
> > The reported issues however are intermittent, usually reliably
> > reproducible with
> > 
> > ifdown <iface> && ifup <iface>
> > 
> > or rebooting the router that hosts the module.
> > 
> > If some times passes, not sure but seems in excess of 3 minutes, between
> > ifdown and ifup the sm checks mostly are not failing.
> > It somehow "feels" that the module is storing some link signal
> > information in a register which does not suit the sm check routine and
> > only when that register clears the sm check routine passes and
> > connectivity is restored.
> > ____
> > 
> > Since there are probably other such SFP modules, xDSL and g.fast, out
> > there that do not provide laser safety circuitry by design (since not
> > providing connectivity over fibre) would it perhaps not make sense to
> > try checking for the existence of laser safety circuitry first prior
> > getting to the sm checks?
> > ____
> > 
> 
> I am wondering whether this mentioned in
> https://gitlab.labs.nic.cz/turris/turris-build/issues/89 is the cause of the
> issue perhaps:
> 
> Even when/after the SFP module is recognized and the link mode it set for
> the NIC to the proper value there can still be the link-up signal mismatch
> that we have seen on many non-ethernet SFPs. The thing is that one of the
> SFP pins is called LOS (loss of signal) and when the pin is in active state
> it is being interpreted by the Linux kernel as "link is down", turn off the
> NIC. Unfortunatelly we have seen chicken-and-egg problem with some GPON and
> DSL SFPs - the SFP does not come up and deassert LOS unless there is SGMII
> link from NIC and NIC is not coming up unless LOS is deasserted.

That would be very very broken behaviour, but one which the kernel
doesn't care about.

If RX_LOS is active, we do *not* disable the NIC. We just use RX_LOS as
an additional input to evaluating whether the link is up.  The NIC will
still be configured for the appropriate mode irrespective of the state
of RX_LOS.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 19:42             ` ѽ҉ᶬḳ℠
  2020-01-09 21:38               ` Russell King - ARM Linux admin
@ 2020-01-09 21:59               ` Russell King - ARM Linux admin
  2020-01-09 22:40                 ` ѽ҉ᶬḳ℠
  1 sibling, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-09 21:59 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Thu, Jan 09, 2020 at 07:42:27PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 09/01/2020 19:01, ѽ҉ᶬḳ℠ wrote:
> > On 09/01/2020 17:43, Russell King - ARM Linux admin wrote:
> > > On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > > Thank you for the extensive feedback and explanation.
> > > > 
> > > > Pardon for having mixed up the semantics on module
> > > > specifications vs. EEPROM
> > > > dump...
> > > > 
> > > > The module (chipset) been designed by Metanoia, not sure who is
> > > > the actual
> > > > manufacturer, and probably just been branded Allnet.
> > > > The designer provides some proprietary management software
> > > > (called EBM) to
> > > > their wholesale buyers only
> > > I have one of their early MT-V5311 modules, but it has no accessible
> > > EEPROM, and even if it did, it would be of no use to me being
> > > unapproved for connection to the BT Openreach network.  (BT SIN 498
> > > specifies non-standard power profile to avoid crosstalk issues with
> > > existing ADSL infrastructure, and I believe they regularly check the
> > > connected modem type and firmware versions against an approved list.)
> > > 
> > > I haven't noticed the module I have asserting its TX_FAULT signal,
> > > but then its RJ45 has never been connected to anything.
> > > 
> > 
> > The curious (and sort of inexplicable) thing is that the module in
> > general works, i.e. at some point it must pass the sm checks or
> > connectivity would be failing constantly and thus the module being
> > generally unusable.
> > 
> > The reported issues however are intermittent, usually reliably
> > reproducible with
> > 
> > ifdown <iface> && ifup <iface>
> > 
> > or rebooting the router that hosts the module.
> > 
> > If some times passes, not sure but seems in excess of 3 minutes, between
> > ifdown and ifup the sm checks mostly are not failing.
> > It somehow "feels" that the module is storing some link signal
> > information in a register which does not suit the sm check routine and
> > only when that register clears the sm check routine passes and
> > connectivity is restored.
> > ____
> > 
> > Since there are probably other such SFP modules, xDSL and g.fast, out
> > there that do not provide laser safety circuitry by design (since not
> > providing connectivity over fibre) would it perhaps not make sense to
> > try checking for the existence of laser safety circuitry first prior
> > getting to the sm checks?
> > ____
> > 
> 
> I am wondering whether this mentioned in
> https://gitlab.labs.nic.cz/turris/turris-build/issues/89 is the cause of the
> issue perhaps:
> 
> Even when/after the SFP module is recognized and the link mode it set for
> the NIC to the proper value there can still be the link-up signal mismatch
> that we have seen on many non-ethernet SFPs. The thing is that one of the
> SFP pins is called LOS (loss of signal) and when the pin is in active state
> it is being interpreted by the Linux kernel as "link is down", turn off the
> NIC. Unfortunatelly we have seen chicken-and-egg problem with some GPON and
> DSL SFPs - the SFP does not come up and deassert LOS unless there is SGMII
> link from NIC and NIC is not coming up unless LOS is deasserted.

Also, note that the Metanoia MT-V5311 (at least mine) uses 1000BASE-X
not SGMII. It sends a 16-bit configuration word of 0x61a0, which is:

		1000BASE-X			SGMII
Bit 15	0	No next page			Link down
	1	Ack				Ack
	1	Remote fault 2			Reserved (0)
	0	Remote fault 1			Duplex (0 = Half)

	0	Reserved (0)			Speed bit 1
	0	Reserved (0)			Speed bit 0 (00=10Mbps)
	0	Reserved (0)			Reserved (0)
	1	Asymmetric pause direction	Reserved (0)

	1	Pause				Reserved (0)
	0	Half duplex not supported	Reserved (0)
	1	Full duplex supported		Reserved (0)
	0	Reserved (0)			Reserved (0)

	0	Reserved (0)			Reserved (0)
	0	Reserved (0)			Reserved (0)
	0	Reserved (0)			Reserved (0)
Bit 0	0	Reserved (0)			Must be 1

So it clearly fits 802.3 Clause 37 1000BASE-X format, reporting 1G
Full duplex, and not SGMII (10M Half duplex).

I have a platform here that allows me to get at the raw config_reg
word that the other end has sent which allows analysis as per the
above.
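
As a rough illustration of the decode above, the following sketch splits a
16-bit configuration word both ways. The bit positions are the ones in the
table; the function and dictionary names are made up for this example and
are not taken from any kernel source.

```python
# Bit positions of the 802.3 Clause 37 (1000BASE-X) configuration word,
# as laid out in the table above. Names are illustrative only.
C37_FIELDS = {
    "next_page": 15,
    "ack": 14,
    "remote_fault_2": 13,
    "remote_fault_1": 12,
    "asym_pause": 8,
    "pause": 7,
    "half_duplex": 6,
    "full_duplex": 5,
}

# SGMII speed field (bits 11:10).
SGMII_SPEEDS = {0: "10Mbps", 1: "100Mbps", 2: "1000Mbps", 3: "reserved"}


def decode_1000basex(word):
    """Interpret a config word using the 1000BASE-X bit layout."""
    return {name: bool((word >> bit) & 1) for name, bit in C37_FIELDS.items()}


def decode_sgmii(word):
    """Interpret the same config word using the SGMII bit layout."""
    return {
        "link_up": bool((word >> 15) & 1),
        "ack": bool((word >> 14) & 1),
        "full_duplex": bool((word >> 12) & 1),
        "speed": SGMII_SPEEDS[(word >> 10) & 3],
        "bit0_set": bool(word & 1),  # SGMII requires this bit to be 1
    }


if __name__ == "__main__":
    print(decode_1000basex(0x61A0))
    print(decode_sgmii(0x61A0))
```

Read as SGMII, 0x61a0 would mean link down, 10M half duplex, with bit 0
clear, which is why the 1000BASE-X reading is the sensible one.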


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 21:59               ` Russell King - ARM Linux admin
@ 2020-01-09 22:40                 ` ѽ҉ᶬḳ℠
  2020-01-09 23:10                   ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-09 22:40 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev


On 09/01/2020 21:59, Russell King - ARM Linux admin wrote:
>
> Also, note that the Metanoia MT-V5311 (at least mine) uses 1000BASE-X
> not SGMII. It sends a 16-bit configuration word of 0x61a0, which is:
>
> 		1000BASE-X			SGMII
> Bit 15	0	No next page			Link down
> 	1	Ack				Ack
> 	1	Remote fault 2			Reserved (0)
> 	0	Remote fault 1			Duplex (0 = Half)
>
> 	0	Reserved (0)			Speed bit 1
> 	0	Reserved (0)			Speed bit 0 (00=10Mbps)
> 	0	Reserved (0)			Reserved (0)
> 	1	Asymetric pause direction	Reserved (0)
>
> 	1	Pause				Reserved (0)
> 	0	Half duplex not supported	Reserved (0)
> 	1	Full duplex supported		Reserved (0)
> 	0	Reserved (0)			Reserved (0)
>
> 	0	Reserved (0)			Reserved (0)
> 	0	Reserved (0)			Reserved (0)
> 	0	Reserved (0)			Reserved (0)
> Bit 0	0	Reserved (0)			Must be 1
>
> So it clearly fits 802.3 Clause 37 1000BASE-X format, reporting 1G
> Full duplex, and not SGMII (10M Half duplex).
>
> I have a platform here that allows me to get at the raw config_reg
> word that the other end has sent which allows analysis as per the
> above.
>

The driver also reports 1000base-x for this Metanoia/Allnet module:

mvneta f1034000.ethernet eth2: switched to inband/1000base-x link mode

mii-tool -v eth2 produces:

eth2: 1000 Mbit, full duplex, link ok
   product info: vendor 00:00:00, model 0 rev 0
   basic mode:   10 Mbit, full duplex
   basic status: autonegotiation complete, link ok
   capabilities:
   advertising:  1000baseT-HD 1000baseT-FD 100baseT4 100baseTx-FD 
100baseTx-HD 10baseT-FD 10baseT-HD flow-control
______

On 09/01/2020 21:34, Russell King - ARM Linux admin wrote:
> You can check the state of the GPIOs by looking at
> /sys/kernel/debug/gpio, and you will probably see that TX_FAULT is
> being asserted by the module.

With OpenWrt trying to save space wherever it can

# CONFIG_DEBUG_GPIO is not set

this avenue is unfortunately not available. Is there some other way
(from Linux userland) to query TX_FAULT and RX_LOS and see whether
either or both are asserted or deasserted?
___

On 09/01/2020 21:34, Russell King - ARM Linux admin wrote:
> BTW, I notice in you original kernel that you have at least one of my
> "experimental" patches on your stable kernel taken from my "phy" branch
> which has never been in mainline, so I guess you're using the OpenWRT
> kernel?
I am not aware where the code originated from. It is not exactly OpenWrt
but TOS (for the Turris Omnia router), a downstream patchset that builds
on top of OpenWrt. The TOS developers might be known in Linux kernel
development, having recently added their MOX platform and also with
regard to Multi-CPU-DSA.
___

On 09/01/2020 21:34, Russell King - ARM Linux admin wrote:
> You're reading /way/ too much into the state machine.

How so? Those intermittent failures cause disruption in the WAN
connectivity - nothing life threatening but somewhat inconvenient.
I am trying to get to the bottom of it with my limited capabilities, and
your input has helped. I will ping Allnet again and see whether they
bother to respond and shed some light on what their module does with
regard to TX_FAULT and RX_LOS.





* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 22:40                 ` ѽ҉ᶬḳ℠
@ 2020-01-09 23:10                   ` Russell King - ARM Linux admin
  2020-01-09 23:50                     ` ѽ҉ᶬḳ℠
  0 siblings, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-09 23:10 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Thu, Jan 09, 2020 at 10:40:24PM +0000, ѽ҉ᶬḳ℠ wrote:
> 
> On 09/01/2020 21:59, Russell King - ARM Linux admin wrote:
> > 
> > Also, note that the Metanoia MT-V5311 (at least mine) uses 1000BASE-X
> > not SGMII. It sends a 16-bit configuration word of 0x61a0, which is:
> > 
> > 		1000BASE-X			SGMII
> > Bit 15	0	No next page			Link down
> > 	1	Ack				Ack
> > 	1	Remote fault 2			Reserved (0)
> > 	0	Remote fault 1			Duplex (0 = Half)
> > 
> > 	0	Reserved (0)			Speed bit 1
> > 	0	Reserved (0)			Speed bit 0 (00=10Mbps)
> > 	0	Reserved (0)			Reserved (0)
> > 	1	Asymetric pause direction	Reserved (0)
> > 
> > 	1	Pause				Reserved (0)
> > 	0	Half duplex not supported	Reserved (0)
> > 	1	Full duplex supported		Reserved (0)
> > 	0	Reserved (0)			Reserved (0)
> > 
> > 	0	Reserved (0)			Reserved (0)
> > 	0	Reserved (0)			Reserved (0)
> > 	0	Reserved (0)			Reserved (0)
> > Bit 0	0	Reserved (0)			Must be 1
> > 
> > So it clearly fits 802.3 Clause 37 1000BASE-X format, reporting 1G
> > Full duplex, and not SGMII (10M Half duplex).
> > 
> > I have a platform here that allows me to get at the raw config_reg
> > word that the other end has sent which allows analysis as per the
> > above.
> > 
> 
> The driver reports also 1000base-x for this Metonia/Allnet module:
> 
> mvneta f1034000.ethernet eth2: switched to inband/1000base-x link mode
> 
> mii-tool -v eth2 producing
> 
> eth2: 1000 Mbit, full duplex, link ok
>   product info: vendor 00:00:00, model 0 rev 0
>   basic mode:   10 Mbit, full duplex
>   basic status: autonegotiation complete, link ok
>   capabilities:
>   advertising:  1000baseT-HD 1000baseT-FD 100baseT4 100baseTx-FD
> 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

Please don't use mii-tool with SFPs that do not have a PHY; the "PHY"
registers are emulated, and are there just for compatibility. Please
use ethtool in preference, especially for SFPs.

> On 09/01/2020 21:34, Russell King - ARM Linux admin wrote:
> > You can check the state of the GPIOs by looking at
> > /sys/kernel/debug/gpio, and you will probably see that TX_FAULT is
> > being asserted by the module.
> 
> With OpenWrt trying to save space wherever they can
> 
> # CONFIG_DEBUG_GPIO is not set
> 
> this avenue is unfortunately is not available. Is there some other way
> (Linux userland) to query TX_FAULT and RX_LOS and whether either/both being
> asserted or deasserted?

CONFIG_DEBUG_GPIO is not the same as having debugfs support enabled.
If debugfs is enabled, then gpiolib will provide the current state
of gpios through debugfs.  debugfs is normally mounted on
/sys/kernel/debug, but may not be mounted by default depending on
policy.  Looking in /proc/filesystems will tell you definitively
whether debugfs is enabled or not in the kernel.
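
A minimal sketch of that check, for anyone who wants to script it: it only
looks for a "debugfs" entry in the text of /proc/filesystems. The function
name is invented for this example.

```python
import os


def has_debugfs(filesystems_text):
    """Return True if the /proc/filesystems text lists debugfs.

    Each line is "<flag>\t<fstype>", where the flag column is either
    "nodev" or empty, so the filesystem name is the last field.
    """
    for line in filesystems_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "debugfs":
            return True
    return False


if __name__ == "__main__":
    path = "/proc/filesystems"
    if os.path.exists(path):
        with open(path) as f:
            print("debugfs enabled:", has_debugfs(f.read()))
```

If this reports that debugfs is enabled but /sys/kernel/debug is empty, the
filesystem is most likely just not mounted.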

> On 09/01/2020 21:34, Russell King - ARM Linux admin wrote:
> > BTW, I notice in you original kernel that you have at least one of my
> > "experimental" patches on your stable kernel taken from my "phy" branch
> > which has never been in mainline, so I guess you're using the OpenWRT
> > kernel?
> I am not aware were the code originated from. It is not exactly OpenWrt but
> TOS (for the Turris Omnia router), being a downstream patchset that builds
> on top of OpenWrt. The TOS developers might be known at Linux kernel
> development, recently added their MOX platform and also with regard to
> Multi-CPU-DSA.

So, if that is correct...

Current OpenWRT is derived from 4.19-stable kernels, which include
experimental patches picked at some point from my "phy" branch, and
TOS is derived from OpenWRT.

That makes it very difficult for anyone in the mainline kernel
community to do anything about this; sending you a patch is likely
useless since you're not going to be able to test it.

> On 09/01/2020 21:34, Russell King - ARM Linux admin wrote:
> > You're reading /way/ too much into the state machine.
> 
> How so? Those intermittent failures cause disruption in the WAN connectivity
> - nothing life threatening but somewhat inconvenient.

You think the state machines are doing something clever. They aren't.
They are all very simple and quite dumb.

> I am trying to get to the bottom of it, with my limited capabilities and
> with your input it has helped. I will ping Allnet again and see whether they
> bother to respond and shed some light of what their modules does with regard
> to TX_FAULT and RX_LOS.

The only real way to get to the bottom of it is to manually enable
debug in sfp.c so it's possible to watch what happens, not only with
the hardware signals but also what the state machines are doing.
However, I'm very certain that there is no problem with the state
machines, and it is that the Allnet module is raising TX_FAULT.

I also think from what you've said above that rebuilding a kernel
to enable debug in sfp.c is not going to be possible for you.


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 23:10                   ` Russell King - ARM Linux admin
@ 2020-01-09 23:50                     ` ѽ҉ᶬḳ℠
  2020-01-10  0:18                       ` ѽ҉ᶬḳ℠
  2020-01-10  9:27                       ` Russell King - ARM Linux admin
  0 siblings, 2 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-09 23:50 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev


On 09/01/2020 23:10, Russell King - ARM Linux admin wrote:
>
> Please don't use mii-tool with SFPs that do not have a PHY; the "PHY"
> registers are emulated, and are there just for compatibility. Please
> use ethtool in preference, especially for SFPs.

Sure, but ethtool is not much help for this particular matter; all
there is is ethtool -m, and according to you the EEPROM dump is not to
be relied on.

>
> CONFIG_DEBUG_GPIO is not the same as having debugfs support enabled.
> If debugfs is enabled, then gpiolib will provide the current state
> of gpios through debugfs.  debugfs is normally mounted on
> /sys/kernel/debug, but may not be mounted by default depending on
> policy.  Looking in /proc/filesystems will tell you definitively
> whether debugfs is enabled or not in the kernel.
debugfs is mounted, but ls -af /sys/kernel/debug/gpio (oddly) produces
only:

/sys/kernel/debug/gpio

>
> So, if that is correct...
>
> Current OpenWRT is derived from 4.19-stable kernels, which include
> experimental patches picked at some point from my "phy" branch, and
> TOS is derived from OpenWRT.

This may not be correct, since there are not many device targets in
OpenWrt that feature an SFP cage (at least as of today); the Turris
Omnia might even be the sole one.
I did not check whether the code was/is available in OpenWrt, and
likely it is not, but it was in an earlier TOS version since their
platforms apparently feature an SFP cage.
> That makes it very difficult for anyone in the mainline kernel
> community to do anything about this; sending you a patch is likely
> useless since you're not going to be able to test it.

I understand. I reached out all the way upstream only because the other
available avenues, starting all the way downstream, did not produce
anything tangible or even a response.
I am grateful that you at least obliged and shed some light on the
matter. Maybe I should just try finding a module that is declared
SFP MSA conformant...

>
> You think the state machines are doing something clever. They don't.
> They are all very simple and quite dumb.

Not really, I assume it just does what it is supposed to do in line with 
current (industry) standards and best practices.

>
> The only real way to get to the bottom of it is to manually enable
> debug in sfp.c so its possible to watch what happens, not only with
> the hardware signals but also what the state machines are doing.
> However, I'm very certain that there is no problem with the state
> machines, and it is that the Allnet module is raising TX_FAULT.

I am sure it does, and I am pursuing Allnet for a response, although it
is not looking promising at the moment. Once there is one, however, I
shall pick up the thread again.

> I also think from what you've said above that rebuilding a kernel
> to enable debug in sfp.c is going to not be possible for you.

No. I might be able to get this done for amd64, but with this ARM SoC
there are all kinds of other things (SPI, MTD, I2C, u-boot and whatnot)
involved, and I am afraid it will go sideways if I attempt compiling.



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 23:50                     ` ѽ҉ᶬḳ℠
@ 2020-01-10  0:18                       ` ѽ҉ᶬḳ℠
  2020-01-10 10:26                         ` Russell King - ARM Linux admin
  2020-01-10  9:27                       ` Russell King - ARM Linux admin
  1 sibling, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10  0:18 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 09/01/2020 23:50, ѽ҉ᶬḳ℠ wrote:
> Maybe I should just try finding a module that is declared SPF MSA 
> conform... 

Actually, the vendor declares
(https://www.allnet.de/en/allnet-brand/produkte/neuheiten/p/-0c35cc9ea9/):

*ALLNET ALL4781-VDSL2-SFP* is a VDSL2 SFP modem that interconnects with 
Gateway Processor by using a MSA (MultiSource Agreement) compliant hot 
pluggable electrical interface.

OK, "a MSA" does not explicitly state or imply the SFP MSA, but what
other MSA could that be?
If it is indeed SFP MSA conformant, the issue should not happen. Unless
it is just marketing speak and does not hold true.



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 23:50                     ` ѽ҉ᶬḳ℠
  2020-01-10  0:18                       ` ѽ҉ᶬḳ℠
@ 2020-01-10  9:27                       ` Russell King - ARM Linux admin
  2020-01-10  9:50                         ` ѽ҉ᶬḳ℠
  1 sibling, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10  9:27 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Thu, Jan 09, 2020 at 11:50:14PM +0000, ѽ҉ᶬḳ℠ wrote:
> 
> On 09/01/2020 23:10, Russell King - ARM Linux admin wrote:
> > 
> > Please don't use mii-tool with SFPs that do not have a PHY; the "PHY"
> > registers are emulated, and are there just for compatibility. Please
> > use ethtool in preference, especially for SFPs.
> 
> Sure, just ethtool is not much of help for this particular matter, all there
> is ethtool -m and according to you the EEPROM dump is not to be relied on.

How about just "ethtool eth2" ?

> > CONFIG_DEBUG_GPIO is not the same as having debugfs support enabled.
> > If debugfs is enabled, then gpiolib will provide the current state
> > of gpios through debugfs.  debugfs is normally mounted on
> > /sys/kernel/debug, but may not be mounted by default depending on
> > policy.  Looking in /proc/filesystems will tell you definitively
> > whether debugfs is enabled or not in the kernel.
> debugsfs is mounted but ls -af /sys/kernel/debug/gpio only producing
> (oddly):
> 
> /sys/kernel/debug/gpio

Try "cat /sys/kernel/debug/gpio"

> > So, if that is correct...
> > 
> > Current OpenWRT is derived from 4.19-stable kernels, which include
> > experimental patches picked at some point from my "phy" branch, and
> > TOS is derived from OpenWRT.
> 
> This may not be correct since there are not many device targets in OpenWrt
> that feature a SFP cage (least as of today), the Turris Omnia might even be
> the sole one.

It isn't; there are definitely platforms that run OpenWRT that also
have SFP cages (even a pair of them) and that make use of this code.


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10  9:27                       ` Russell King - ARM Linux admin
@ 2020-01-10  9:50                         ` ѽ҉ᶬḳ℠
  2020-01-10 10:19                           ` ѽ҉ᶬḳ℠
  2020-01-10 11:44                           ` Russell King - ARM Linux admin
  0 siblings, 2 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10  9:50 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev


On 10/01/2020 09:27, Russell King - ARM Linux admin wrote:
> On Thu, Jan 09, 2020 at 11:50:14PM +0000, ѽ҉ᶬḳ℠ wrote:
>> On 09/01/2020 23:10, Russell King - ARM Linux admin wrote:
>>> Please don't use mii-tool with SFPs that do not have a PHY; the "PHY"
>>> registers are emulated, and are there just for compatibility. Please
>>> use ethtool in preference, especially for SFPs.
>> Sure, just ethtool is not much of help for this particular matter, all there
>> is ethtool -m and according to you the EEPROM dump is not to be relied on.
> How about just "ethtool eth2" ?

Settings for eth2:
         Supported ports: [ TP ]
         Supported link modes:   1000baseX/Full
         Supported pause frame use: Symmetric
         Supports auto-negotiation: Yes
         Supported FEC modes: Not reported
         Advertised link modes:  1000baseX/Full
         Advertised pause frame use: Symmetric
         Advertised auto-negotiation: Yes
         Advertised FEC modes: Not reported
         Speed: 1000Mb/s
         Duplex: Full
         Port: Twisted Pair
         PHYAD: 0
         Transceiver: internal
         Auto-negotiation: on
         MDI-X: Unknown
         Supports Wake-on: d
         Wake-on: d
         Link detected: yes

And i2cdetect -l

i2c-3   i2c             i2c-0-mux (chan_id 2)                   I2C adapter
i2c-1   i2c             i2c-0-mux (chan_id 0)                   I2C adapter
i2c-8   i2c             i2c-0-mux (chan_id 7)                   I2C adapter
i2c-6   i2c             i2c-0-mux (chan_id 5)                   I2C adapter
i2c-4   i2c             i2c-0-mux (chan_id 3)                   I2C adapter
i2c-2   i2c             i2c-0-mux (chan_id 1)                   I2C adapter
i2c-0   i2c             mv64xxx_i2c adapter                     I2C adapter
i2c-7   i2c             i2c-0-mux (chan_id 6)                   I2C adapter
i2c-5   i2c             i2c-0-mux (chan_id 4)                   I2C adapter

>
>>> CONFIG_DEBUG_GPIO is not the same as having debugfs support enabled.
>>> If debugfs is enabled, then gpiolib will provide the current state
>>> of gpios through debugfs.  debugfs is normally mounted on
>>> /sys/kernel/debug, but may not be mounted by default depending on
>>> policy.  Looking in /proc/filesystems will tell you definitively
>>> whether debugfs is enabled or not in the kernel.
>> debugsfs is mounted but ls -af /sys/kernel/debug/gpio only producing
>> (oddly):
>>
>> /sys/kernel/debug/gpio
> Try "cat /sys/kernel/debug/gpio"

gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
  gpio-504 (                    |tx-fault            ) in  lo IRQ
  gpio-505 (                    |tx-disable          ) out lo
  gpio-506 (                    |rate-select0        ) in  lo
  gpio-507 (                    |los                 ) in  lo IRQ
  gpio-508 (                    |mod-def0            ) in  lo IRQ

___

Probably unrelated, but for posterity, I noticed this in the logs:

pca953x 8-0071: 8-0071 supply vcc not found, using dummy regulator
___
Meanwhile Allnet responded, which basically amounts to blame ping-pong
(it is not us, go and look over there instead...):

- driver support is not handled by Allnet but by Metanoia, the latter
being the designer and manufacturer
- Allnet does not have the buying power to persuade Metanoia to look
into the matter
- it would appear that sfp.c is trying to communicate with a Fiber-GBIC
and fails since the signal reports may not be 100% compatible

Their feedback about SFP MSA conformity is still pending.



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10  9:50                         ` ѽ҉ᶬḳ℠
@ 2020-01-10 10:19                           ` ѽ҉ᶬḳ℠
  2020-01-10 11:46                             ` Russell King - ARM Linux admin
  2020-01-10 13:22                             ` Andrew Lunn
  2020-01-10 11:44                           ` Russell King - ARM Linux admin
  1 sibling, 2 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 10:19 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

I just came across this 
http://edgemax5.rssing.com/chan-66822975/all_p1715.html#item34298

and, albeit for an SFP g.fast module, it indicates/implies that Metanoia
provides its own Linux drivers (supposedly GPL licensed), plus some bits
pertaining to the EBM (Ethernet Boot Management protocol).

Has Metanoia submitted any SFP drivers to upstream kernel development?



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10  0:18                       ` ѽ҉ᶬḳ℠
@ 2020-01-10 10:26                         ` Russell King - ARM Linux admin
  0 siblings, 0 replies; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 10:26 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 12:18:41AM +0000, ѽ҉ᶬḳ℠ wrote:
> On 09/01/2020 23:50, ѽ҉ᶬḳ℠ wrote:
> > Maybe I should just try finding a module that is declared SPF MSA
> > conform...
> 
> Actually, the vendors declares
> (https://www.allnet.de/en/allnet-brand/produkte/neuheiten/p/-0c35cc9ea9/):
> 
> *ALLNET ALL4781-VDSL2-SFP* is a VDSL2 SFP modem that interconnects with
> Gateway Processor by using a MSA (MultiSource Agreement) compliant hot
> pluggable electrical interface.
> 
> Ok, "a MSA" does not explicitly state/imply SFP MSA but what other MSA could
> that be?
> If it is indeed SFP MSA conform the issue should not happen. Unless it is
> just marketing speak and does not hold true.

Everyone claims that their SFP is MSA compliant, even when the module:

1) takes 40-50 seconds after deasserting TX_DISABLE to initialise and
   deassert TX_FAULT, when the SFP MSA explicitly states a limit of
   300ms (t_init) for TX_FAULT to deassert.

2) EEPROM does not respond for 50 seconds after plugging in, where the
   SFP MSA explicitly states 300ms (t_serial) maximum.

3) EEPROM contains incorrect data, for example:
   - indicating the module has a LC connector, yet it has an RJ45, or
     vice versa.
   - indicating NRZ encoding for an ethernet SFP, where it should be
     8b10b or 64b66b encoding.
   - indicating a single data rate, or even the wrong data rate, when
     the module is documented as supporting other rates.
   - indicating an extended compliance technology that it doesn't
     support, presumably originally chosen when the number was
     unallocated by SFF-8024.
   - claiming to support 1000BASE-SX, a fiber standard, when the
     module is actually for VDSL2 over copper.

... etc ...

So, I tend to ignore "SFP MSA compliant" whenever I see it; it is
mostly meaningless.  Yes, there are modules out there which are
compliant, but those that claim compliance but aren't make the
claim meaningless for everyone.
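
To put numbers on points 1 and 2, here is a small sketch that compares
measured start-up times against the two 300ms limits quoted above. The
dictionary layout and function name are made up for illustration, and the
measured values in the usage line are only examples in the range mentioned
above, not measurements of any particular module.

```python
# Limits quoted above from the SFP MSA, in milliseconds.
MSA_LIMITS_MS = {
    "t_init": 300,    # TX_DISABLE deasserted -> TX_FAULT deasserted
    "t_serial": 300,  # module plugged in -> EEPROM responds
}


def msa_timing_violations(measured_ms):
    """Return the measured timings that exceed their MSA limit."""
    return {
        name: value
        for name, value in measured_ms.items()
        if name in MSA_LIMITS_MS and value > MSA_LIMITS_MS[name]
    }


if __name__ == "__main__":
    # Example values only: a module taking 45s to initialise.
    print(msa_timing_violations({"t_init": 45000, "t_serial": 120}))
```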


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10  9:50                         ` ѽ҉ᶬḳ℠
  2020-01-10 10:19                           ` ѽ҉ᶬḳ℠
@ 2020-01-10 11:44                           ` Russell King - ARM Linux admin
  2020-01-10 12:45                             ` ѽ҉ᶬḳ℠
  1 sibling, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 11:44 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 09:50:00AM +0000, ѽ҉ᶬḳ℠ wrote:
> On 10/01/2020 09:27, Russell King - ARM Linux admin wrote:
> > On Thu, Jan 09, 2020 at 11:50:14PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > On 09/01/2020 23:10, Russell King - ARM Linux admin wrote:
> > > > Please don't use mii-tool with SFPs that do not have a PHY; the "PHY"
> > > > registers are emulated, and are there just for compatibility. Please
> > > > use ethtool in preference, especially for SFPs.
> > > Sure, just ethtool is not much of help for this particular matter, all there
> > > is ethtool -m and according to you the EEPROM dump is not to be relied on.
> > How about just "ethtool eth2" ?
> 
> Settings for eth2:
>         Supported ports: [ TP ]
>         Supported link modes:   1000baseX/Full
>         Supported pause frame use: Symmetric
>         Supports auto-negotiation: Yes
>         Supported FEC modes: Not reported
>         Advertised link modes:  1000baseX/Full
>         Advertised pause frame use: Symmetric
>         Advertised auto-negotiation: Yes
>         Advertised FEC modes: Not reported
>         Speed: 1000Mb/s
>         Duplex: Full
>         Port: Twisted Pair
>         PHYAD: 0
>         Transceiver: internal
>         Auto-negotiation: on
>         MDI-X: Unknown
>         Supports Wake-on: d
>         Wake-on: d
>         Link detected: yes

That looks fine.

> > > > CONFIG_DEBUG_GPIO is not the same as having debugfs support enabled.
> > > > If debugfs is enabled, then gpiolib will provide the current state
> > > > of gpios through debugfs.  debugfs is normally mounted on
> > > > /sys/kernel/debug, but may not be mounted by default depending on
> > > > policy.  Looking in /proc/filesystems will tell you definitively
> > > > whether debugfs is enabled or not in the kernel.
> > > debugfs is mounted but ls -af /sys/kernel/debug/gpio only produces
> > > (oddly):
> > > 
> > > /sys/kernel/debug/gpio
> > Try "cat /sys/kernel/debug/gpio"
> 
> gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
>  gpio-504 (                    |tx-fault            ) in  lo IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ

Which is also indicating everything is correct.  When the problem
occurs, check the state of the signals again as close as possible
to the event - it depends how long the transceiver keeps it
asserted.  You will probably find tx-fault is indicating
"in  hi IRQ".

> Meantime Allnet responded, which basically sums up to (blame ping pong - it
> is not me but go and look there instead...)
> 
> - driver support is not being handled by Allnet but by Metanoia, latter
> being designer and manufacturer
> - Allnet does not have the buying power to persuade Metanoia to look into
> the matter

... which is pretty standard; no one will rework their SFP unless
they fear their sales will be severely impacted by the issue.

> - it would appear that SFP.C is trying to communicate with Fiber-GBIC and
> fails since the signal reports may not be 100% compatible

That's a fun claim, but note carefully the wording "may" which implies
some uncertainty in the statement.

Let's look at the wording of the GBIC (SFF-8053) and SFP (INF-8074 -
SFP MSA) documents.  The wording for the "fault recovery" is identical
between the two, which concerns what happens when TX_FAULT is asserted
and how to recover from that.

Concerning the implementation of TX_FAULT, SFF-8053 states:

  If no transmitter safety circuitry is implemented, the TX_FAULT signal
  may be tied to its negated state.

but then says later in the document:

  If TX_FAULT is not implemented, the signal shall be held to the low
  state by the GBIC.

Meanwhile, INF-8074 similarly states:

  If no transmitter safety circuitry is implemented, the TX_FAULT signal
  may be tied to its negated state.

but later on has a similar statement:

  TX_FAULT shall be implemented by those module definitions of SFP
  transceiver supporting safety circuitry. If TX_FAULT is not
  implemented, the signal shall be held to the low state by the SFP
  transceiver.

"shall" in both cases is stronger than "may".  So, there seems to be
little difference between the GBIC and SFP usage of this signal.

Their claim is that sfp.c implements the older GBIC style of signal
reports.  My counter-claim is that (a) sfp.c is written to the SFP MSA
and not the GBIC standard, and (b) there is no difference as far as the
TX_FAULT signal is concerned between the GBIC standard and the SFP MSA.

But... it doesn't matter that much, there's a module out there (and it
isn't the only one) which does "funny stuff" with its TX_FAULT signal.
Either we decide we want to support it and implement a quirk, or we
decide we don't want to support it.

There is an option bit in the EEPROM that is supposed to indicate
whether the module supports TX_FAULT, but, as you can guess, there are
problems with using that, as:

1) there are a lot of modules, particularly optical modules, that
   implement TX_FAULT correctly but don't set the option bit to say
   that they support the signal.

2) the other module I'm aware of that does "funny stuff" with its
   TX_FAULT signal does have the TX_FAULT option bit set.

So, the option bit is completely untrustworthy and, therefore, is
meaningless (which is why we don't use it.)
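
For illustration only, here is a minimal Python sketch of where that
option bit sits. It assumes the SFF-8472 layout as I read it: bytes 64-65
of the EEPROM form a 16-bit big-endian options word, with TX_FAULT at
bit 3. The function name, the table and the sample bytes are made up for
the example, and nothing here is what sfp.c does (as said, the driver
ignores this bit).

```python
# Bit positions are my reading of the SFF-8472 options word (EEPROM bytes
# 64-65, big-endian); treat them as assumptions and check the spec.
OPTION_BITS = [
    (1 << 11, "retimer"),
    (1 << 4, "tx-disable"),
    (1 << 3, "tx-fault"),
    (1 << 2, "los-inverted"),
    (1 << 1, "los-normal"),
]


def decode_options(byte64, byte65):
    """Return the names of the option bits set in the 16-bit options word."""
    word = (byte64 << 8) | byte65
    return [name for bit, name in OPTION_BITS if word & bit]


# Hypothetical sample values, not read from any real module.
print(decode_options(0x08, 0x00))
print(decode_options(0x00, 0x1A))
```

A module that drives TX_FAULT correctly can still decode without
"tx-fault" here, which is exactly why the bit is of no use as a quirk
trigger.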

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 10:19                           ` ѽ҉ᶬḳ℠
@ 2020-01-10 11:46                             ` Russell King - ARM Linux admin
  2020-01-10 13:22                             ` Andrew Lunn
  1 sibling, 0 replies; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 11:46 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 10:19:47AM +0000, ѽ҉ᶬḳ℠ wrote:
> I just came across this
> http://edgemax5.rssing.com/chan-66822975/all_p1715.html#item34298
> 
> and albeit for a SFP g.fast module it indicates/implies that Metanoia
> provides own Linux drivers (supposedly GPL licensed), plus some bits
> pertaining to the EBM (Ethernet Boot Management protocol).
> 
> Has Metanoia submitted any SFP drivers to upstream kernel development?

Not that I'm aware of.


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 11:44                           ` Russell King - ARM Linux admin
@ 2020-01-10 12:45                             ` ѽ҉ᶬḳ℠
  2020-01-10 12:53                               ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 12:45 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 10/01/2020 11:44, Russell King - ARM Linux admin wrote:
>
> Which is also indicating everything is correct.  When the problem
> occurs, check the state of the signals again as close as possible
> to the event - it depends how long the transceiver keeps it
> asserted.  You will probably find tx-fault is indicating
> "in  hi IRQ".
I just discovered the userland tool gpioinfo (gpioinfo pca9538), which
seems more verbose:

gpiochip2 - 8 lines:
         line   0:      unnamed   "tx-fault"   input  active-high [used]
         line   1:      unnamed "tx-disable"  output  active-high [used]
         line   2:      unnamed "rate-select0" input active-high [used]
         line   3:      unnamed        "los"   input  active-high [used]
         line   4:      unnamed   "mod-def0"   input   active-low [used]
         line   5:      unnamed       unused   input  active-high
         line   6:      unnamed       unused   input  active-high
         line   7:      unnamed       unused   input  active-high

The above depicts the current state with the module working, i.e. online.
I will do some testing and report back; I am not sure yet how to keep a
close watch on the signals around the failure events.

>> - it would appear that SFP.C is trying to communicate with Fiber-GBIC and
>> fails since the signal reports may not be 100% compatible
> That's a fun claim, but note carefully the wording "may" which implies
> some uncertainty in the statement.

It was a verbatim translation, but yes, the same uncertainty is implied
in the original-language correspondence.

>
> Let's look at the wording of the GBIC (SFF-8053) and SFP (INF-8074 -
> SFP MSA) documents.  The wording for the "fault recovery" is identical
> between the two, which concerns what happens when TX_FAULT is asserted
> and how to recover from that.
>
> Concerning the implementation of TX_FAULT, SFF-8053 states:
>
>    If no transmitter safety circuitry is implemented, the TX_FAULT signal
>    may be tied to its negated state.
>
> but then says later in the document:
>
>    If TX_FAULT is not implemented, the signal shall be held to the low
>    state by the GBIC.
>
> Meanwhile, INF-8074 similarly states:
>
>    If no transmitter safety circuitry is implemented, the TX_FAULT signal
>    may be tied to its negated state.
>
> but later on has a similar statement:
>
>    TX_FAULT shall be implemented by those module definitions of SFP
>    transceiver supporting safety circuitry. If TX_FAULT is not
>    implemented, the signal shall be held to the low state by the SFP
>    transceiver.
>
> "shall" in both cases is stronger than "may".  So, there seems to be
> little difference between the GBIC and SFP usage of this signal.
>
> Their claim is that sfp.c implements the older GBIC style of signal
> reports.  My counter-claim is that (a) sfp.c is written to the SFP MSA
> and not the GBIC standard, and (b) there is no difference as far as the
> TX_FAULT signal is concerned between the GBIC standard and the SFP MSA.
>
> But... it doesn't matter that much, there's a module out there (and it
> isn't the only one) which does "funny stuff" with its TX_FAULT signal.
> Either we decide we want to support it and implement a quirk, or we
> decide we don't want to support it.
>
> There is an option bit in the EEPROM that is supposed to indicate
> whether the module supports TX_FAULT, but, as you can guess, there are
> problems with using that, as:
>
> 1) there are a lot of modules, particularly optical modules, that
>     implement TX_FAULT correctly but don't set the option bit to say
>     that they support the signal.
>
> 2) the other module I'm aware of that does "funny stuff" with its
>     TX_FAULT signal does have the TX_FAULT option bit set.
>
> So, the option bit is completely untrustworthy and, therefore, is
> meaningless (which is why we don't use it.)
>

Even with "shall" carrying a potentially higher weight than "may" it 
still does not imply something obligatory (set in stone) and leaves 
potential wiggle room.



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 12:45                             ` ѽ҉ᶬḳ℠
@ 2020-01-10 12:53                               ` Russell King - ARM Linux admin
  2020-01-10 15:02                                 ` ѽ҉ᶬḳ℠
  0 siblings, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 12:53 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 12:45:52PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 10/01/2020 11:44, Russell King - ARM Linux admin wrote:
> > 
> > Which is also indicating everything is correct.  When the problem
> > occurs, check the state of the signals again as close as possible
> > to the event - it depends how long the transceiver keeps it
> > asserted.  You will probably find tx-fault is indicating
> > "in  hi IRQ".
> just discovered userland - gpioinfo pca9538 - which seems more verbose
> 
> gpiochip2 - 8 lines:
>         line   0:      unnamed   "tx-fault"   input  active-high [used]
>         line   1:      unnamed "tx-disable"  output  active-high [used]
>         line   2:      unnamed "rate-select0" input active-high [used]
>         line   3:      unnamed        "los"   input  active-high [used]
>         line   4:      unnamed   "mod-def0"   input   active-low [used]
>         line   5:      unnamed       unused   input  active-high
>         line   6:      unnamed       unused   input  active-high
>         line   7:      unnamed       unused   input  active-high
> 
> The above is depicting the current state with the module working, i.e. being
> online. Will do some testing and report back, not sure yet how to keep a
> close watch relating to the failure events.

However, that doesn't give the current levels of the inputs, so it's
useless for the purpose I've asked for.


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 10:19                           ` ѽ҉ᶬḳ℠
  2020-01-10 11:46                             ` Russell King - ARM Linux admin
@ 2020-01-10 13:22                             ` Andrew Lunn
  2020-01-10 13:38                               ` ѽ҉ᶬḳ℠
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Lunn @ 2020-01-10 13:22 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠
  Cc: Russell King - ARM Linux admin, netdev

On Fri, Jan 10, 2020 at 10:19:47AM +0000, ѽ҉ᶬḳ℠ wrote:
> I just came across this
> http://edgemax5.rssing.com/chan-66822975/all_p1715.html#item34298
> 
> and albeit for a SFP g.fast module it indicates/implies that Metanoia
> provides own Linux drivers (supposedly GPL licensed), plus some bits
> pertaining to the EBM (Ethernet Boot Management protocol).
> 
> Has Metanoia submitted any SFP drivers to upstream kernel development?

I have also not seen any. You could ask for the sources. It is
unlikely we would use them, but they could provide documentation about
the quirks needed to make this device work properly.

    Andrew



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 13:22                             ` Andrew Lunn
@ 2020-01-10 13:38                               ` ѽ҉ᶬḳ℠
  0 siblings, 0 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 13:38 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Russell King - ARM Linux admin, netdev

On 10/01/2020 13:22, Andrew Lunn wrote:
> On Fri, Jan 10, 2020 at 10:19:47AM +0000, ѽ҉ᶬḳ℠ wrote:
>> I just came across this
>> http://edgemax5.rssing.com/chan-66822975/all_p1715.html#item34298
>>
>> and albeit for a SFP g.fast module it indicates/implies that Metanoia
>> provides own Linux drivers (supposedly GPL licensed), plus some bits
>> pertaining to the EBM (Ethernet Boot Management protocol).
>>
>> Has Metanoia submitted any SFP drivers to upstream kernel development?
> I have also not seen any. You could ask for the sources. It is
> unlikely we would use them, but they could provide documentation about
> the quirks needed to make this device work properly.
>
>      Andrew
>

I wish I could, since it would be really helpful.

As far as I can tell, some of the ISPs in Switzerland, at least Swisscom,
provide Metanoia designed/manufactured SFP modules as CPE, and those
drivers and EBM tools for Linux are packaged into the firmware shipped
by the ISP.

Too bad that Metanoia has not bothered to take the initiative and submit
those upstream, even if it were just to allow a peek at potential quirks.

Regrettably, I do not have a relationship (commercial or otherwise) with
either Metanoia or an ISP providing Metanoia equipment at a level that
would warrant a response to a request for such drivers.



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 12:53                               ` Russell King - ARM Linux admin
@ 2020-01-10 15:02                                 ` ѽ҉ᶬḳ℠
  2020-01-10 15:09                                   ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 15:02 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 10/01/2020 12:53, Russell King - ARM Linux admin wrote:
>
>>> Which is also indicating everything is correct.  When the problem
>>> occurs, check the state of the signals again as close as possible
>>> to the event - it depends how long the transceiver keeps it
>>> asserted.  You will probably find tx-fault is indicating
>>> "in  hi IRQ".
>> just discovered userland - gpioinfo pca9538 - which seems more verbose
>>
>> gpiochip2 - 8 lines:
>>          line   0:      unnamed   "tx-fault"   input  active-high [used]
>>          line   1:      unnamed "tx-disable"  output  active-high [used]
>>          line   2:      unnamed "rate-select0" input active-high [used]
>>          line   3:      unnamed        "los"   input  active-high [used]
>>          line   4:      unnamed   "mod-def0"   input   active-low [used]
>>          line   5:      unnamed       unused   input  active-high
>>          line   6:      unnamed       unused   input  active-high
>>          line   7:      unnamed       unused   input  active-high
>>
>> The above is depicting the current state with the module working, i.e. being
>> online. Will do some testing and report back, not sure yet how to keep a
>> close watch relating to the failure events.
> However, that doesn't give the current levels of the inputs, so it's
> useless for the purpose I've asked for.

Fair enough. Operational (online) state

gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
  gpio-504 ( |tx-fault     ) in  lo IRQ
  gpio-505 ( |tx-disable   ) out lo
  gpio-506 ( |rate-select0 ) in  lo
  gpio-507 ( |los          ) in  lo IRQ
  gpio-508 ( |mod-def0     ) in  lo IRQ

And the same remained (unchanged) during/after the events (as closely as
I was able to monitor) -> module transmit fault indicated

I also kept an eye on the link state of the NIC (eth2) with
"ip mo l dev eth2": the link does not come up while those transmit faults
are present and only gets back online once they have cleared.

I am at a loss really, since the SFP does not seem to be misbehaving. The
fault does not even reproduce reliably: just now it took a couple of
attempts to provoke the transmit fault, while at other times it occurs
straight away.
I suppose there could always be a hardware defect, though it seems
somewhat unlikely. That said, I do not have another such module available
to swap for testing.







* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 15:02                                 ` ѽ҉ᶬḳ℠
@ 2020-01-10 15:09                                   ` Russell King - ARM Linux admin
  2020-01-10 15:45                                     ` ѽ҉ᶬḳ℠
  0 siblings, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 15:09 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 03:02:51PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 10/01/2020 12:53, Russell King - ARM Linux admin wrote:
> > 
> > > > Which is also indicating everything is correct.  When the problem
> > > > occurs, check the state of the signals again as close as possible
> > > > to the event - it depends how long the transceiver keeps it
> > > > asserted.  You will probably find tx-fault is indicating
> > > > "in  hi IRQ".
> > > just discovered userland - gpioinfo pca9538 - which seems more verbose
> > > 
> > > gpiochip2 - 8 lines:
> > >          line   0:      unnamed   "tx-fault"   input  active-high [used]
> > >          line   1:      unnamed "tx-disable"  output  active-high [used]
> > >          line   2:      unnamed "rate-select0" input active-high [used]
> > >          line   3:      unnamed        "los"   input  active-high [used]
> > >          line   4:      unnamed   "mod-def0"   input   active-low [used]
> > >          line   5:      unnamed       unused   input  active-high
> > >          line   6:      unnamed       unused   input  active-high
> > >          line   7:      unnamed       unused   input  active-high
> > > 
> > > The above is depicting the current state with the module working, i.e. being
> > > online. Will do some testing and report back, not sure yet how to keep a
> > > close watch relating to the failure events.
> > However, that doesn't give the current levels of the inputs, so it's
> > useless for the purpose I've asked for.
> 
> Fair enough. Operational (online) state
> 
> gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
>  gpio-504 ( |tx-fault     ) in  lo IRQ
>  gpio-505 ( |tx-disable   ) out lo
>  gpio-506 ( |rate-select0 ) in  lo
>  gpio-507 ( |los          ) in  lo IRQ
>  gpio-508 ( |mod-def0     ) in  lo IRQ
> 
> And the same remained (unchanged) during/after the events (as closely I was
> able to monitor) -> module transmit fault indicated

Try:

while ! grep -A4 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done

which may have a better chance of catching it.
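
If a timestamp to line up against dmesg would help, roughly the same
polling can be sketched in Python. This is only an illustration under
assumptions: the helper names are made up, the regular expression is
written against the debugfs output quoted above, and the loop busy-polls
just like the shell one-liner.

```python
import re
import time


def find_level(text, name):
    """Return (direction, level) for the named line in debugfs gpio text, or None."""
    pattern = r"\|\s*" + re.escape(name) + r"\s*\)\s+(in|out)\s+(lo|hi)"
    match = re.search(pattern, text)
    return match.groups() if match else None


def wait_for_high(name, path="/sys/kernel/debug/gpio"):
    """Busy-poll the debugfs file until the named line reads hi; return the time."""
    while True:
        with open(path) as f:
            state = find_level(f.read(), name)
        if state is not None and state[1] == "hi":
            return time.time()


# Sample text copied from the output earlier in this thread.
SAMPLE = """\
gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
 gpio-504 (                    |tx-fault            ) in  lo IRQ
 gpio-505 (                    |tx-disable          ) out lo
"""
print(find_level(SAMPLE, "tx-fault"))
```

On the target, wait_for_high("tx-fault") would need root and a mounted
debugfs. Each read goes out over I2C to the expander, so a short
assertion can still be missed.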


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 15:09                                   ` Russell King - ARM Linux admin
@ 2020-01-10 15:45                                     ` ѽ҉ᶬḳ℠
  2020-01-10 16:32                                       ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 15:45 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 10/01/2020 15:09, Russell King - ARM Linux admin wrote:
> On Fri, Jan 10, 2020 at 03:02:51PM +0000, ѽ҉ᶬḳ℠ wrote:
>> On 10/01/2020 12:53, Russell King - ARM Linux admin wrote:
>>>>> Which is also indicating everything is correct. When the problem
>>>>> occurs, check the state of the signals again as close as possible
>>>>> to the event - it depends how long the transceiver keeps it
>>>>> asserted. You will probably find tx-fault is indicating
>>>>> "in  hi IRQ".
>>>> just discovered userland - gpioinfo pca9538 - which seems more verbose
>>>>
>>>> gpiochip2 - 8 lines:
>>>>         line   0:      unnamed   "tx-fault"   input active-high [used]
>>>>         line   1:      unnamed "tx-disable"  output active-high [used]
>>>>         line   2:      unnamed "rate-select0" input active-high [used]
>>>>         line   3:      unnamed        "los"   input active-high [used]
>>>>         line   4:      unnamed   "mod-def0"   input active-low [used]
>>>>         line   5:      unnamed       unused   input active-high
>>>>         line   6:      unnamed       unused   input active-high
>>>>         line   7:      unnamed       unused   input active-high
>>>>
>>>> The above is depicting the current state with the module working, i.e. being
>>>> online. Will do some testing and report back, not sure yet how to keep a
>>>> close watch relating to the failure events.
>>> However, that doesn't give the current levels of the inputs, so it's
>>> useless for the purpose I've asked for.
>> Fair enough. Operational (online) state
>>
>> gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
>>  gpio-504 ( |tx-fault     ) in  lo IRQ
>>  gpio-505 ( |tx-disable   ) out lo
>>  gpio-506 ( |rate-select0 ) in  lo
>>  gpio-507 ( |los          ) in  lo IRQ
>>  gpio-508 ( |mod-def0     ) in  lo IRQ
>>
>> And the same remained (unchanged) during/after the events (as closely I was
>> able to monitor) -> module transmit fault indicated
> Try:
>
> while ! grep -A4 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
>
> which may have a better chance of catching it.
>

I suppose you are not interested in what happens with ifdown wan, so just
for posterity:

  gpio-504 (                    |tx-fault            ) in  hi IRQ
  gpio-505 (                    |tx-disable          ) out hi
  gpio-506 (                    |rate-select0        ) in  lo
  gpio-507 (                    |los                 ) in  lo IRQ
  gpio-508 (                    |mod-def0            ) in  lo IRQ


When the interface is brought up again and happens to trigger a transmit
fault, tx-fault is not seen going hi, however. Nor did it try 5 times to
recover from the fault, unless dmesg missed some:

[Fri Jan 10 15:30:57 2020] mvneta f1034000.ethernet eth2: Link is Down
[Fri Jan 10 15:30:57 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[Fri Jan 10 15:31:13 2020] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[Fri Jan 10 15:31:13 2020] sfp sfp: module transmit fault indicated
[Fri Jan 10 15:31:15 2020] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control off
[Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault recovered
[Fri Jan 10 15:31:16 2020] mvneta f1034000.ethernet eth2: Link is Down
[Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault indicated
[Fri Jan 10 15:31:19 2020] sfp sfp: module persistently indicates fault, disabling
[Fri Jan 10 15:31:21 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[Fri Jan 10 15:31:21 2020] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[Fri Jan 10 15:31:21 2020] sfp sfp: module transmit fault indicated
[Fri Jan 10 15:31:27 2020] sfp sfp: module persistently indicates fault, disabling

[Fri Jan 10 15:38:01 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[Fri Jan 10 15:38:01 2020] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[Fri Jan 10 15:38:01 2020] sfp sfp: module transmit fault indicated
[Fri Jan 10 15:38:07 2020] sfp sfp: module persistently indicates fault, disabling

[Fri Jan 10 15:40:48 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[Fri Jan 10 15:40:48 2020] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[Fri Jan 10 15:40:48 2020] sfp sfp: module transmit fault indicated
[Fri Jan 10 15:40:54 2020] sfp sfp: module persistently indicates fault, disabling

Had to reboot the node to regain WAN connectivity.








* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 15:45                                     ` ѽ҉ᶬḳ℠
@ 2020-01-10 16:32                                       ` Russell King - ARM Linux admin
  2020-01-10 16:53                                         ` ѽ҉ᶬḳ℠
  0 siblings, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 16:32 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 03:45:15PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 10/01/2020 15:09, Russell King - ARM Linux admin wrote:
> > On Fri, Jan 10, 2020 at 03:02:51PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > On 10/01/2020 12:53, Russell King - ARM Linux admin wrote:
> > > > > > Which is also indicating everything is correct. When the problem
> > > > > > occurs, check the state of the signals again as close as possible
> > > > > > to the event - it depends how long the transceiver keeps it
> > > > > > asserted. You will probably find tx-fault is indicating
> > > > > > > "in  hi IRQ".
> > > > > just discovered userland - gpioinfo pca9538 - which seems more verbose
> > > > > 
> > > > > gpiochip2 - 8 lines:
> > > > >         line   0:      unnamed   "tx-fault"   input active-high [used]
> > > > >         line   1:      unnamed "tx-disable"  output active-high [used]
> > > > >         line   2:      unnamed "rate-select0" input active-high [used]
> > > > >         line   3:      unnamed        "los"   input active-high [used]
> > > > >         line   4:      unnamed   "mod-def0"   input active-low [used]
> > > > >         line   5:      unnamed       unused   input active-high
> > > > >         line   6:      unnamed       unused   input active-high
> > > > >         line   7:      unnamed       unused   input active-high
> > > > > 
> > > > > > The above is depicting the current state with the module working, i.e. being
> > > > > > online. Will do some testing and report back, not sure yet how to keep a
> > > > > > close watch relating to the failure events.
> > > > However, that doesn't give the current levels of the inputs, so it's
> > > > useless for the purpose I've asked for.
> > > Fair enough. Operational (online) state
> > > 
> > > gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
> > >  gpio-504 ( |tx-fault     ) in  lo IRQ
> > >  gpio-505 ( |tx-disable   ) out lo
> > >  gpio-506 ( |rate-select0 ) in  lo
> > >  gpio-507 ( |los          ) in  lo IRQ
> > >  gpio-508 ( |mod-def0     ) in  lo IRQ
> > > 
> > > And the same remained (unchanged) during/after the events (as closely I was
> > > able to monitor) -> module transmit fault indicated
> > Try:
> > 
> > while ! grep -A4 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
> > 
> > which may have a better chance of catching it.
> > 
> 
> Suppose you are not interested in what happens with ifdown wan, so just for
> posterity
> 
>  gpio-504 (                    |tx-fault            ) in  hi IRQ
>  gpio-505 (                    |tx-disable          ) out hi
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ

Right, because the state of TX_FAULT is not defined while TX_DISABLE
is high.

> When the iif is brought up again and happens to trigger a transmit fault the
> hi is not being triggered however. And it did not try 5 times to recover
> from the fault, unless dmesg missed some
> 
> [Fri Jan 10 15:30:57 2020] mvneta f1034000.ethernet eth2: Link is Down
> [Fri Jan 10 15:30:57 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not
> ready

Here is where you brought the interface down.

> [Fri Jan 10 15:31:13 2020] mvneta f1034000.ethernet eth2: configuring for
> inband/1000base-x link mode

Here is where you brought the interface back up.

> [Fri Jan 10 15:31:13 2020] sfp sfp: module transmit fault indicated

The module failed to deassert TX_FAULT within the 300ms window.

> [Fri Jan 10 15:31:15 2020] mvneta f1034000.ethernet eth2: Link is Up -
> 1Gbps/Full - flow control off
> [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault recovered

I'm not sure why these two messages are this way round; they should
be the other way (I guess that's something to do with printk() no
longer being synchronous.)  Given that it seems to have taken two
seconds to recover, that will be two reset attempts.

> [Fri Jan 10 15:31:16 2020] mvneta f1034000.ethernet eth2: Link is Down
> [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault indicated

The module re-asserts TX_FAULT...

> [Fri Jan 10 15:31:19 2020] sfp sfp: module persistently indicates fault,
> disabling

After another three attempts, the module is still asserting TX_FAULT.

We don't report every attempt made to recover the fault; that would
flood the log.  Instead, we report when the fault occurred, then try
to reset the fault every second for up to five attempts _total_.  If
we exhaust the attempts, you get "module persistently indicates fault,
disabling".  If successfully recovered, you'll get "module transmit
fault recovered".

We don't reset the retries after a successful recovery as that would
mean a transitory safety fault occurring once every few seconds would
endlessly fill the kernel log, and also may go unnoticed.  If it's
spitting out safety faults like that, the module would be faulty.
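
To make the accounting concrete, here is a toy Python model of the
behaviour described above. It is not the code in sfp.c: the function name
and the way fault episodes are encoded are invented for the example, and
the only things taken from the description are the single budget of five
attempts and the fact that it is not refilled after a recovery.

```python
def simulate(episodes, budget=5):
    """Model the shared retry budget described above.

    episodes: one entry per TX_FAULT assertion, giving how many reset
    attempts (at least 1) the module needs before the fault clears, or
    None if it never clears. Returns the messages that would be logged.
    """
    log = []
    remaining = budget  # shared by all episodes, never refilled
    for needed in episodes:
        log.append("module transmit fault indicated")
        while True:
            if remaining == 0:
                log.append("module persistently indicates fault, disabling")
                return log
            remaining -= 1  # one reset attempt (roughly one per second)
            if needed is not None:
                needed -= 1
                if needed <= 0:
                    log.append("module transmit fault recovered")
                    break
    return log


# Two attempts to clear the first fault, then a fault that never clears.
for line in simulate([2, None]):
    print(line)
```

Fed with [2, None], the model gives the same four sfp messages as the
15:31:13 to 15:31:19 part of the log: two attempts are used up by the
first fault, leaving three for the second.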


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 16:32                                       ` Russell King - ARM Linux admin
@ 2020-01-10 16:53                                         ` ѽ҉ᶬḳ℠
  2020-01-10 17:08                                           ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 16:53 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 10/01/2020 16:32, Russell King - ARM Linux admin wrote:
> On Fri, Jan 10, 2020 at 03:45:15PM +0000, ѽ҉ᶬḳ℠ wrote:
>> On 10/01/2020 15:09, Russell King - ARM Linux admin wrote:
>>> On Fri, Jan 10, 2020 at 03:02:51PM +0000, ѽ҉ᶬḳ℠ wrote:
>>>> On 10/01/2020 12:53, Russell King - ARM Linux admin wrote:
>>>>>>> Which is also indicating everything is correct. When the problem
>>>>>>> occurs, check the state of the signals again as close as possible
>>>>>>> to the event - it depends how long the transceiver keeps it
>>>>>>> asserted. You will probably find tx-fault is indicating
>>>>>>> "in hi IRQ".
>>>>>> just discovered userland - gpioinfo pca9538 - which seems more verbose
>>>>>>
>>>>>> gpiochip2 - 8 lines:
>>>>>>          line   0:      unnamed   "tx-fault"   input active-high [used]
>>>>>>          line   1:      unnamed "tx-disable"  output active-high [used]
>>>>>>          line   2:      unnamed "rate-select0" input active-high [used]
>>>>>>          line   3:      unnamed        "los"   input active-high [used]
>>>>>>          line   4:      unnamed   "mod-def0"   input active-low [used]
>>>>>>          line   5:      unnamed       unused   input active-high
>>>>>>          line   6:      unnamed       unused   input active-high
>>>>>>          line   7:      unnamed       unused   input active-high
>>>>>>
>>>>>> The above is depicting the current state with the module
>>>>>> working, i.e. being
>>>>>> online. Will do some testing and report back, not sure yet
>>>>>> how to keep a
>>>>>> close watch relating to the failure events.
>>>>> However, that doesn't give the current levels of the inputs, so it's
>>>>> useless for the purpose I've asked for.
>>>> Fair enough. Operational (online) state
>>>>
>>>> gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
>>>>   gpio-504 ( |tx-fault     ) in  lo IRQ
>>>>   gpio-505 ( |tx-disable   ) out lo
>>>>   gpio-506 ( |rate-select0 ) in  lo
>>>>   gpio-507 ( |los          ) in  lo IRQ
>>>>   gpio-508 ( |mod-def0     ) in  lo IRQ
>>>>
>>>> And the same remained (unchanged) during/after the events (as
>>>> closely I was
>>>> able to monitor) -> module transmit fault indicated
>>> Try:
>>>
>>> while ! grep -A4 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
>>>
>>> which may have a better chance of catching it.
>>>
>> Suppose you are not interested in what happens with ifdown wan, so just for
>> posterity
>>
>>   gpio-504 (                    |tx-fault            ) in  hi IRQ
>>   gpio-505 (                    |tx-disable          ) out hi
>>   gpio-506 (                    |rate-select0        ) in  lo
>>   gpio-507 (                    |los                 ) in  lo IRQ
>>   gpio-508 (                    |mod-def0            ) in  lo IRQ
> Right, because the state of TX_FAULT is not defined while TX_DISABLE
> is high.
>
>> When the iif is brought up again and happens to trigger a transmit fault the
>> hi is not being triggered however. And it did not try 5 times to recover
>> from the fault, unless dmesg missed some
>>
>> [Fri Jan 10 15:30:57 2020] mvneta f1034000.ethernet eth2: Link is Down
>> [Fri Jan 10 15:30:57 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not
>> ready
> Here is where you brought the interface down.
>
>> [Fri Jan 10 15:31:13 2020] mvneta f1034000.ethernet eth2: configuring for
>> inband/1000base-x link mode
> Here is where you brought the interface back up.
>
>> [Fri Jan 10 15:31:13 2020] sfp sfp: module transmit fault indicated
> The module failed to deassert TX_FAULT within the 300ms window.
>
>> [Fri Jan 10 15:31:15 2020] mvneta f1034000.ethernet eth2: Link is Up -
>> 1Gbps/Full - flow control off
>> [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault recovered
> I'm not sure why these two messages are this way round; they should
> be the other way (I guess that's something to do with printk() no
> longer being synchronous.)  Given that it seems to have taken two
> seconds to recover, that will be two reset attempts.
>
>> [Fri Jan 10 15:31:16 2020] mvneta f1034000.ethernet eth2: Link is Down
>> [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault indicated
> The module re-asserts TX_FAULT...
>
>> [Fri Jan 10 15:31:19 2020] sfp sfp: module persistently indicates fault,
>> disabling
> After another three attempts, the module is still asserting TX_FAULT.
>
> We don't report every attempt made to recover the fault; that would
> flood the log.  Instead, we report when the fault occurred, then try
> to reset the fault every second for up to five attempts _total_.  If
> we exhaust the attempts, you get "module persistently indicates fault,
> disabling".  If successfully recovered, you'll get "module transmit
> fault recovered".
>
> We don't reset the retries after a successful recovery as that would
> mean a transitory safety fault occurring once every few seconds would
> endlessly fill the kernel log, and also may go unnoticed.  If it's
> spitting out safety faults like that, the module would be faulty.
>

It seems the debugging avenues have been exhausted, short of running sfp.c
in debug mode.


There is still no explanation for why the module passes the 300 ms TX_FAULT
deassert test most of the time but fails intermittently, which seems
inconsistent. Maybe it is just wishful thinking, but it seems a bit
far-fetched that the module is really causing this; at least the GPIO
readings do not provide any such indication.

Could something be choking or blocking the communication channel between
the module and the kernel, such as kernel code getting stuck or leaking
memory?
Could the ifupdown routine, which has its own implementation in OpenWrt,
be interfering, e.g. in the way it constructs or tears down the iif, though
I do not see how?



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 16:53                                         ` ѽ҉ᶬḳ℠
@ 2020-01-10 17:08                                           ` Russell King - ARM Linux admin
  2020-01-10 17:19                                             ` ѽ҉ᶬḳ℠
  0 siblings, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 17:08 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
> Seems that the debug avenue has been exhausted, short of running SFP.C in
> debug mode.

You're saying you never see TX_FAULT asserted other than when the
interface is down?

> There is still no explanation why the module passes the 300 ms deassert
> TX_FAULT test most of the time but fails intermittently at other times,
> being kind of incoherent. Maybe it is just wishful thinking but it seems a
> bit far-fetched that the module is really causing this, least the readings
> from GPIO do not provide any such indicator.
> 
> Could there be something choking / blocking the communication channel
> between the module and the kernel, some kernel code getting stuck / leaked
> in memory?

There is no "communication channel" involved here.  It is just those
GPIOs.

> Could the ifupdown routine, which has its own implementation in OpenWrt, be
> an interfering agent, e.g. the way it constructs or tears down the iif,
> though I do not see how?

Very unlikely.



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 17:08                                           ` Russell King - ARM Linux admin
@ 2020-01-10 17:19                                             ` ѽ҉ᶬḳ℠
  2020-01-10 17:38                                               ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 17:19 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev


On 10/01/2020 17:08, Russell King - ARM Linux admin wrote:
> On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
>> Seems that the debug avenue has been exhausted, short of running SFP.C in
>> debug mode.
> You're saying you never see TX_FAULT asserted other than when the
> interface is down?

Yes, the fault never appears once the iif is up; it is rock-stable in that
state. It only ever occurs during the transition from down to up.
Pardon me if that has not been made explicitly clear previously.



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 17:19                                             ` ѽ҉ᶬḳ℠
@ 2020-01-10 17:38                                               ` Russell King - ARM Linux admin
  2020-01-10 18:44                                                 ` ѽ҉ᶬḳ℠
  0 siblings, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 17:38 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 05:19:35PM +0000, ѽ҉ᶬḳ℠ wrote:
> 
> On 10/01/2020 17:08, Russell King - ARM Linux admin wrote:
> > On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > Seems that the debug avenue has been exhausted, short of running SFP.C in
> > > debug mode.
> > You're saying you never see TX_FAULT asserted other than when the
> > interface is down?
> 
> Yes, it never exhibits once the iif is up - it is rock-stable in that state,
> only ever when being transitioned from down state to up state.
> Pardon, if that has not been made explicitly clear previously.

I think if we were to have SFP debug enabled, you'll find that
TX_FAULT is being reported to SFP as being asserted.

You probably aren't running that while loop, as it will exit when
it sees TX_FAULT asserted.  So, here's another bit of shell code
for you to run:

ip li set dev eth2 down; \
ip li set dev eth2 up; \
date
while :; do
  cat /proc/uptime
  while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
  cat /proc/uptime
  while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
done

This will give you output such as:

Fri 10 Jan 17:31:06 GMT 2020
774869.13 1535859.48
 gpio-509 (                    |tx-fault            ) in  hi ...
774869.14 1535859.49
 gpio-509 (                    |tx-fault            ) in  lo ...
774869.15 1535859.50

The first date and "uptime" output is the timestamp when the interface
was brought up.  Subsequent "uptime" outputs can be used to calculate
the time difference in seconds between the state printed immediately
prior to the uptime output, and the first "uptime" output.

So in the above example, the tx-fault signal was hi at 10ms, and then
went low 20ms after the up.
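
The subtraction can be left to awk.  The helper below is a minimal sketch
with a made-up name; it assumes the capture has exactly the layout shown
above (uptime lines interleaved with the grep output) and reports each
tx-fault state against the first uptime line.

```shell
#!/bin/sh
# Hypothetical helper: annotate a capture from the loop above with the time
# elapsed since the first /proc/uptime line.
annotate_uptime() {
    awk '
        # Remember the most recent tx-fault level printed by grep.
        /tx-fault/ { level = ($0 ~ /in +hi/) ? "hi" : "lo"; next }
        # An uptime line: the first is the reference, later ones are deltas.
        /^[0-9]+\.[0-9]+ +[0-9]+\.[0-9]+$/ {
            if (!seen) { first = $1; seen = 1; next }
            printf "tx-fault %s by +%.2fs\n", level, $1 - first
        }
    ' "$@"
}

# The example output from this mail, fed in on stdin.
annotate_uptime <<'EOF'
Fri 10 Jan 17:31:06 GMT 2020
774869.13 1535859.48
 gpio-509 (                    |tx-fault            ) in  hi ...
774869.14 1535859.49
 gpio-509 (                    |tx-fault            ) in  lo ...
774869.15 1535859.50
EOF
```

For the example this prints "tx-fault hi by +0.01s" and "tx-fault lo by
+0.02s", i.e. the 10ms and 20ms figures given above.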

However, bear in mind that even this will not be good enough to spot
transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt
capable, watching /proc/interrupts may tell you more.

If the TX_FAULT signal is as stable as you claim it is, you should see
the interrupt count for it remaining the same.
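
One way to compare the counts, as a rough sketch: the helper name is made
up, and it assumes the SFP interrupt lines in /proc/interrupts end in "sfp"
and that every per-CPU count column should be summed.

```shell
#!/bin/sh
# Hypothetical helper: print "<irq>: <total>" for every interrupt line whose
# last field is "sfp", summing the per-CPU count columns.
sfp_irq_counts() {
    awk '$NF == "sfp" {
        total = 0
        # Count columns follow the IRQ number and stop at the chip name.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
            total += $i
        print $1, total
    }' "${1:-/proc/interrupts}"
}

# Take one snapshot before and one after a down/up cycle and compare them.
if [ -r /proc/interrupts ]; then
    sfp_irq_counts
fi
```

An optional file argument lets the same function be run on a saved copy of
/proc/interrupts.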



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 17:38                                               ` Russell King - ARM Linux admin
@ 2020-01-10 18:44                                                 ` ѽ҉ᶬḳ℠
  2020-01-10 19:01                                                   ` Russell King - ARM Linux admin
  2020-01-10 19:23                                                   ` Andrew Lunn
  0 siblings, 2 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 18:44 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 10/01/2020 17:38, Russell King - ARM Linux admin wrote:
>
>>> On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
>>>> Seems that the debug avenue has been exhausted, short of running SFP.C in
>>>> debug mode.
>>> You're saying you never see TX_FAULT asserted other than when the
>>> interface is down?
>> Yes, it never exhibits once the iif is up - it is rock-stable in that state,
>> only ever when being transitioned from down state to up state.
>> Pardon, if that has not been made explicitly clear previously.
> I think if we were to have SFP debug enabled, you'll find that
> TX_FAULT is being reported to SFP as being asserted.

If really necessary I could ask the TOS developers to assist, though I am
not sure whether they would oblige. Their master branch build bot compiles
twice a day.
Would it just involve setting a kernel debug flag, or something more
elaborate?

>
> You probably aren't running that while loop, as it will exit when
> it sees TX_FAULT asserted.  So, here's another bit of shell code
> for you to run:
>
> ip li set dev eth2 down; \
> ip li set dev eth2 up; \
> date
> while :; do
>    cat /proc/uptime
>    while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
>    cat /proc/uptime
>    while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
> done
>
> This will give you output such as:
>
> Fri 10 Jan 17:31:06 GMT 2020
> 774869.13 1535859.48
>   gpio-509 (                    |tx-fault            ) in  hi ...
> 774869.14 1535859.49
>   gpio-509 (                    |tx-fault            ) in  lo ...
> 774869.15 1535859.50
>
> The first date and "uptime" output is the timestamp when the interface
> was brought up.  Subsequent "uptime" outputs can be used to calculate
> the time difference in seconds between the state printed immediately
> prior to the uptime output, and the first "uptime" output.
>
> So in the above example, the tx-fault signal was hi at 10ms, and then
> went low 20ms after the up.

Awfully nice of you to provide the code. This is the output when the iif
is brought down and up again and exhibits the transmit fault.

ip li set dev eth2 down; \
 > ip li set dev eth2 up; \
 > date
Fri Jan 10 18:34:52 GMT 2020
root@to:~# while :; do
 >   cat /proc/uptime
 >   while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
 >   cat /proc/uptime
 >   while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
 > done
1865.20 3224.67
  gpio-504 (                    |tx-fault            ) in  hi IRQ
  gpio-505 (                    |tx-disable          ) out hi
  gpio-506 (                    |rate-select0        ) in  lo
  gpio-507 (                    |los                 ) in  lo IRQ
  gpio-508 (                    |mod-def0            ) in  lo IRQ
1871.77 3230.71
  gpio-504 (                    |tx-fault            ) in  lo IRQ
  gpio-505 (                    |tx-disable          ) out lo
  gpio-506 (                    |rate-select0        ) in  lo
  gpio-507 (                    |los                 ) in  lo IRQ
  gpio-508 (                    |mod-def0            ) in  lo IRQ
1919.06 3309.55
  gpio-504 (                    |tx-fault            ) in  hi IRQ
  gpio-505 (                    |tx-disable          ) out lo
  gpio-506 (                    |rate-select0        ) in  lo
  gpio-507 (                    |los                 ) in  lo IRQ
  gpio-508 (                    |mod-def0            ) in  lo IRQ
1919.07 3309.57
  gpio-504 (                    |tx-fault            ) in  lo IRQ
  gpio-505 (                    |tx-disable          ) out lo
  gpio-506 (                    |rate-select0        ) in  lo
  gpio-507 (                    |los                 ) in  lo IRQ
  gpio-508 (                    |mod-def0            ) in  lo IRQ
1920.68 3312.28
  gpio-504 (                    |tx-fault            ) in  hi IRQ
  gpio-505 (                    |tx-disable          ) out lo
  gpio-506 (                    |rate-select0        ) in  lo
  gpio-507 (                    |los                 ) in  lo IRQ
  gpio-508 (                    |mod-def0            ) in  lo IRQ
1921.86 3314.21
  gpio-504 (                    |tx-fault            ) in  lo IRQ
  gpio-505 (                    |tx-disable          ) out lo
  gpio-506 (                    |rate-select0        ) in  lo
  gpio-507 (                    |los                 ) in  lo IRQ
  gpio-508 (                    |mod-def0            ) in  lo IRQ
1921.86 3314.21

> However, bear in mind that even this will not be good enough to spot
> transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt
> capable, watching /proc/interrupts may tell you more.
>
> If the TX_FAULT signal is as stable as you claim it is, you should see
> the interrupt count for it remaining the same.

Once the iif is up those values remain stable indeed.

cat /proc/interrupts | grep sfp
  52:          0          0   pca953x   4 Edge      sfp
  53:          0          0   pca953x   3 Edge      sfp
  54:          6          0   pca953x   0 Edge      sfp

and only incrementing with ifupdown action (which would be logical)

cat /proc/interrupts | grep sfp
  52:          0          0   pca953x   4 Edge      sfp
  53:          0          0   pca953x   3 Edge      sfp
  54:         11          0   pca953x   0 Edge      sfp




* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 18:44                                                 ` ѽ҉ᶬḳ℠
@ 2020-01-10 19:01                                                   ` Russell King - ARM Linux admin
  2020-01-10 19:36                                                     ` ѽ҉ᶬḳ℠
  2020-01-10 19:23                                                   ` Andrew Lunn
  1 sibling, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 19:01 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 06:44:18PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 10/01/2020 17:38, Russell King - ARM Linux admin wrote:
> > 
> > > > On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > > > Seems that the debug avenue has been exhausted, short of running SFP.C in
> > > > > debug mode.
> > > > You're saying you never see TX_FAULT asserted other than when the
> > > > interface is down?
> > > Yes, it never exhibits once the iif is up - it is rock-stable in that state,
> > > only ever when being transitioned from down state to up state.
> > > Pardon, if that has not been made explicitly clear previously.
> > I think if we were to have SFP debug enabled, you'll find that
> > TX_FAULT is being reported to SFP as being asserted.
> 
> If really necessary I could ask the TOS developers to assist, not sure
> whether they would oblige. Their Master branch build bot compiles twice a
> day.
> Would it just involve setting a kernel debug flag or something more
> elaborate?
> 
> > 
> > You probably aren't running that while loop, as it will exit when
> > it sees TX_FAULT asserted.  So, here's another bit of shell code
> > for you to run:
> > 
> > ip li set dev eth2 down; \
> > ip li set dev eth2 up; \
> > date
> > while :; do
> >    cat /proc/uptime
> >    while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
> >    cat /proc/uptime
> >    while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
> > done
> > 
> > This will give you output such as:
> > 
> > Fri 10 Jan 17:31:06 GMT 2020
> > 774869.13 1535859.48
> >   gpio-509 (                    |tx-fault            ) in  hi ...
> > 774869.14 1535859.49
> >   gpio-509 (                    |tx-fault            ) in  lo ...
> > 774869.15 1535859.50
> > 
> > The first date and "uptime" output is the timestamp when the interface
> > was brought up.  Subsequent "uptime" outputs can be used to calculate
> > the time difference in seconds between the state printed immediately
> > prior to the uptime output, and the first "uptime" output.
> > 
> > So in the above example, the tx-fault signal was hi at 10ms, and then
> > went low 20ms after the up.
> 
> awfully nice of you to provide the code, this is the output when the iif is
> brought down and up again and exhibiting the transmit fault.
> 
> ip li set dev eth2 down; \
> > ip li set dev eth2 up; \
> > date
> Fri Jan 10 18:34:52 GMT 2020
> root@to:~# while :; do
> >   cat /proc/uptime
> >   while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
> >   cat /proc/uptime
> >   while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
> > done

Hmm, I missed a ; \ at the end of "date", so this isn't quite what
I wanted, but it'll do.  What that means is that:

> 1865.20 3224.67

doesn't bear the relationship that I wanted to the interface coming
up.
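
A sketch of the intended form, with the missing "; \" restored so that
date and the loop start immediately after the interface is brought up.
The function wrapper and its name are only for illustration; eth2 and the
debugfs path are the ones already used in this thread.

```shell
#!/bin/sh
# Sketch only: wraps the commands from this thread in a function
# (hypothetical name) so nothing runs until it is called explicitly.
watch_tx_fault() {
    dev=${1:-eth2}
    gpio=/sys/kernel/debug/gpio
    ip link set dev "$dev" down; \
    ip link set dev "$dev" up; \
    date; \
    while :; do
        cat /proc/uptime
        while ! grep -A5 'tx-fault.*in  hi' "$gpio"; do :; done
        cat /proc/uptime
        while ! grep -A5 'tx-fault.*in  lo' "$gpio"; do :; done
    done
}

# Needs root and a mounted debugfs; stop it with Ctrl-C.
# watch_tx_fault eth2
```

With the three commands chained, the first uptime line is taken right
after the interface comes up rather than whenever the loop is pasted.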

>  gpio-504 (                    |tx-fault            ) in  hi IRQ
>  gpio-505 (                    |tx-disable          ) out hi
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1871.77 3230.71

TX_FAULT is high at 1871.77 and TX_DISABLE is high, so the interface
is down.

>  gpio-504 (                    |tx-fault            ) in  lo IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1919.06 3309.55

Almost 47.3s later, TX_FAULT has gone low.

>  gpio-504 (                    |tx-fault            ) in  hi IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1919.07 3309.57

After 10ms, it goes high again - this will cause the first report of
a transmit fault.

>  gpio-504 (                    |tx-fault            ) in  lo IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1920.68 3312.28

About 1.6s later, it goes low, maybe as a result of the first attempt
to clear the fault by a brief pulse on TX_DISABLE.

So, we wait 1s before asserting TX_DISABLE for 10us, which would have
happened around 1920.07.  We then have 300ms for initialisation, which
would've taken us to 1920.37, so this may have been interpreted as the
fault still being present.  The next clearance attempt would have been
scheduled for about 1921.37.
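
The same arithmetic, written out as a small sketch.  The 1s retry delay and
the 300ms initialisation window are the figures quoted in this mail; the
function and variable names are made up.

```shell
#!/bin/sh
# Reconstruct the estimated timeline from the time of a fault report.
timeline() {
    awk -v fault="$1" 'BEGIN {
        retry = 1.00; init = 0.30
        pulse = fault + retry      # TX_DISABLE pulsed to clear the fault
        check = pulse + init       # end of the initialisation window
        nxt   = check + retry      # next clearance attempt
        printf "pulse %.2f\ncheck %.2f\nnext  %.2f\n", pulse, check, nxt
    }'
}

# First fault report in the capture above.
timeline 1919.07
```

This gives 1920.07, 1920.37 and 1921.37.  TX_FAULT was only seen low at
1920.68, which is after the end of the initialisation window.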

>  gpio-504 (                    |tx-fault            ) in  hi IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1921.86 3314.21

1.2s later, it re-asserts.

>  gpio-504 (                    |tx-fault            ) in  lo IRQ
>  gpio-505 (                    |tx-disable          ) out lo
>  gpio-506 (                    |rate-select0        ) in  lo
>  gpio-507 (                    |los                 ) in  lo IRQ
>  gpio-508 (                    |mod-def0            ) in  lo IRQ
> 1921.86 3314.21

and deasserts within the same 10ms.

> > However, bear in mind that even this will not be good enough to spot
> > transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt
> > capable, watching /proc/interrupts may tell you more.
> > 
> > If the TX_FAULT signal is as stable as you claim it is, you should see
> > the interrupt count for it remaining the same.
> 
> Once the iif is up those values remain stable indeed.
> 
> cat /proc/interrupts | grep sfp
>  52:          0          0   pca953x   4 Edge      sfp
>  53:          0          0   pca953x   3 Edge      sfp
>  54:          6          0   pca953x   0 Edge      sfp
> 
> and only incrementing with ifupdown action (which would be logical)
> 
> cat /proc/interrupts | grep sfp
>  52:          0          0   pca953x   4 Edge      sfp
>  53:          0          0   pca953x   3 Edge      sfp
>  54:         11          0   pca953x   0 Edge      sfp

According to this, TX_FAULT has toggled five times.

This would seem to negate your previous comment about TX_FAULT being
stable.

Therefore, I'd say that the SFP state machines are operating as
designed, and as per the SFP MSA, and what we have is a module that
likes to assert TX_FAULT for unknown reasons, and this confirms the
hypothesis I've been putting forward.



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 18:44                                                 ` ѽ҉ᶬḳ℠
  2020-01-10 19:01                                                   ` Russell King - ARM Linux admin
@ 2020-01-10 19:23                                                   ` Andrew Lunn
  2020-01-11 12:58                                                     ` ѽ҉ᶬḳ℠
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Lunn @ 2020-01-10 19:23 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠
  Cc: Russell King - ARM Linux admin, netdev

> If really necessary I could ask the TOS developers to assist, not sure
> whether they would oblige. Their Master branch build bot compiles twice a
> day.
> Would it just involve setting a kernel debug flag or something more
> elaborate?

You could ask them to build a kernel with dynamic debug enabled

https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html

You can then turn on debugging in a flexible way. And it will be
useful for more than just you.
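
As a rough sketch of what that looks like once such a kernel is running:
this assumes CONFIG_DYNAMIC_DEBUG=y and debugfs mounted at
/sys/kernel/debug, and the helper name is made up.

```shell
#!/bin/sh
# Hypothetical helper: enable any dev_dbg()/pr_debug() call sites in sfp.c
# by writing a dynamic debug query to the control file.
enable_sfp_debug() {
    control=${1:-/sys/kernel/debug/dynamic_debug/control}
    echo 'file sfp.c +p' > "$control"
}

# As root:
# enable_sfp_debug
# dmesg | grep sfp
```

Writing 'file sfp.c -p' to the same control file turns the messages off
again.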

    Andrew


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 19:01                                                   ` Russell King - ARM Linux admin
@ 2020-01-10 19:36                                                     ` ѽ҉ᶬḳ℠
  2020-01-10 19:55                                                       ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 19:36 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 10/01/2020 19:01, Russell King - ARM Linux admin wrote:
> On Fri, Jan 10, 2020 at 06:44:18PM +0000, ѽ҉ᶬḳ℠ wrote:
>> On 10/01/2020 17:38, Russell King - ARM Linux admin wrote:
>>>>> On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
>>>>>> Seems that the debug avenue has been exhausted, short of running SFP.C in
>>>>>> debug mode.
>>>>> You're saying you never see TX_FAULT asserted other than when the
>>>>> interface is down?
>>>> Yes, it never exhibits once the iif is up - it is rock-stable in that state,
>>>> only ever when being transitioned from down state to up state.
>>>> Pardon, if that has not been made explicitly clear previously.
>>> I think if we were to have SFP debug enabled, you'll find that
>>> TX_FAULT is being reported to SFP as being asserted.
>> If really necessary I could ask the TOS developers to assist, not sure
>> whether they would oblige. Their Master branch build bot compiles twice a
>> day.
>> Would it just involve setting a kernel debug flag or something more
>> elaborate?
>>
>>> You probably aren't running that while loop, as it will exit when
>>> it sees TX_FAULT asserted.  So, here's another bit of shell code
>>> for you to run:
>>>
>>> ip li set dev eth2 down; \
>>> ip li set dev eth2 up; \
>>> date
>>> while :; do
>>>     cat /proc/uptime
>>>     while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
>>>     cat /proc/uptime
>>>     while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
>>> done
>>>
>>> This will give you output such as:
>>>
>>> Fri 10 Jan 17:31:06 GMT 2020
>>> 774869.13 1535859.48
>>>    gpio-509 (                    |tx-fault            ) in  hi ...
>>> 774869.14 1535859.49
>>>    gpio-509 (                    |tx-fault            ) in  lo ...
>>> 774869.15 1535859.50
>>>
>>> The first date and "uptime" output is the timestamp when the interface
>>> was brought up.  Subsequent "uptime" outputs can be used to calculate
>>> the time difference in seconds between the state printed immediately
>>> prior to the uptime output, and the first "uptime" output.
>>>
>>> So in the above example, the tx-fault signal was hi at 10ms, and then
>>> went low 20ms after the up.
>> awfully nice of you to provide the code, this is the output when the iif is
>> brought down and up again and exhibiting the transmit fault.
>>
>> ip li set dev eth2 down; \
>>> ip li set dev eth2 up; \
>>> date
>> Fri Jan 10 18:34:52 GMT 2020
>> root@to:~# while :; do
>>>     cat /proc/uptime
>>>     while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
>>>     cat /proc/uptime
>>>     while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
>>> done
> Hmm, I missed a ; \ at the end of "date", so this isn't quite what
> I wanted, but it'll do.  What that means is that:
>
>> 1865.20 3224.67
> doesn't bear the relationship that I wanted to the interface coming
> up.
>
>>   gpio-504 (                    |tx-fault            ) in  hi IRQ
>>   gpio-505 (                    |tx-disable          ) out hi
>>   gpio-506 (                    |rate-select0        ) in  lo
>>   gpio-507 (                    |los                 ) in  lo IRQ
>>   gpio-508 (                    |mod-def0            ) in  lo IRQ
>> 1871.77 3230.71
> TX_FAULT is high at 1871.77 and TX_DISABLE is high, so the interface
> is down.
>
>>   gpio-504 (                    |tx-fault            ) in  lo IRQ
>>   gpio-505 (                    |tx-disable          ) out lo
>>   gpio-506 (                    |rate-select0        ) in  lo
>>   gpio-507 (                    |los                 ) in  lo IRQ
>>   gpio-508 (                    |mod-def0            ) in  lo IRQ
>> 1919.06 3309.55
> Almost 47.3s later, TX_FAULT has gone low.

This correlates with invoking ifup

>
>>   gpio-504 (                    |tx-fault            ) in  hi IRQ
>>   gpio-505 (                    |tx-disable          ) out lo
>>   gpio-506 (                    |rate-select0        ) in  lo
>>   gpio-507 (                    |los                 ) in  lo IRQ
>>   gpio-508 (                    |mod-def0            ) in  lo IRQ
>> 1919.07 3309.57
> After 10ms, it goes high again - this will cause the first report of
> a transmit fault.
>
>>   gpio-504 (                    |tx-fault            ) in  lo IRQ
>>   gpio-505 (                    |tx-disable          ) out lo
>>   gpio-506 (                    |rate-select0        ) in  lo
>>   gpio-507 (                    |los                 ) in  lo IRQ
>>   gpio-508 (                    |mod-def0            ) in  lo IRQ
>> 1920.68 3312.28
> About 1.6s later, it goes low, maybe as a result of the first attempt
> to clear the fault by a brief pulse on TX_DISABLE.
>
> So, we wait 1s before asserting TX_DISABLE for 10us, which would have
> happened around 1920.07.  We then have 300ms for initialisation, which
> would've taken us to 1920.37, so this may have been interpreted as the
> fault still being present.  The next clearance attempt would have been
> scheduled for about 1921.37.
>
>>   gpio-504 (                    |tx-fault            ) in  hi IRQ
>>   gpio-505 (                    |tx-disable          ) out lo
>>   gpio-506 (                    |rate-select0        ) in  lo
>>   gpio-507 (                    |los                 ) in  lo IRQ
>>   gpio-508 (                    |mod-def0            ) in  lo IRQ
>> 1921.86 3314.21
> 1.2s later, it re-asserts.
>
>>   gpio-504 (                    |tx-fault            ) in  lo IRQ
>>   gpio-505 (                    |tx-disable          ) out lo
>>   gpio-506 (                    |rate-select0        ) in  lo
>>   gpio-507 (                    |los                 ) in  lo IRQ
>>   gpio-508 (                    |mod-def0            ) in  lo IRQ
>> 1921.86 3314.21
> and deasserts within the same 10ms.
>
>>> However, bear in mind that even this will not be good enough to spot
>>> transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt
>>> capable, watching /proc/interrupts may tell you more.
>>>
>>> If the TX_FAULT signal is as stable as you claim it is, you should see
>>> the interrupt count for it remaining the same.
>> Once the iif is up those values remain stable indeed.
>>
>> cat /proc/interrupts | grep sfp
>>   52:          0          0   pca953x   4 Edge      sfp
>>   53:          0          0   pca953x   3 Edge      sfp
>>   54:          6          0   pca953x   0 Edge      sfp
>>
>> and only incrementing with ifupdown action (which would be logical)
>>
>> cat /proc/interrupts | grep sfp
>>   52:          0          0   pca953x   4 Edge      sfp
>>   53:          0          0   pca953x   3 Edge      sfp
>>   54:         11          0   pca953x   0 Edge      sfp
> According to this, TX_FAULT has toggled five times.
>
> This would seem to negate your previous comment about TX_FAULT being
> stable.

Maybe you misread what I wrote: it is stable once the interface is up; the
values do not change.
The 5 toggles result from manually invoking the ifupdown action.

> Therefore, I'd say that the SFP state machines are operating as
> designed, and as per the SFP MSA, and what we have is a module that
> likes to assert TX_FAULT for unknown reasons, and this confirms the
> hypothesis I've been putting forward.
>

This is based on the 5 IRQ toggles or the previous reading on the GPIO 
output?

Admittedly I have no clue about the time frames in which the module asserts /
deasserts, but since it works in general and only exhibits the issue
intermittently with ifupdown action after it has been brought up
initially, it does not seem to be caused by the module.

If the module were misbehaving in general it would never pass the sm
check and would thus be entirely unusable, which is not the case.

I suppose I just have to wait with fingers crossed that maybe at some
point in the future the issue stops somehow, considering:

- the module may not meet the SFP MSA 100%, though the MSA with its "shall"
and/or "may" wording does not appear obligatory and leaves wiggle room
- the chipset designer / manufacturer may not provide all necessary
documentation in the public domain and/or under a GPL license
- the vendor distributing the module cannot be bothered with support
- downstream distros building from upstream source likely lack the
resources to look into this and do not patch SFP.C in the first place anyway
- upstream development has provided the most extensive support, trying
to get to the bottom of it, and concluded that the module is misbehaving




* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 19:36                                                     ` ѽ҉ᶬḳ℠
@ 2020-01-10 19:55                                                       ` Russell King - ARM Linux admin
  2020-01-10 20:27                                                         ` ѽ҉ᶬḳ℠
  0 siblings, 1 reply; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 19:55 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 07:36:04PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 10/01/2020 19:01, Russell King - ARM Linux admin wrote:
> > On Fri, Jan 10, 2020 at 06:44:18PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > On 10/01/2020 17:38, Russell King - ARM Linux admin wrote:
> > > > > > On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > > > > > Seems that the debug avenue has been exhausted, short of running SFP.C in
> > > > > > > debug mode.
> > > > > > You're saying you never see TX_FAULT asserted other than when the
> > > > > > interface is down?
> > > > > Yes, it never exhibits once the iif is up - it is rock-stable in that state,
> > > > > only ever when being transitioned from down state to up state.
> > > > > Pardon, if that has not been made explicitly clear previously.
> > > > I think if we were to have SFP debug enabled, you'll find that
> > > > TX_FAULT is being reported to SFP as being asserted.
> > > If really necessary I could ask the TOS developers to assist, not sure
> > > whether they would oblige. Their Master branch build bot compiles twice a
> > > day.
> > > Would it just involve setting a kernel debug flag or something more
> > > elaborate?
> > > 
> > > > You probably aren't running that while loop, as it will exit when
> > > > it sees TX_FAULT asserted.  So, here's another bit of shell code
> > > > for you to run:
> > > > 
> > > > ip li set dev eth2 down; \
> > > > ip li set dev eth2 up; \
> > > > date
> > > > while :; do
> > > >     cat /proc/uptime
> > > >     while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
> > > >     cat /proc/uptime
> > > >     while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
> > > > done
> > > > 
> > > > This will give you output such as:
> > > > 
> > > > Fri 10 Jan 17:31:06 GMT 2020
> > > > 774869.13 1535859.48
> > > >    gpio-509 (                    |tx-fault            ) in  hi ...
> > > > 774869.14 1535859.49
> > > >    gpio-509 (                    |tx-fault            ) in  lo ...
> > > > 774869.15 1535859.50
> > > > 
> > > > The first date and "uptime" output is the timestamp when the interface
> > > > was brought up.  Subsequent "uptime" outputs can be used to calculate
> > > > the time difference in seconds between the state printed immediately
> > > > prior to the uptime output, and the first "uptime" output.
> > > > 
> > > > So in the above example, the tx-fault signal was hi at 10ms, and then
> > > > went low 20ms after the up.
> > > awfully nice of you to provide the code, this is the output when the iif is
> > > brought down and up again and exhibiting the transmit fault.
> > > 
> > > ip li set dev eth2 down; \
> > > > ip li set dev eth2 up; \
> > > > date
> > > Fri Jan 10 18:34:52 GMT 2020
> > > root@to:~# while :; do
> > > >     cat /proc/uptime
> > > >     while ! grep -A5 'tx-fault.*in  hi' /sys/kernel/debug/gpio; do :; done
> > > >     cat /proc/uptime
> > > >     while ! grep -A5 'tx-fault.*in  lo' /sys/kernel/debug/gpio; do :; done
> > > > done
> > Hmm, I missed a ; \ at the end of "date", so this isn't quite what
> > I wanted, but it'll do.  What that means is that:
> > 
> > > 1865.20 3224.67
> > doesn't bear the relationship that I wanted to the interface coming
> > up.
> > 
> > >   gpio-504 (                    |tx-fault            ) in  hi IRQ
> > >   gpio-505 (                    |tx-disable          ) out hi
> > >   gpio-506 (                    |rate-select0        ) in  lo
> > >   gpio-507 (                    |los                 ) in  lo IRQ
> > >   gpio-508 (                    |mod-def0            ) in  lo IRQ
> > > 1871.77 3230.71
> > TX_FAULT is high at 1871.77 and TX_DISABLE is high, so the interface
> > is down.
> > 
> > >   gpio-504 (                    |tx-fault            ) in  lo IRQ
> > >   gpio-505 (                    |tx-disable          ) out lo
> > >   gpio-506 (                    |rate-select0        ) in  lo
> > >   gpio-507 (                    |los                 ) in  lo IRQ
> > >   gpio-508 (                    |mod-def0            ) in  lo IRQ
> > > 1919.06 3309.55
> > Almost 47.3s later, TX_FAULT has gone low.
> 
> This correlates with invoking ifup

Yes, which concurs with my analysis.  Everything that happens from
this point on...

> > >   gpio-504 (                    |tx-fault            ) in  hi IRQ
> > >   gpio-505 (                    |tx-disable          ) out lo
> > >   gpio-506 (                    |rate-select0        ) in  lo
> > >   gpio-507 (                    |los                 ) in  lo IRQ
> > >   gpio-508 (                    |mod-def0            ) in  lo IRQ
> > > 1919.07 3309.57
> > After 10ms, it goes high again - this will cause the first report of
> > a transmit fault.
> > 
> > >   gpio-504 (                    |tx-fault            ) in  lo IRQ
> > >   gpio-505 (                    |tx-disable          ) out lo
> > >   gpio-506 (                    |rate-select0        ) in  lo
> > >   gpio-507 (                    |los                 ) in  lo IRQ
> > >   gpio-508 (                    |mod-def0            ) in  lo IRQ
> > > 1920.68 3312.28
> > About 1.6s later, it goes low, maybe as a result of the first attempt
> > to clear the fault by a brief pulse on TX_DISABLE.
> > 
> > So, we wait 1s before asserting TX_DISABLE for 10us, which would have
> > happened around 1920.07.  We then have 300ms for initialisation, which
> > would've taken us to 1920.37, so this may have been interpreted as the
> > fault still being present.  The next clearance attempt would have been
> > scheduled for about 1921.37.
> > 
> > >   gpio-504 (                    |tx-fault            ) in  hi IRQ
> > >   gpio-505 (                    |tx-disable          ) out lo
> > >   gpio-506 (                    |rate-select0        ) in  lo
> > >   gpio-507 (                    |los                 ) in  lo IRQ
> > >   gpio-508 (                    |mod-def0            ) in  lo IRQ
> > > 1921.86 3314.21
> > 1.2s later, it re-asserts.
> > 
> > >   gpio-504 (                    |tx-fault            ) in  lo IRQ
> > >   gpio-505 (                    |tx-disable          ) out lo
> > >   gpio-506 (                    |rate-select0        ) in  lo
> > >   gpio-507 (                    |los                 ) in  lo IRQ
> > >   gpio-508 (                    |mod-def0            ) in  lo IRQ
> > > 1921.86 3314.21
> > and deasserts within the same 10ms.

... is the "funny stuff" that is triggering the TX fault warnings
and should not be happening if the module was compliant with the
SFP MSA.
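
The intervals in the annotations above can be rechecked from the first
column of the /proc/uptime lines. A minimal sketch, assuming the captured
loop output is fed on stdin and awk is available; the function name
uptime_deltas is hypothetical:

```shell
# Print the time in seconds between consecutive /proc/uptime stamps.
# Only lines made of exactly two fields with a decimal first field are
# treated as uptime stamps; the gpio-NNN lines have more fields and are skipped.
uptime_deltas() {
    awk 'NF == 2 && $1 ~ /^[0-9]+\.[0-9]+$/ {
        if (seen) printf "%.2f\n", $1 - prev
        prev = $1
        seen = 1
    }'
}
```

Fed the stamps 1871.77, 1919.06, 1919.07, 1920.68 and 1921.86 from the log
above, this gives the intervals of roughly 47.3s, 10ms, 1.6s and 1.2s that
the annotations refer to.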

> > > > However, bear in mind that even this will not be good enough to spot
> > > > transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt
> > > > capable, watching /proc/interrupts may tell you more.
> > > > 
> > > > If the TX_FAULT signal is as stable as you claim it is, you should see
> > > > the interrupt count for it remaining the same.
> > > Once the iif is up those values remain stable indeed.
> > > 
> > > cat /proc/interrupts | grep sfp
> > >   52:          0          0   pca953x   4 Edge      sfp
> > >   53:          0          0   pca953x   3 Edge      sfp
> > >   54:          6          0   pca953x   0 Edge      sfp
> > > 
> > > and only incrementing with ifupdown action (which would be logical)
> > > 
> > > cat /proc/interrupts | grep sfp
> > >   52:          0          0   pca953x   4 Edge      sfp
> > >   53:          0          0   pca953x   3 Edge      sfp
> > >   54:         11          0   pca953x   0 Edge      sfp
> > According to this, TX_FAULT has toggled five times.
> > 
> > This would seem to negate your previous comment about TX_FAULT being
> > stable.
> 
> Maybe you misread what I wrote: it is stable once the interface is up; the
> values do not change.

Define "stable once the interface is up".  Is that stable after ten
seconds?  Or stable in under the 300ms initialisation delay allowed
by the SFP MSA?
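
To put a number on the edges produced by one down/up cycle, the totals for
the sfp interrupt lines can be captured before and after it. A minimal
sketch, assuming the /proc/interrupts layout shown earlier in this thread
(IRQ number, one count per CPU, then controller, pin, trigger type and the
name "sfp"); the function name sfp_irq_counts is hypothetical:

```shell
# Print "<irq> <total>" for every interrupt line whose last field is "sfp".
# The per-CPU counts are the numeric fields that directly follow the IRQ number.
sfp_irq_counts() {
    awk '$NF == "sfp" {
        total = 0
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
        sub(/:$/, "", $1)
        print $1, total
    }' "${1:-/proc/interrupts}"
}

# Possible use on the router (needs root for the ip commands):
#   before=$(sfp_irq_counts)
#   ip link set dev eth2 down; ip link set dev eth2 up; sleep 10
#   echo "$before"; sfp_irq_counts
```

This only shows how many edges were seen, not when they occurred, so it
complements the timestamped GPIO polling rather than replacing it.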

> The 5 toggles are resulting from manually invoking ifupdown action.
> 
> > Therefore, I'd say that the SFP state machines are operating as
> > designed, and as per the SFP MSA, and what we have is a module that
> > likes to assert TX_FAULT for unknown reasons, and this confirms the
> > hypothesis I've been putting forward.
> > 
> 
> This is based on the 5 IRQ toggles or the previous reading on the GPIO
> output?

On _both_.

> Admittedly I have no clue about the time frames in which the module asserts /
> deasserts, but since it works in general and only exhibits the issue
> intermittently with ifupdown action after it has been brought up initially,
> it does not seem to be caused by the module.
> 
> If the module were misbehaving in general it would never pass the sm
> check and would thus be entirely unusable, which is not the case.
> 
> I suppose I just have to wait with fingers crossed that maybe at some point in
> the future the issue stops somehow, considering:
> 
> - the module may not meet the SFP MSA 100%, though the MSA with its "shall"
> and/or "may" wording does not appear obligatory and leaves wiggle room
> - the chipset designer / manufacturer may not provide all necessary
> documentation in the public domain and/or under a GPL license
> - the vendor distributing the module cannot be bothered with support
> - downstream distros building from upstream source likely lack the
> resources to look into this and do not patch SFP.C in the first place anyway
> - upstream development has provided the most extensive support, trying to
> get to the bottom of it, and concluded that the module is misbehaving

Okay, I give up trying to help you.  Sorry, but I've spent a lot of
time over the last two days trying to help and explain stuff, and
you seem to want to constantly tell me I'm wrong, or misreading what
you're saying, or that there's some problem with the "sm check",
which I've already pointed out is a figment of your imagination.

Given also that your UTF-8 in your From: line _really_ screws up the
index in my mutt mail reader on the Linux console, disrupting my
ability to read other emails, it's now time for me to call an end to
this.

Sorry, but I'm not prepared to help any further.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 19:55                                                       ` Russell King - ARM Linux admin
@ 2020-01-10 20:27                                                         ` ѽ҉ᶬḳ℠
  0 siblings, 0 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 20:27 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev


On 10/01/2020 19:55, Russell King - ARM Linux admin wrote:
>
> Define "stable once the interface is up".  Is that stable after ten
> seconds?  Or stable in under the 300ms initialisation delay allowed
> by the SFP MSA?

The router boots, SFP.C is called and performs its functions, and the
module comes online and stays that way.
At some point the module thus apparently passes the checks, including the
300ms initialisation window, or else it would never come online and I
could trash it.
Once up it is rock-solid: the IRQ counts stay constant.

If at a later stage the interface is brought down and up again, the issue
starts to appear.

>
>> The 5 toggles are resulting from manually invoking ifupdown action.
>>
>>> Therefore, I'd say that the SFP state machines are operating as
>>> designed, and as per the SFP MSA, and what we have is a module that
>>> likes to assert TX_FAULT for unknown reasons, and this confirms the
>>> hypothesis I've been putting forward.
>>>
>> This is based on the 5 IRQ toggles or the previous reading on the GPIO
>> output?
> On _both_.
>
>
> Okay, I give up trying to help you.  Sorry, but I've spent a lot of
> time over the last two days trying to help and explain stuff, and
> you seem to want to constantly tell me I'm wrong, or misreading what
> you're saying, or that there's some problem with the "sm check",
> which I've already pointed out is a figment of your imagination.

I am not sure why you took such offence at that bit of summary.

I am not saying or implying that you are wrong, I just beg to differ: there
is no explanation why the module passes the test on initial up (at
boot time) but fails intermittently with ifupdown action later on;
that is all I am saying.
That the module is failing checks is hardly a figment of my imagination,
or else I would not have bothered seeking support in various places
before reaching all the way upstream, having first tried, in this
order:

- TOS
- OpenWrt
- vendor

> Sorry, but I'm not prepared to help any further.

It would be sad to leave on such a note, so let me
emphasize that I have thoroughly enjoyed and appreciated the exchange.
Unless you object to me posting to this mailing list, I will simply remove
your email address from the recipients.



* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 19:23                                                   ` Andrew Lunn
@ 2020-01-11 12:58                                                     ` ѽ҉ᶬḳ℠
  0 siblings, 0 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-11 12:58 UTC (permalink / raw)
  To: netdev; +Cc: Andrew Lunn

On 10/01/2020 19:23, Andrew Lunn wrote:
>> If really necessary I could ask the TOS developers to assist, not sure
>> whether they would oblige. Their Master branch build bot compiles twice a
>> day.
>> Would it just involve setting a kernel debug flag or something more
>> elaborate?
> You could ask them to build a kernel with dynamic debug enabled
>
> https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html
>
> You can then turn on debugging in a flexible way. And it will be
> useful for more than just you.
>
>      Andrew

I have put in the request and will see how it goes; if the answer is
positive I will probably need a bit of guidance on how to leverage it.
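
For reference, a minimal sketch of how dynamic debug is typically switched
on for a single source file, following the howto linked above. It assumes a
kernel built with CONFIG_DYNAMIC_DEBUG=y and debugfs mounted at
/sys/kernel/debug; the helper name sfp_dyndbg_query is hypothetical, and
which messages appear depends on the debug statements present in the
driver version in use:

```shell
# Build a dynamic debug query for sfp.c.
# "+" (the default) enables printing of the debug statements, "-" disables it.
sfp_dyndbg_query() {
    printf 'file sfp.c %sp\n' "${1:-+}"
}

# As root on the router:
#   sfp_dyndbg_query   > /sys/kernel/debug/dynamic_debug/control   # enable
#   ip link set dev eth2 down; ip link set dev eth2 up
#   dmesg | grep sfp
#   sfp_dyndbg_query - > /sys/kernel/debug/dynamic_debug/control   # disable
```

The enabled messages go to the kernel log, so they can be lined up against
the /proc/uptime stamps from the GPIO polling loop.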

Meanwhile, assuming for a moment that:

- the issue is not caused by SFP.C (emphasising that I reckon it is
indeed not)
- the issue is not caused by the module either (which I am not certain of)

, and considering that /sys/kernel/debug/gpio is not mounted on a block
device,

I was wondering whether the I2C bus could potentially get choked in some
way, preventing the module from propagating the change in tx-fault
signal state to the kernel in a timely fashion (300 ms), and thus be the
actual culprit?


Thread overview: 40+ messages
2020-01-09 13:47 [drivers/net/phy/sfp] intermittent failure in state machine checks ѽ҉ᶬḳ℠
2020-01-09 14:41 ` Andrew Lunn
2020-01-09 15:03   ` ѽ҉ᶬḳ℠
2020-01-09 15:58     ` Russell King - ARM Linux admin
2020-01-09 17:35       ` ѽ҉ᶬḳ℠
2020-01-09 17:43         ` Russell King - ARM Linux admin
2020-01-09 19:01           ` ѽ҉ᶬḳ℠
2020-01-09 19:42             ` ѽ҉ᶬḳ℠
2020-01-09 21:38               ` Russell King - ARM Linux admin
2020-01-09 21:59               ` Russell King - ARM Linux admin
2020-01-09 22:40                 ` ѽ҉ᶬḳ℠
2020-01-09 23:10                   ` Russell King - ARM Linux admin
2020-01-09 23:50                     ` ѽ҉ᶬḳ℠
2020-01-10  0:18                       ` ѽ҉ᶬḳ℠
2020-01-10 10:26                         ` Russell King - ARM Linux admin
2020-01-10  9:27                       ` Russell King - ARM Linux admin
2020-01-10  9:50                         ` ѽ҉ᶬḳ℠
2020-01-10 10:19                           ` ѽ҉ᶬḳ℠
2020-01-10 11:46                             ` Russell King - ARM Linux admin
2020-01-10 13:22                             ` Andrew Lunn
2020-01-10 13:38                               ` ѽ҉ᶬḳ℠
2020-01-10 11:44                           ` Russell King - ARM Linux admin
2020-01-10 12:45                             ` ѽ҉ᶬḳ℠
2020-01-10 12:53                               ` Russell King - ARM Linux admin
2020-01-10 15:02                                 ` ѽ҉ᶬḳ℠
2020-01-10 15:09                                   ` Russell King - ARM Linux admin
2020-01-10 15:45                                     ` ѽ҉ᶬḳ℠
2020-01-10 16:32                                       ` Russell King - ARM Linux admin
2020-01-10 16:53                                         ` ѽ҉ᶬḳ℠
2020-01-10 17:08                                           ` Russell King - ARM Linux admin
2020-01-10 17:19                                             ` ѽ҉ᶬḳ℠
2020-01-10 17:38                                               ` Russell King - ARM Linux admin
2020-01-10 18:44                                                 ` ѽ҉ᶬḳ℠
2020-01-10 19:01                                                   ` Russell King - ARM Linux admin
2020-01-10 19:36                                                     ` ѽ҉ᶬḳ℠
2020-01-10 19:55                                                       ` Russell King - ARM Linux admin
2020-01-10 20:27                                                         ` ѽ҉ᶬḳ℠
2020-01-10 19:23                                                   ` Andrew Lunn
2020-01-11 12:58                                                     ` ѽ҉ᶬḳ℠
2020-01-09 21:34             ` Russell King - ARM Linux admin
