* [drivers/net/phy/sfp] intermittent failure in state machine checks
@ 2020-01-09 13:47 ѽ҉ᶬḳ℠
  2020-01-09 14:41 ` Andrew Lunn
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-09 13:47 UTC (permalink / raw)
  To: netdev

On node with 4.19.93 and a SFP module (specs at the bottom) the following is
intermittently observed:

libphy: SFP I2C Bus: probed
sfp sfp: module ALLNET ALL4781 rev V3.4 sn 0000000FC9157640 dc 29-03-18
sfp sfp: unknown connector, encoding 8b10b, nominal bitrate 1.3Gbps +0% -0%
sfp sfp: 1000BaseSX+ 1000BaseLX- 1000BaseCX- 1000BaseT- 100BaseTLX- 1000BaseFX- BaseBX10- BasePX-
sfp sfp: 10GBaseSR- 10GBaseLR- 10GBaseLRM- 10GBaseER-
sfp sfp: Wavelength 0nm, fiber lengths:
sfp sfp: 9µm SM : unsupported
sfp sfp: 62.5µm MM OM1: unsupported/unspecified
sfp sfp: 50µm MM OM2: unsupported/unspecified
sfp sfp: 50µm MM OM3: unsupported/unspecified
sfp sfp: 50µm MM OM4: 2.540km
sfp sfp: Options: retimer
sfp sfp: Diagnostics:
sfp sfp: module transmit fault indicated
sfp sfp: module transmit fault recovered
sfp sfp: module transmit fault indicated
sfp sfp: module persistently indicates fault, disabling

To my humble understanding that pertains to checks in state machine

- SFP_S_WAIT_LOS
- SFP_S_LINK_UP

being done via the I2C | SM bus but it is not clear to me what causes the
check to fail and how to remedy it.
____

SFP module specs

Identifier            : 0x03 (SFP)
Extended identifier   : 0x04 (GBIC/SFP defined by 2-wire interface ID)
Connector             : 0x22 (RJ45)
Transceiver codes     : 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00
Transceiver type      : Ethernet: 1000BASE-SX
Encoding              : 0x01 (8B/10B)
BR, Nominal           : 1300MBd
Rate identifier       : 0x00 (unspecified)
Length (SMF,km)       : 0km
Length (SMF)          : 0m
Length (50um)         : 0m
Length (62.5um)       : 0m
Length (Copper)       : 255m
Length (OM3)          : 0m
Laser wavelength      : 0nm
Vendor name           : ALLNET
Vendor OUI            : 00:0f:c9
Vendor PN             : ALL4781
Vendor rev            : V3.4
Option values         : 0x08 0x00
Option                : Retimer or CDR implemented
BR margin, max        : 0%
BR margin, min        : 0%

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 13:47 [drivers/net/phy/sfp] intermittent failure in state machine checks ѽ҉ᶬḳ℠ @ 2020-01-09 14:41 ` Andrew Lunn 2020-01-09 15:03 ` ѽ҉ᶬḳ℠ 0 siblings, 1 reply; 40+ messages in thread From: Andrew Lunn @ 2020-01-09 14:41 UTC (permalink / raw) To: ѽ҉ᶬḳ℠, Russell King; +Cc: netdev On Thu, Jan 09, 2020 at 01:47:31PM +0000, ѽ҉ᶬḳ℠ wrote: > On node with 4.19.93 and a SFP module (specs at the bottom) the following is > intermittently observed: Please make sure Russell King is in Cc: for SFP issues. The state machine has been reworked recently. Please could you try net-next, or 5.5-rc5. Thanks Andrew ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 14:41 ` Andrew Lunn
@ 2020-01-09 15:03   ` ѽ҉ᶬḳ℠
  2020-01-09 15:58     ` Russell King - ARM Linux admin
  0 siblings, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-09 15:03 UTC (permalink / raw)
  To: Andrew Lunn, Russell King; +Cc: netdev

On 09/01/2020 14:41, Andrew Lunn wrote:
> On Thu, Jan 09, 2020 at 01:47:31PM +0000, ѽ҉ᶬḳ℠ wrote:
>> On node with 4.19.93 and a SFP module (specs at the bottom) the following is
>> intermittently observed:
> Please make sure Russell King is in Cc: for SFP issues.
>
> The state machine has been reworked recently. Please could you try
> net-next, or 5.5-rc5.
>
> Thanks
> Andrew

Unfortunately testing those branches is not feasible since the router (see
architecture below) that hosts the SFP module deploys the OpenWrt downstream
distro with LTS kernels - in their Master development branch 4.19.93 being
the most recent on offer.

Could the reworked state machine code commits be deployed as patches with
the 4.19 kernel, and if so, which commits would those be? Or, if not, will
those commits eventually ride the trains to the LTS branches, and what would
be the expected time frame for such an uplift?

The problem with those failing state machine checks is an inconvenient
disruption in the node's WAN connectivity, often requiring a reboot of the
node to get the connectivity reinstated.

Not sure whether pertinent at all (aka being clueless) but I noticed that for
big endian systems a check for an inverted LOS signal is implemented but not
for little endian systems.

___

Architecture:        armv7l
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           1
Vendor ID:           ARM
Model:               1
Model name:          Cortex-A9
Stepping:            r4p1
BogoMIPS:            1600.00
Flags:               half thumb fastmult vfp edsp neon vfpv3 tls vfpd32

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 15:03 ` ѽ҉ᶬḳ℠ @ 2020-01-09 15:58 ` Russell King - ARM Linux admin 2020-01-09 17:35 ` ѽ҉ᶬḳ℠ 0 siblings, 1 reply; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-09 15:58 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Thu, Jan 09, 2020 at 03:03:24PM +0000, ѽ҉ᶬḳ℠ wrote: > On 09/01/2020 14:41, Andrew Lunn wrote: > > On Thu, Jan 09, 2020 at 01:47:31PM +0000, ѽ҉ᶬḳ℠ wrote: > > > On node with 4.19.93 and a SFP module (specs at the bottom) the following is > > > intermittently observed: > > Please make sure Russell King is in Cc: for SFP issues. > > > > The state machine has been reworked recently. Please could you try > > net-next, or 5.5-rc5. > > > > Thanks > > Andrew > Unfortunately testing those branches is not feasible since the router (see > architecture below) that host the SFP module deploys the OpenWrt downstream > distro with LTS kernels - in their Master development branch 4.19.93 being > the most recent on offer. I don't think the rework will make any difference in this case, and I don't think there's anything failing in the software here. The reported problem seems to be this: sfp sfp: module transmit fault indicated sfp sfp: module transmit fault recovered sfp sfp: module transmit fault indicated sfp sfp: module persistently indicates fault, disabling which occurs if the module asserts the TX_FAULT signal. The SFP MSA defines that this indicates a problem with the laser safety circuitry, and defines a way to reset the fault (by pulsing TX_DISABLE and going through another initialisation). When TX_FAULT is asserted for the first time, "module transmit fault indicated" is printed, and we start the process of recovery. If we successfully recover, then "module transmit fault recovered" will be printed. 
We try several times to recover the fault, and once we're out of
retries, "module persistently indicates fault, disabling" will be
printed; at that point, we've declared the module to be dead, and
we won't do anything further with it.

This is by design; if the module is saying that the laser safety
circuitry is faulty, then endlessly resetting the module to recover
from that fault is not sane.

However, there's some modules (particularly GPON modules) that do
things quite differently from what the SFP MSA says, which is
extremely annoying and frustrating for those of us who are trying to
implement the host support.  There are some which seem to assert
TX_FAULT for unknown reasons.

In your original post (which you need to have sent to me, I don't
read netdev) you've provided "SFP module specs" - not really, you
provided the ethtool output, which is not the same as the module
specs.  Many modules have misleading EEPROM information, sometimes
to work around what people call "vendor lockin" or maybe to get
their module to work in some specific equipment.  In any case,
EEPROM information is not a specification.

For example, your module claims to be a 1000BASE-SX module.  If
I lookup "allnet ALL4781", I find that it's a VDSL2 module.  That
isn't a 1000BASE-SX module - 1000BASE-SX is an IEEE 802.3 defined
term to mean 1000BASE-X over fiber using a short-wavelength laser.

So, given that it doesn't have a laser, why is it raising TX_FAULT.
No idea; these modules are a law to themselves.

I think the only thing we could do is to implement a workaround to
ignore TX_FAULT for this module... great, more quirks. :(

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply [flat|nested] 40+ messages in thread
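The fault handling described in the message above (report the first
TX_FAULT, attempt recovery, give up once the retries run out) can be
modelled in a few lines. This is only a rough sketch and not the code in
drivers/net/phy/sfp.c: the function name, the retry budget of 5, and the
choice not to refill the budget after a successful recovery are all
assumptions made for illustration. Only the three log strings are taken
from the thread.

```python
from typing import Iterable, List


def simulate_tx_fault(samples: Iterable[bool], max_retries: int = 5) -> List[str]:
    """Toy model of the TX_FAULT handling described in the thread.

    samples:     successive TX_FAULT readings (True = asserted).
    max_retries: ASSUMED retry budget; the real value lives in the driver.
    """
    log: List[str] = []
    retries = max_retries
    faulted = False
    for asserted in samples:
        if asserted:
            retries -= 1  # assumption: the budget is never refilled
            if retries <= 0:
                log.append("module persistently indicates fault, disabling")
                break  # module is declared dead, nothing further is done
            if not faulted:
                log.append("module transmit fault indicated")
                faulted = True
            # A real host would pulse TX_DISABLE and re-initialise here.
        elif faulted:
            log.append("module transmit fault recovered")
            faulted = False
    return log
```

With these assumptions, a module that faults once, recovers, and then
keeps TX_FAULT asserted produces the same four messages as the log in the
first post: simulate_tx_fault([True, False, True, True, True, True]).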
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 15:58 ` Russell King - ARM Linux admin @ 2020-01-09 17:35 ` ѽ҉ᶬḳ℠ 2020-01-09 17:43 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-09 17:35 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev Kai Meitzner On 09/01/2020 15:58, Russell King - ARM Linux admin wrote: > On Thu, Jan 09, 2020 at 03:03:24PM +0000, ѽ҉ᶬḳ℠ wrote: >> On 09/01/2020 14:41, Andrew Lunn wrote: >>> On Thu, Jan 09, 2020 at 01:47:31PM +0000, ѽ҉ᶬḳ℠ wrote: >>>> On node with 4.19.93 and a SFP module (specs at the bottom) the following is >>>> intermittently observed: >>> Please make sure Russell King is in Cc: for SFP issues. >>> >>> The state machine has been reworked recently. Please could you try >>> net-next, or 5.5-rc5. >>> >>> Thanks >>> Andrew >> Unfortunately testing those branches is not feasible since the router (see >> architecture below) that host the SFP module deploys the OpenWrt downstream >> distro with LTS kernels - in their Master development branch 4.19.93 being >> the most recent on offer. > I don't think the rework will make any difference in this case, and > I don't think there's anything failing in the software here. The > reported problem seems to be this: > > sfp sfp: module transmit fault indicated > sfp sfp: module transmit fault recovered > sfp sfp: module transmit fault indicated > sfp sfp: module persistently indicates fault, disabling > > which occurs if the module asserts the TX_FAULT signal. The SFP MSA > defines that this indicates a problem with the laser safety circuitry, > and defines a way to reset the fault (by pulsing TX_DISABLE and going > through another initialisation). > > When TX_FAULT is asserted for the first time, "module transmit fault > indicated" is printed, and we start the process of recovery. If we > successfully recover, then "module transmit fault recovered" will be > printed. 
> > We try several times to recover the fault, and once we're out of > retries, "module persistently indicates fault, disabling" will be > printed; at that point, we've declared the module to be dead, and > we won't do anything further with it. > > This is by design; if the module is saying that the laser safety > circuitry is faulty, then endlessly resetting the module to recover > from that fault is not sane. > > However, there's some modules (particularly GPON modules) that do > things quite differently from what the SFP MSA says, which is > extremely annoying and frustrating for those of us who are trying to > implement the host support. There are some which seem to assert > TX_FAULT for unknown reasons. > > In your original post (which you need to have sent to me, I don't > read netdev) you've provided "SFP module specs" - not really, you > provided the ethtool output, which is not the same as the module > specs. Many modules have misleading EEPROM information, sometimes > to work around what people call "vendor lockin" or maybe to get > their module to work in some specific equipment. In any case, > EEPROM information is not a specification. > > For example, your module claims to be a 1000BASE-SX module. If > I lookup "allnet ALL4781", I find that it's a VDSL2 module. That > isn't a 1000BASE-SX module - 1000BASE-SX is an IEEE 802.3 defined > term to mean 1000BASE-X over fiber using a short-wavelength laser. > > So, given that it doesn't have a laser, why is it raising TX_FAULT. > No idea; these modules are a law to themselves. > > I think the only thing we could do is to implement a workaround to > ignore TX_FAULT for this module... great, more quirks. :( > Thank you for the extensive feedback and explanation. Pardon for having mixed up the semantics on module specifications vs. EEPROM dump... The module (chipset) been designed by Metanoia, not sure who is the actual manufacturer, and probably just been branded Allnet. 
The designer provides some proprietary management software (called EBM) to
their wholesale buyers only. Through EBM (I got hold of something called
DSLmonitor Lite) it reports as:

- board type AURORA
- Mode / Operation mode RT / VDSL

If it would be of any help I could provide what the EBM software calls a
"SoC dump".

I opened a support ticket with Allnet a few days back; their response is yet
to arrive.

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 17:35 ` ѽ҉ᶬḳ℠ @ 2020-01-09 17:43 ` Russell King - ARM Linux admin 2020-01-09 19:01 ` ѽ҉ᶬḳ℠ 0 siblings, 1 reply; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-09 17:43 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote: > Thank you for the extensive feedback and explanation. > > Pardon for having mixed up the semantics on module specifications vs. EEPROM > dump... > > The module (chipset) been designed by Metanoia, not sure who is the actual > manufacturer, and probably just been branded Allnet. > The designer provides some proprietary management software (called EBM) to > their wholesale buyers only I have one of their early MT-V5311 modules, but it has no accessible EEPROM, and even if it did, it would be of no use to me being unapproved for connection to the BT Openreach network. (BT SIN 498 specifies non-standard power profile to avoid crosstalk issues with existing ADSL infrastructure, and I believe they regularly check the connected modem type and firmware versions against an approved list.) I haven't noticed the module I have asserting its TX_FAULT signal, but then its RJ45 has never been connected to anything. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 17:43 ` Russell King - ARM Linux admin
@ 2020-01-09 19:01   ` ѽ҉ᶬḳ℠
  2020-01-09 19:42     ` ѽ҉ᶬḳ℠
  2020-01-09 21:34     ` Russell King - ARM Linux admin
  0 siblings, 2 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-09 19:01 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 09/01/2020 17:43, Russell King - ARM Linux admin wrote:
> On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote:
>> Thank you for the extensive feedback and explanation.
>>
>> Pardon for having mixed up the semantics on module specifications vs. EEPROM
>> dump...
>>
>> The module (chipset) been designed by Metanoia, not sure who is the actual
>> manufacturer, and probably just been branded Allnet.
>> The designer provides some proprietary management software (called EBM) to
>> their wholesale buyers only
> I have one of their early MT-V5311 modules, but it has no accessible
> EEPROM, and even if it did, it would be of no use to me being
> unapproved for connection to the BT Openreach network.  (BT SIN 498
> specifies non-standard power profile to avoid crosstalk issues with
> existing ADSL infrastructure, and I believe they regularly check the
> connected modem type and firmware versions against an approved list.)
>
> I haven't noticed the module I have asserting its TX_FAULT signal,
> but then its RJ45 has never been connected to anything.
>

The curious (and sort of inexplicable) thing is that the module in general
works, i.e. at some point it must pass the sm checks, or connectivity would
be failing constantly and the module would be generally unusable.

The reported issues however are intermittent, usually reliably reproducible
with

ifdown <iface> && ifup <iface>

or rebooting the router that hosts the module.

If some time passes between ifdown and ifup (not sure how long, but it seems
to be in excess of 3 minutes), the sm checks mostly do not fail.
It somehow "feels" that the module is storing some link signal information in
a register which does not suit the sm check routine, and only when that
register clears does the sm check routine pass and connectivity is restored.
____

Since there are probably other such SFP modules, xDSL and g.fast, out there
that do not provide laser safety circuitry by design (since they do not
provide connectivity over fibre), would it perhaps not make sense to check
for the existence of laser safety circuitry first, before getting to the sm
checks?
____

Sometime in the past sfp.c was not available in the distro and the issue
never exhibited. Back then the module's operation mode was set through a py
script - see bottom - but it would appear that it did not implement any sm
checks.

---py script

# EEPROM, d() and l() are not part of this excerpt.

class SFP:
    def __init__(self, i2cbus):
        self.i2cbus = i2cbus

    @staticmethod
    def detect_metanoia_xdsl(eeprom):
        # compares EEPROM bytes 40..50 against a fixed string
        return ['X', 'C', 'V', 'R', '-', '0', '8', '0', 'Y', '5', '5',] == eeprom[40:51]

    @staticmethod
    def detect_zisa_gpon(eeprom):
        # compares EEPROM bytes 40..46 against a fixed string
        return ['T', 'W', '2', '3', '6', '2', 'H'] == eeprom[40:47]

    @staticmethod
    def detect_sgmii(eeprom):
        if ord(eeprom[6]) & 0x08:
            d("Mode selected: generic SGMII")
            return True
        else:
            d("Mode selected: generic 1000BASE-X")
            return False

    def decide_sfpmode(self):
        ec = []
        try:
            ec = list(EEPROM(self.i2cbus).read_eeprom())
            d("SFP EEPROM: %s" % str(ec))
        except Exception as e:
            l("EEPROM read error: " + str(e))
            return 'phy-sfp'

        # special case: Metanoia xDSL SFP, 1000BASE-X, no link autonegotiation
        if self.detect_metanoia_xdsl(ec):
            l("Metanoia DSL SFP detected. Switching to phy-sfp-noneg mode.")
            return 'phy-sfp-noneg'

        # special case: Zisa GPON SFP, SGMII
        if self.detect_zisa_gpon(ec):
            l("Zisa GPON SFP detected. Switching to phy-sfp-sgmii mode.")
            return 'phy-sfp-sgmii'

        # SGMII detection
        if self.detect_sgmii(ec):
            return 'phy-sfp-sgmii'

        # default 1000BASE-X
        return 'phy-sfp'

^ permalink raw reply [flat|nested] 40+ messages in thread
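The detect_sgmii() check in the script above tests bit 3 (0x08) of EEPROM
byte 6. A minimal sketch of what that byte encodes follows. The bit names
are an assumption based on a reading of the SFF-8472 Ethernet compliance
codes; they match the order of the "1000BaseSX+ 1000BaseLX- ..." kernel
log line in the first post. The names BYTE6_CODES, ethernet_codes and
looks_like_sgmii are hypothetical and not part of the original script.
The value 0x01 is the fourth "Transceiver codes" byte from the posted
ethtool dump, assuming that listing starts at EEPROM byte 3.

```python
from typing import List

# ASSUMED bit assignment of EEPROM byte 6, least significant bit first.
BYTE6_CODES = [
    "1000BASE-SX",      # bit 0
    "1000BASE-LX",      # bit 1
    "1000BASE-CX",      # bit 2
    "1000BASE-T",       # bit 3 (the 0x08 tested by detect_sgmii)
    "100BASE-LX/LX10",  # bit 4
    "100BASE-FX",       # bit 5
    "BASE-BX10",        # bit 6
    "BASE-PX",          # bit 7
]


def ethernet_codes(byte6: int) -> List[str]:
    """Return the names of all compliance bits set in byte 6."""
    return [name for bit, name in enumerate(BYTE6_CODES) if byte6 & (1 << bit)]


def looks_like_sgmii(byte6: int) -> bool:
    """Same heuristic as detect_sgmii(): 1000BASE-T set means SGMII."""
    return bool(byte6 & 0x08)


if __name__ == "__main__":
    print(ethernet_codes(0x01), looks_like_sgmii(0x01))
```

Under these assumptions the posted module (byte 6 = 0x01) would be treated
as generic 1000BASE-X by that heuristic, not as SGMII.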
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 19:01 ` ѽ҉ᶬḳ℠ @ 2020-01-09 19:42 ` ѽ҉ᶬḳ℠ 2020-01-09 21:38 ` Russell King - ARM Linux admin 2020-01-09 21:59 ` Russell King - ARM Linux admin 2020-01-09 21:34 ` Russell King - ARM Linux admin 1 sibling, 2 replies; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-09 19:42 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev On 09/01/2020 19:01, ѽ҉ᶬḳ℠ wrote: > On 09/01/2020 17:43, Russell King - ARM Linux admin wrote: >> On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote: >>> Thank you for the extensive feedback and explanation. >>> >>> Pardon for having mixed up the semantics on module specifications >>> vs. EEPROM >>> dump... >>> >>> The module (chipset) been designed by Metanoia, not sure who is the >>> actual >>> manufacturer, and probably just been branded Allnet. >>> The designer provides some proprietary management software (called >>> EBM) to >>> their wholesale buyers only >> I have one of their early MT-V5311 modules, but it has no accessible >> EEPROM, and even if it did, it would be of no use to me being >> unapproved for connection to the BT Openreach network. (BT SIN 498 >> specifies non-standard power profile to avoid crosstalk issues with >> existing ADSL infrastructure, and I believe they regularly check the >> connected modem type and firmware versions against an approved list.) >> >> I haven't noticed the module I have asserting its TX_FAULT signal, >> but then its RJ45 has never been connected to anything. >> > > The curious (and sort of inexplicable) thing is that the module in > general works, i.e. at some point it must pass the sm checks or > connectivity would be failing constantly and thus the module being > generally unusable. > > The reported issues however are intermittent, usually reliably > reproducible with > > ifdown <iface> && ifup <iface> > > or rebooting the router that hosts the module. 
> > If some times passes, not sure but seems in excess of 3 minutes, > between ifdown and ifup the sm checks mostly are not failing. > It somehow "feels" that the module is storing some link signal > information in a register which does not suit the sm check routine and > only when that register clears the sm check routine passes and > connectivity is restored. > ____ > > Since there are probably other such SFP modules, xDSL and g.fast, out > there that do not provide laser safety circuitry by design (since not > providing connectivity over fibre) would it perhaps not make sense to > try checking for the existence of laser safety circuitry first prior > getting to the sm checks? > ____ > I am wondering whether this mentioned in https://gitlab.labs.nic.cz/turris/turris-build/issues/89 is the cause of the issue perhaps: Even when/after the SFP module is recognized and the link mode it set for the NIC to the proper value there can still be the link-up signal mismatch that we have seen on many non-ethernet SFPs. The thing is that one of the SFP pins is called LOS (loss of signal) and when the pin is in active state it is being interpreted by the Linux kernel as "link is down", turn off the NIC. Unfortunatelly we have seen chicken-and-egg problem with some GPON and DSL SFPs - the SFP does not come up and deassert LOS unless there is SGMII link from NIC and NIC is not coming up unless LOS is deasserted. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 19:42 ` ѽ҉ᶬḳ℠ @ 2020-01-09 21:38 ` Russell King - ARM Linux admin 2020-01-09 21:59 ` Russell King - ARM Linux admin 1 sibling, 0 replies; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-09 21:38 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Thu, Jan 09, 2020 at 07:42:27PM +0000, ѽ҉ᶬḳ℠ wrote: > On 09/01/2020 19:01, ѽ҉ᶬḳ℠ wrote: > > On 09/01/2020 17:43, Russell King - ARM Linux admin wrote: > > > On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote: > > > > Thank you for the extensive feedback and explanation. > > > > > > > > Pardon for having mixed up the semantics on module > > > > specifications vs. EEPROM > > > > dump... > > > > > > > > The module (chipset) been designed by Metanoia, not sure who is > > > > the actual > > > > manufacturer, and probably just been branded Allnet. > > > > The designer provides some proprietary management software > > > > (called EBM) to > > > > their wholesale buyers only > > > I have one of their early MT-V5311 modules, but it has no accessible > > > EEPROM, and even if it did, it would be of no use to me being > > > unapproved for connection to the BT Openreach network. (BT SIN 498 > > > specifies non-standard power profile to avoid crosstalk issues with > > > existing ADSL infrastructure, and I believe they regularly check the > > > connected modem type and firmware versions against an approved list.) > > > > > > I haven't noticed the module I have asserting its TX_FAULT signal, > > > but then its RJ45 has never been connected to anything. > > > > > > > The curious (and sort of inexplicable) thing is that the module in > > general works, i.e. at some point it must pass the sm checks or > > connectivity would be failing constantly and thus the module being > > generally unusable. 
> > > > The reported issues however are intermittent, usually reliably > > reproducible with > > > > ifdown <iface> && ifup <iface> > > > > or rebooting the router that hosts the module. > > > > If some times passes, not sure but seems in excess of 3 minutes, between > > ifdown and ifup the sm checks mostly are not failing. > > It somehow "feels" that the module is storing some link signal > > information in a register which does not suit the sm check routine and > > only when that register clears the sm check routine passes and > > connectivity is restored. > > ____ > > > > Since there are probably other such SFP modules, xDSL and g.fast, out > > there that do not provide laser safety circuitry by design (since not > > providing connectivity over fibre) would it perhaps not make sense to > > try checking for the existence of laser safety circuitry first prior > > getting to the sm checks? > > ____ > > > > I am wondering whether this mentioned in > https://gitlab.labs.nic.cz/turris/turris-build/issues/89 is the cause of the > issue perhaps: > > Even when/after the SFP module is recognized and the link mode it set for > the NIC to the proper value there can still be the link-up signal mismatch > that we have seen on many non-ethernet SFPs. The thing is that one of the > SFP pins is called LOS (loss of signal) and when the pin is in active state > it is being interpreted by the Linux kernel as "link is down", turn off the > NIC. Unfortunatelly we have seen chicken-and-egg problem with some GPON and > DSL SFPs - the SFP does not come up and deassert LOS unless there is SGMII > link from NIC and NIC is not coming up unless LOS is deasserted. That would be very very broken behaviour, but one which the kernel doesn't care about. If RX_LOS is active, we do *not* disable the NIC. We just use RX_LOS as an additional input to evaluating whether the link is up. The NIC will still be configured for the appropriate mode irrespective of the state of RX_LOS. 
-- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 19:42 ` ѽ҉ᶬḳ℠ 2020-01-09 21:38 ` Russell King - ARM Linux admin @ 2020-01-09 21:59 ` Russell King - ARM Linux admin 2020-01-09 22:40 ` ѽ҉ᶬḳ℠ 1 sibling, 1 reply; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-09 21:59 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Thu, Jan 09, 2020 at 07:42:27PM +0000, ѽ҉ᶬḳ℠ wrote: > On 09/01/2020 19:01, ѽ҉ᶬḳ℠ wrote: > > On 09/01/2020 17:43, Russell King - ARM Linux admin wrote: > > > On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote: > > > > Thank you for the extensive feedback and explanation. > > > > > > > > Pardon for having mixed up the semantics on module > > > > specifications vs. EEPROM > > > > dump... > > > > > > > > The module (chipset) been designed by Metanoia, not sure who is > > > > the actual > > > > manufacturer, and probably just been branded Allnet. > > > > The designer provides some proprietary management software > > > > (called EBM) to > > > > their wholesale buyers only > > > I have one of their early MT-V5311 modules, but it has no accessible > > > EEPROM, and even if it did, it would be of no use to me being > > > unapproved for connection to the BT Openreach network. (BT SIN 498 > > > specifies non-standard power profile to avoid crosstalk issues with > > > existing ADSL infrastructure, and I believe they regularly check the > > > connected modem type and firmware versions against an approved list.) > > > > > > I haven't noticed the module I have asserting its TX_FAULT signal, > > > but then its RJ45 has never been connected to anything. > > > > > > > The curious (and sort of inexplicable) thing is that the module in > > general works, i.e. at some point it must pass the sm checks or > > connectivity would be failing constantly and thus the module being > > generally unusable. 
> > > > The reported issues however are intermittent, usually reliably > > reproducible with > > > > ifdown <iface> && ifup <iface> > > > > or rebooting the router that hosts the module. > > > > If some times passes, not sure but seems in excess of 3 minutes, between > > ifdown and ifup the sm checks mostly are not failing. > > It somehow "feels" that the module is storing some link signal > > information in a register which does not suit the sm check routine and > > only when that register clears the sm check routine passes and > > connectivity is restored. > > ____ > > > > Since there are probably other such SFP modules, xDSL and g.fast, out > > there that do not provide laser safety circuitry by design (since not > > providing connectivity over fibre) would it perhaps not make sense to > > try checking for the existence of laser safety circuitry first prior > > getting to the sm checks? > > ____ > > > > I am wondering whether this mentioned in > https://gitlab.labs.nic.cz/turris/turris-build/issues/89 is the cause of the > issue perhaps: > > Even when/after the SFP module is recognized and the link mode it set for > the NIC to the proper value there can still be the link-up signal mismatch > that we have seen on many non-ethernet SFPs. The thing is that one of the > SFP pins is called LOS (loss of signal) and when the pin is in active state > it is being interpreted by the Linux kernel as "link is down", turn off the > NIC. Unfortunatelly we have seen chicken-and-egg problem with some GPON and > DSL SFPs - the SFP does not come up and deassert LOS unless there is SGMII > link from NIC and NIC is not coming up unless LOS is deasserted. Also, note that the Metanoia MT-V5311 (at least mine) uses 1000BASE-X not SGMII. 
It sends a 16-bit configuration word of 0x61a0, which is:

                1000BASE-X                      SGMII
Bit 15  0       No next page                    Link down
        1       Ack                             Ack
        1       Remote fault 2                  Reserved (0)
        0       Remote fault 1                  Duplex (0 = Half)

        0       Reserved (0)                    Speed bit 1
        0       Reserved (0)                    Speed bit 0 (00=10Mbps)
        0       Reserved (0)                    Reserved (0)
        1       Asymmetric pause direction      Reserved (0)

        1       Pause                           Reserved (0)
        0       Half duplex not supported       Reserved (0)
        1       Full duplex supported           Reserved (0)
        0       Reserved (0)                    Reserved (0)

        0       Reserved (0)                    Reserved (0)
        0       Reserved (0)                    Reserved (0)
        0       Reserved (0)                    Reserved (0)
Bit 0   0       Reserved (0)                    Must be 1

So it clearly fits 802.3 Clause 37 1000BASE-X format, reporting 1G
Full duplex, and not SGMII (10M Half duplex).

I have a platform here that allows me to get at the raw config_reg
word that the other end has sent which allows analysis as per the
above.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply [flat|nested] 40+ messages in thread
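The bit-by-bit reading of 0x61a0 in the message above can be reproduced
with a short script. This is a sketch for illustration only; the function
and table names are hypothetical. The 1000BASE-X bit positions follow the
table in the message. The SGMII speed encodings other than 00 = 10Mbps are
not given in the thread and are an assumption taken from the usual SGMII
convention (01 = 100Mbps, 10 = 1000Mbps).

```python
from typing import Dict, List

# Bit positions of the 1000BASE-X (Clause 37) word, per the table above.
CLAUSE37_BITS = {
    15: "next page",
    14: "ack",
    13: "remote fault 2",
    12: "remote fault 1",
    8: "asymmetric pause",
    7: "pause",
    6: "half duplex",
    5: "full duplex",
}

# ASSUMPTION: only 00 = 10Mbps appears in the thread.
SGMII_SPEEDS = {0: "10Mbps", 1: "100Mbps", 2: "1000Mbps"}


def decode_1000basex(word: int) -> List[str]:
    """List the Clause 37 flags set in the word, highest bit first."""
    return [
        name
        for bit, name in sorted(CLAUSE37_BITS.items(), reverse=True)
        if word & (1 << bit)
    ]


def decode_sgmii(word: int) -> Dict[str, object]:
    """Interpret the same word as an SGMII control word."""
    return {
        "link_up": bool(word & (1 << 15)),
        "duplex": "full" if word & (1 << 12) else "half",
        "speed": SGMII_SPEEDS.get((word >> 10) & 0x3, "reserved"),
        "bit0_is_one": bool(word & 1),  # the table says bit 0 must be 1
    }


if __name__ == "__main__":
    print(decode_1000basex(0x61A0))
    print(decode_sgmii(0x61A0))
```

Read as SGMII, the word fails the "bit 0 must be 1" requirement and would
mean link down at 10M half duplex, which is the reasoning behind
preferring the 1000BASE-X interpretation.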
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 21:59 ` Russell King - ARM Linux admin @ 2020-01-09 22:40 ` ѽ҉ᶬḳ℠ 2020-01-09 23:10 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-09 22:40 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev On 09/01/2020 21:59, Russell King - ARM Linux admin wrote: > > Also, note that the Metanoia MT-V5311 (at least mine) uses 1000BASE-X > not SGMII. It sends a 16-bit configuration word of 0x61a0, which is: > > 1000BASE-X SGMII > Bit 15 0 No next page Link down > 1 Ack Ack > 1 Remote fault 2 Reserved (0) > 0 Remote fault 1 Duplex (0 = Half) > > 0 Reserved (0) Speed bit 1 > 0 Reserved (0) Speed bit 0 (00=10Mbps) > 0 Reserved (0) Reserved (0) > 1 Asymetric pause direction Reserved (0) > > 1 Pause Reserved (0) > 0 Half duplex not supported Reserved (0) > 1 Full duplex supported Reserved (0) > 0 Reserved (0) Reserved (0) > > 0 Reserved (0) Reserved (0) > 0 Reserved (0) Reserved (0) > 0 Reserved (0) Reserved (0) > Bit 0 0 Reserved (0) Must be 1 > > So it clearly fits 802.3 Clause 37 1000BASE-X format, reporting 1G > Full duplex, and not SGMII (10M Half duplex). > > I have a platform here that allows me to get at the raw config_reg > word that the other end has sent which allows analysis as per the > above. 
> The driver reports also 1000base-x for this Metonia/Allnet module: mvneta f1034000.ethernet eth2: switched to inband/1000base-x link mode mii-tool -v eth2 producing eth2: 1000 Mbit, full duplex, link ok product info: vendor 00:00:00, model 0 rev 0 basic mode: 10 Mbit, full duplex basic status: autonegotiation complete, link ok capabilities: advertising: 1000baseT-HD 1000baseT-FD 100baseT4 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control ______ On 09/01/2020 21:34, Russell King - ARM Linux admin wrote: > You can check the state of the GPIOs by looking at > /sys/kernel/debug/gpio, and you will probably see that TX_FAULT is > being asserted by the module. With OpenWrt trying to save space wherever they can # CONFIG_DEBUG_GPIO is not set this avenue is unfortunately is not available. Is there some other way (Linux userland) to query TX_FAULT and RX_LOS and whether either/both being asserted or deasserted? ___ On 09/01/2020 21:34, Russell King - ARM Linux admin wrote: > BTW, I notice in you original kernel that you have at least one of my > "experimental" patches on your stable kernel taken from my "phy" branch > which has never been in mainline, so I guess you're using the OpenWRT > kernel? I am not aware were the code originated from. It is not exactly OpenWrt but TOS (for the Turris Omnia router), being a downstream patchset that builds on top of OpenWrt. The TOS developers might be known at Linux kernel development, recently added their MOX platform and also with regard to Multi-CPU-DSA. ___ On 09/01/2020 21:34, Russell King - ARM Linux admin wrote: > You're reading/way/ too much into the state machine. How so? Those intermittent failures cause disruption in the WAN connectivity - nothing life threatening but somewhat inconvenient. I am trying to get to the bottom of it, with my limited capabilities and with your input it has helped. 
I will ping Allnet again and see whether they bother to respond and shed some light on what their module does with regard to TX_FAULT and RX_LOS.

^ permalink raw reply	[flat|nested] 40+ messages in thread
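Editorial note: the bit-by-bit reading of the 0x61a0 configuration word quoted above can be reproduced with a short Python sketch. The bit positions follow the table in the quoted message; the function name and dictionary keys are made up for illustration, and the SGMII speed encodings other than 00=10Mbps are an assumption taken from the commonly documented SGMII layout rather than from this thread.

```python
def decode_config_word(word):
    """Decode a 16-bit in-band configuration word under both interpretations."""
    def bit(n):
        return (word >> n) & 1

    # 802.3 Clause 37 (1000BASE-X) view, per the table in the quoted message.
    base_x = {
        "next_page": bit(15),
        "ack": bit(14),
        "remote_fault_2": bit(13),
        "remote_fault_1": bit(12),
        "asym_pause": bit(8),
        "pause": bit(7),
        "half_duplex": bit(6),
        "full_duplex": bit(5),
    }

    # SGMII view of the same word. Only 00=10Mbps appears in the thread;
    # the other speed encodings are an assumption.
    speeds = {0b00: "10Mbps", 0b01: "100Mbps", 0b10: "1000Mbps", 0b11: "reserved"}
    sgmii = {
        "link_up": bit(15),
        "ack": bit(14),
        "full_duplex": bit(12),
        "speed": speeds[(word >> 10) & 0b11],
        "bit0_is_one": bit(0),  # the table says this bit must be 1 for SGMII
    }
    return {"1000base-x": base_x, "sgmii": sgmii}


if __name__ == "__main__":
    for name, fields in decode_config_word(0x61A0).items():
        print(name, fields)
```

Read as 1000BASE-X the word advertises full duplex with pause; read as SGMII it would mean link down at 10M half duplex with bit 0 clear, which is why the 1000BASE-X reading is the plausible one.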
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 22:40 ` ѽ҉ᶬḳ℠ @ 2020-01-09 23:10 ` Russell King - ARM Linux admin 2020-01-09 23:50 ` ѽ҉ᶬḳ℠ 0 siblings, 1 reply; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-09 23:10 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Thu, Jan 09, 2020 at 10:40:24PM +0000, ѽ҉ᶬḳ℠ wrote: > > On 09/01/2020 21:59, Russell King - ARM Linux admin wrote: > > > > Also, note that the Metanoia MT-V5311 (at least mine) uses 1000BASE-X > > not SGMII. It sends a 16-bit configuration word of 0x61a0, which is: > > > > 1000BASE-X SGMII > > Bit 15 0 No next page Link down > > 1 Ack Ack > > 1 Remote fault 2 Reserved (0) > > 0 Remote fault 1 Duplex (0 = Half) > > > > 0 Reserved (0) Speed bit 1 > > 0 Reserved (0) Speed bit 0 (00=10Mbps) > > 0 Reserved (0) Reserved (0) > > 1 Asymetric pause direction Reserved (0) > > > > 1 Pause Reserved (0) > > 0 Half duplex not supported Reserved (0) > > 1 Full duplex supported Reserved (0) > > 0 Reserved (0) Reserved (0) > > > > 0 Reserved (0) Reserved (0) > > 0 Reserved (0) Reserved (0) > > 0 Reserved (0) Reserved (0) > > Bit 0 0 Reserved (0) Must be 1 > > > > So it clearly fits 802.3 Clause 37 1000BASE-X format, reporting 1G > > Full duplex, and not SGMII (10M Half duplex). > > > > I have a platform here that allows me to get at the raw config_reg > > word that the other end has sent which allows analysis as per the > > above. 
> > > > The driver reports also 1000base-x for this Metonia/Allnet module: > > mvneta f1034000.ethernet eth2: switched to inband/1000base-x link mode > > mii-tool -v eth2 producing > > eth2: 1000 Mbit, full duplex, link ok > product info: vendor 00:00:00, model 0 rev 0 > basic mode: 10 Mbit, full duplex > basic status: autonegotiation complete, link ok > capabilities: > advertising: 1000baseT-HD 1000baseT-FD 100baseT4 100baseTx-FD > 100baseTx-HD 10baseT-FD 10baseT-HD flow-control Please don't use mii-tool with SFPs that do not have a PHY; the "PHY" registers are emulated, and are there just for compatibility. Please use ethtool in preference, especially for SFPs. > On 09/01/2020 21:34, Russell King - ARM Linux admin wrote: > > You can check the state of the GPIOs by looking at > > /sys/kernel/debug/gpio, and you will probably see that TX_FAULT is > > being asserted by the module. > > With OpenWrt trying to save space wherever they can > > # CONFIG_DEBUG_GPIO is not set > > this avenue is unfortunately is not available. Is there some other way > (Linux userland) to query TX_FAULT and RX_LOS and whether either/both being > asserted or deasserted? CONFIG_DEBUG_GPIO is not the same as having debugfs support enabled. If debugfs is enabled, then gpiolib will provide the current state of gpios through debugfs. debugfs is normally mounted on /sys/kernel/debug, but may not be mounted by default depending on policy. Looking in /proc/filesystems will tell you definitively whether debugfs is enabled or not in the kernel. > On 09/01/2020 21:34, Russell King - ARM Linux admin wrote: > > BTW, I notice in you original kernel that you have at least one of my > > "experimental" patches on your stable kernel taken from my "phy" branch > > which has never been in mainline, so I guess you're using the OpenWRT > > kernel? > I am not aware were the code originated from. 
It is not exactly OpenWrt but > TOS (for the Turris Omnia router), being a downstream patchset that builds > on top of OpenWrt. The TOS developers might be known at Linux kernel > development, recently added their MOX platform and also with regard to > Multi-CPU-DSA. So, if that is correct... Current OpenWRT is derived from 4.19-stable kernels, which include experimental patches picked at some point from my "phy" branch, and TOS is derived from OpenWRT. That makes it very difficult for anyone in the mainline kernel community to do anything about this; sending you a patch is likely useless since you're not going to be able to test it. > On 09/01/2020 21:34, Russell King - ARM Linux admin wrote: > > You're reading/way/ too much into the state machine. > > How so? Those intermittent failures cause disruption in the WAN connectivity > - nothing life threatening but somewhat inconvenient. You think the state machines are doing something clever. They don't. They are all very simple and quite dumb. > I am trying to get to the bottom of it, with my limited capabilities and > with your input it has helped. I will ping Allnet again and see whether they > bother to respond and shed some light of what their modules does with regard > to TX_FAULT and RX_LOS. The only real way to get to the bottom of it is to manually enable debug in sfp.c so its possible to watch what happens, not only with the hardware signals but also what the state machines are doing. However, I'm very certain that there is no problem with the state machines, and it is that the Allnet module is raising TX_FAULT. I also think from what you've said above that rebuilding a kernel to enable debug in sfp.c is going to not be possible for you. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up ^ permalink raw reply [flat|nested] 40+ messages in thread
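Editorial note: the debugfs check described in the message above (is debugfs built into the kernel, and where is it mounted) can be sketched in Python. This is a minimal sketch that assumes the usual whitespace-separated layouts of /proc/filesystems and /proc/mounts; both function names are hypothetical.

```python
def debugfs_supported(proc_filesystems_text):
    """Return True if 'debugfs' is listed in the text of /proc/filesystems."""
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        # Lines look like "nodev<TAB>debugfs" or "<TAB>ext4"; the name is last.
        if fields and fields[-1] == "debugfs":
            return True
    return False


def debugfs_mountpoints(proc_mounts_text):
    """Return the mount points of type debugfs found in /proc/mounts text."""
    points = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # Assumed field order: device, mount point, filesystem type, options...
        if len(fields) >= 3 and fields[2] == "debugfs":
            points.append(fields[1])
    return points


# Usage on a live system (not executed here):
#   with open("/proc/filesystems") as f:
#       print(debugfs_supported(f.read()))
#   with open("/proc/mounts") as f:
#       print(debugfs_mountpoints(f.read()))
```

If the first function returns True but the second returns an empty list, debugfs is available but not mounted, which matches the "depending on policy" case mentioned above.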
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 23:10 ` Russell King - ARM Linux admin @ 2020-01-09 23:50 ` ѽ҉ᶬḳ℠ 2020-01-10 0:18 ` ѽ҉ᶬḳ℠ 2020-01-10 9:27 ` Russell King - ARM Linux admin 0 siblings, 2 replies; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-09 23:50 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev On 09/01/2020 23:10, Russell King - ARM Linux admin wrote: > > Please don't use mii-tool with SFPs that do not have a PHY; the "PHY" > registers are emulated, and are there just for compatibility. Please > use ethtool in preference, especially for SFPs. Sure, just ethtool is not much of help for this particular matter, all there is ethtool -m and according to you the EEPROM dump is not to be relied on. > > CONFIG_DEBUG_GPIO is not the same as having debugfs support enabled. > If debugfs is enabled, then gpiolib will provide the current state > of gpios through debugfs. debugfs is normally mounted on > /sys/kernel/debug, but may not be mounted by default depending on > policy. Looking in /proc/filesystems will tell you definitively > whether debugfs is enabled or not in the kernel. debugsfs is mounted but ls -af /sys/kernel/debug/gpio only producing (oddly): /sys/kernel/debug/gpio > > So, if that is correct... > > Current OpenWRT is derived from 4.19-stable kernels, which include > experimental patches picked at some point from my "phy" branch, and > TOS is derived from OpenWRT. This may not be correct since there are not many device targets in OpenWrt that feature a SFP cage (least as of today), the Turris Omnia might even be the sole one. I did not check whether that the code was/is available in OpenWrt, and likely it is not, but it was in an earlier TOS version since their platforms apparently feature a SFP cage. 
> That makes it very difficult for anyone in the mainline kernel
> community to do anything about this; sending you a patch is likely
> useless since you're not going to be able to test it.

I understand. I reached out all the way upstream only because the other available avenues, starting all the way downstream, did not produce anything tangible or even a response. I am grateful that you at least obliged and shed some light on the matter. Maybe I should just try finding a module that is declared SFP MSA compliant...

>
> You think the state machines are doing something clever. They don't.
> They are all very simple and quite dumb.

Not really; I assume it just does what it is supposed to do, in line with current (industry) standards and best practices.

>
> The only real way to get to the bottom of it is to manually enable
> debug in sfp.c so it's possible to watch what happens, not only with
> the hardware signals but also what the state machines are doing.
> However, I'm very certain that there is no problem with the state
> machines, and it is that the Allnet module is raising TX_FAULT.

I am sure it does, and I am pursuing Allnet for a response, although that does not look promising at the moment. Once there is one, however, I shall pick up the thread again.

> I also think from what you've said above that rebuilding a kernel
> to enable debug in sfp.c is going to not be possible for you.

No. I might be able to get this done for amd64, but with this ARM SoC there is all kinds of other stuff (SPI, MTD, I2C, U-Boot and whatnot) involved, and I am afraid it will go sideways if I attempt compiling.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-09 23:50 ` ѽ҉ᶬḳ℠
@ 2020-01-10  0:18 ` ѽ҉ᶬḳ℠
  2020-01-10 10:26   ` Russell King - ARM Linux admin
  2020-01-10  9:27   ` Russell King - ARM Linux admin
  1 sibling, 1 reply; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 0:18 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 09/01/2020 23:50, ѽ҉ᶬḳ℠ wrote:
> Maybe I should just try finding a module that is declared SFP MSA
> compliant...

Actually, the vendor declares (https://www.allnet.de/en/allnet-brand/produkte/neuheiten/p/-0c35cc9ea9/):

*ALLNET ALL4781-VDSL2-SFP* is a VDSL2 SFP modem that interconnects with Gateway Processor by using a MSA (MultiSource Agreement) compliant hot pluggable electrical interface.

OK, "a MSA" does not explicitly state or imply the SFP MSA, but what other MSA could that be? If the module is indeed SFP MSA compliant, the issue should not happen, unless the claim is just marketing speak and does not hold true.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10  0:18 ` ѽ҉ᶬḳ℠
@ 2020-01-10 10:26 ` Russell King - ARM Linux admin
  0 siblings, 0 replies; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 10:26 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 12:18:41AM +0000, ѽ҉ᶬḳ℠ wrote:
> On 09/01/2020 23:50, ѽ҉ᶬḳ℠ wrote:
> > Maybe I should just try finding a module that is declared SFP MSA
> > compliant...
>
> Actually, the vendor declares
> (https://www.allnet.de/en/allnet-brand/produkte/neuheiten/p/-0c35cc9ea9/):
>
> *ALLNET ALL4781-VDSL2-SFP* is a VDSL2 SFP modem that interconnects with
> Gateway Processor by using a MSA (MultiSource Agreement) compliant hot
> pluggable electrical interface.
>
> Ok, "a MSA" does not explicitly state/imply SFP MSA but what other MSA could
> that be?
> If it is indeed SFP MSA compliant the issue should not happen. Unless it is
> just marketing speak and does not hold true.

Everyone claims that their SFP is MSA compliant, even when the module:

1) takes 40-50 seconds after deasserting TX_DISABLE to initialise and
   deassert TX_FAULT, when the SFP MSA explicitly states a limit of
   300ms (t_init) for TX_FAULT to deassert.

2) EEPROM does not respond for 50 seconds after plugging in, where the
   SFP MSA explicitly states 300ms (t_serial) maximum.

3) EEPROM contains incorrect data, for example:
   - indicating the module has a LC connector, yet it has an RJ45, or
     vice versa.
   - indicating NRZ encoding for an ethernet SFP, where it should be
     8b10b or 64b66b encoding.
   - indicating a single data rate, or even the wrong data rate, when
     the module is documented as supporting other rates.
   - indicating an extended compliance technology that it doesn't
     support, presumably originally chosen when the number was
     unallocated by SFF-8024.
   - claiming to support 1000BASE-SX, a fiber standard, when the module
     is actually for VDSL2 over copper.

... etc ...

So, I tend to ignore "SFP MSA compliant" whenever I see it; it is mostly meaningless. Yes, there are modules out there which are compliant, but those that claim compliance but aren't make the claim meaningless for everyone.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply	[flat|nested] 40+ messages in thread
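Editorial note: two of the EEPROM inconsistencies listed above (connector type versus claimed Ethernet compliance) can be checked mechanically. The sketch below is illustrative only: the function name is hypothetical, and the numeric codes are assumptions based on the 0x22 (RJ45) connector value and the 1000BASE-SX transceiver bit shown in the module dump at the start of the thread, plus my reading of the SFF-8024 and SFF-8472 tables, which should be verified against those documents.

```python
# Assumed codes; verify against SFF-8024 (connector) and SFF-8472 (byte 6).
CONNECTOR_LC = 0x07
CONNECTOR_RJ45 = 0x22
ETH_1000BASE_SX = 0x01
ETH_1000BASE_LX = 0x02
ETH_1000BASE_CX = 0x04
ETH_1000BASE_T = 0x08


def eeprom_inconsistencies(connector, eth_compliance):
    """Return a list of human-readable mismatches between two EEPROM fields."""
    problems = []
    optical = eth_compliance & (ETH_1000BASE_SX | ETH_1000BASE_LX)
    copper = eth_compliance & (ETH_1000BASE_CX | ETH_1000BASE_T)
    if connector == CONNECTOR_RJ45 and optical:
        problems.append("RJ45 connector with optical (SX/LX) compliance bits")
    if connector == CONNECTOR_LC and copper and not optical:
        problems.append("LC connector with copper-only (CX/T) compliance bits")
    return problems


if __name__ == "__main__":
    # Values reported for the ALL4781 in this thread: RJ45 plus 1000BASE-SX.
    print(eeprom_inconsistencies(0x22, 0x01))
```

A check like this only flags fields that contradict each other; it cannot tell which of the two is wrong, which is in line with the point above that the EEPROM contents cannot be trusted on their own.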
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 23:50 ` ѽ҉ᶬḳ℠ 2020-01-10 0:18 ` ѽ҉ᶬḳ℠ @ 2020-01-10 9:27 ` Russell King - ARM Linux admin 2020-01-10 9:50 ` ѽ҉ᶬḳ℠ 1 sibling, 1 reply; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-10 9:27 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Thu, Jan 09, 2020 at 11:50:14PM +0000, ѽ҉ᶬḳ℠ wrote: > > On 09/01/2020 23:10, Russell King - ARM Linux admin wrote: > > > > Please don't use mii-tool with SFPs that do not have a PHY; the "PHY" > > registers are emulated, and are there just for compatibility. Please > > use ethtool in preference, especially for SFPs. > > Sure, just ethtool is not much of help for this particular matter, all there > is ethtool -m and according to you the EEPROM dump is not to be relied on. How about just "ethtool eth2" ? > > CONFIG_DEBUG_GPIO is not the same as having debugfs support enabled. > > If debugfs is enabled, then gpiolib will provide the current state > > of gpios through debugfs. debugfs is normally mounted on > > /sys/kernel/debug, but may not be mounted by default depending on > > policy. Looking in /proc/filesystems will tell you definitively > > whether debugfs is enabled or not in the kernel. > debugsfs is mounted but ls -af /sys/kernel/debug/gpio only producing > (oddly): > > /sys/kernel/debug/gpio Try "cat /sys/kernel/debug/gpio" > > So, if that is correct... > > > > Current OpenWRT is derived from 4.19-stable kernels, which include > > experimental patches picked at some point from my "phy" branch, and > > TOS is derived from OpenWRT. > > This may not be correct since there are not many device targets in OpenWrt > that feature a SFP cage (least as of today), the Turris Omnia might even be > the sole one. It isn't; there are definitely platforms that run OpenWRT that also have SFP cages (even a pair of them) and that make use of this code. 
-- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 9:27 ` Russell King - ARM Linux admin @ 2020-01-10 9:50 ` ѽ҉ᶬḳ℠ 2020-01-10 10:19 ` ѽ҉ᶬḳ℠ 2020-01-10 11:44 ` Russell King - ARM Linux admin 0 siblings, 2 replies; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-10 9:50 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev On 10/01/2020 09:27, Russell King - ARM Linux admin wrote: > On Thu, Jan 09, 2020 at 11:50:14PM +0000, ѽ҉ᶬḳ℠ wrote: >> On 09/01/2020 23:10, Russell King - ARM Linux admin wrote: >>> Please don't use mii-tool with SFPs that do not have a PHY; the "PHY" >>> registers are emulated, and are there just for compatibility. Please >>> use ethtool in preference, especially for SFPs. >> Sure, just ethtool is not much of help for this particular matter, all there >> is ethtool -m and according to you the EEPROM dump is not to be relied on. > How about just "ethtool eth2" ? Settings for eth2: Supported ports: [ TP ] Supported link modes: 1000baseX/Full Supported pause frame use: Symmetric Supports auto-negotiation: Yes Supported FEC modes: Not reported Advertised link modes: 1000baseX/Full Advertised pause frame use: Symmetric Advertised auto-negotiation: Yes Advertised FEC modes: Not reported Speed: 1000Mb/s Duplex: Full Port: Twisted Pair PHYAD: 0 Transceiver: internal Auto-negotiation: on MDI-X: Unknown Supports Wake-on: d Wake-on: d Link detected: yes And i2cdetect -l i2c-3 i2c i2c-0-mux (chan_id 2) I2C adapter i2c-1 i2c i2c-0-mux (chan_id 0) I2C adapter i2c-8 i2c i2c-0-mux (chan_id 7) I2C adapter i2c-6 i2c i2c-0-mux (chan_id 5) I2C adapter i2c-4 i2c i2c-0-mux (chan_id 3) I2C adapter i2c-2 i2c i2c-0-mux (chan_id 1) I2C adapter i2c-0 i2c mv64xxx_i2c adapter I2C adapter i2c-7 i2c i2c-0-mux (chan_id 6) I2C adapter i2c-5 i2c i2c-0-mux (chan_id 4) I2C adapter > >>> CONFIG_DEBUG_GPIO is not the same as having debugfs support enabled. 
>>> If debugfs is enabled, then gpiolib will provide the current state >>> of gpios through debugfs. debugfs is normally mounted on >>> /sys/kernel/debug, but may not be mounted by default depending on >>> policy. Looking in /proc/filesystems will tell you definitively >>> whether debugfs is enabled or not in the kernel. >> debugsfs is mounted but ls -af /sys/kernel/debug/gpio only producing >> (oddly): >> >> /sys/kernel/debug/gpio > Try "cat /sys/kernel/debug/gpio" gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep: gpio-504 ( |tx-fault ) in lo IRQ gpio-505 ( |tx-disable ) out lo gpio-506 ( |rate-select0 ) in lo gpio-507 ( |los ) in lo IRQ gpio-508 ( |mod-def0 ) in lo IRQ ___ Probably unrelated, just for posterity noticed in the logs: pca953x 8-0071: 8-0071 supply vcc not found, using dummy regulator ___ Meantime Allnet responded, which basically sums up to (blame ping pong - it is not me but go and look there instead...) - driver support is not being handled by Allnet but by Metanoia, latter being designer and manufacturer - Allnet does not have the buying power to persuade Metanoia to look into the matter - it would appear that SFP.C is trying to communicate with Fiber-GBIC and fails since the signal reports may not be 100% compatible Pending is their feedback about SFP MSA conformity. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10  9:50 ` ѽ҉ᶬḳ℠
@ 2020-01-10 10:19 ` ѽ҉ᶬḳ℠
  2020-01-10 11:46   ` Russell King - ARM Linux admin
  2020-01-10 13:22   ` Andrew Lunn
  2020-01-10 11:44   ` Russell King - ARM Linux admin
  1 sibling, 2 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 10:19 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

I just came across this:
http://edgemax5.rssing.com/chan-66822975/all_p1715.html#item34298

Although it concerns an SFP g.fast module, it indicates/implies that Metanoia provides its own Linux drivers (supposedly GPL licensed), plus some bits pertaining to the EBM (Ethernet Boot Management protocol).

Has Metanoia submitted any SFP drivers to upstream kernel development?

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 10:19 ` ѽ҉ᶬḳ℠
@ 2020-01-10 11:46 ` Russell King - ARM Linux admin
  2020-01-10 13:22   ` Andrew Lunn
  1 sibling, 0 replies; 40+ messages in thread
From: Russell King - ARM Linux admin @ 2020-01-10 11:46 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 10:19:47AM +0000, ѽ҉ᶬḳ℠ wrote:
> I just came across this
> http://edgemax5.rssing.com/chan-66822975/all_p1715.html#item34298
>
> and albeit for a SFP g.fast module it indicates/implies that Metanoia
> provides own Linux drivers (supposedly GPL licensed), plus some bits
> pertaining to the EBM (Ethernet Boot Management protocol).
>
> Has Metanoia submitted any SFP drivers to upstream kernel development?

Not that I'm aware of.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 10:19 ` ѽ҉ᶬḳ℠
  2020-01-10 11:46   ` Russell King - ARM Linux admin
@ 2020-01-10 13:22 ` Andrew Lunn
  2020-01-10 13:38   ` ѽ҉ᶬḳ℠
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Lunn @ 2020-01-10 13:22 UTC (permalink / raw)
  To: ѽ҉ᶬḳ℠
  Cc: Russell King - ARM Linux admin, netdev

On Fri, Jan 10, 2020 at 10:19:47AM +0000, ѽ҉ᶬḳ℠ wrote:
> I just came across this
> http://edgemax5.rssing.com/chan-66822975/all_p1715.html#item34298
>
> and albeit for a SFP g.fast module it indicates/implies that Metanoia
> provides own Linux drivers (supposedly GPL licensed), plus some bits
> pertaining to the EBM (Ethernet Boot Management protocol).
>
> Has Metanoia submitted any SFP drivers to upstream kernel development?

I have also not seen any. You could ask for the sources. It is unlikely we would use them, but they could provide documentation about the quirks needed to make this device work properly.

	Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 13:22 ` Andrew Lunn
@ 2020-01-10 13:38 ` ѽ҉ᶬḳ℠
  0 siblings, 0 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-10 13:38 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Russell King - ARM Linux admin, netdev

On 10/01/2020 13:22, Andrew Lunn wrote:
> On Fri, Jan 10, 2020 at 10:19:47AM +0000, ѽ҉ᶬḳ℠ wrote:
>> I just came across this
>> http://edgemax5.rssing.com/chan-66822975/all_p1715.html#item34298
>>
>> and albeit for a SFP g.fast module it indicates/implies that Metanoia
>> provides own Linux drivers (supposedly GPL licensed), plus some bits
>> pertaining to the EBM (Ethernet Boot Management protocol).
>>
>> Has Metanoia submitted any SFP drivers to upstream kernel development?
> I have also not seen any. You could ask for the sources. It is
> unlikely we would use them, but they could provide documentation about
> the quirks needed to make this device work properly.
>
> Andrew
>

I wish I could, since it would be really helpful. As far as I can tell, some of the ISPs in Switzerland, at least Swisscom, provide Metanoia designed/manufactured SFP modules as CPE, and those drivers and EBM tools for Linux are packaged into the firmware shipped by the ISP.

Too bad that Metanoia has not bothered to take the initiative and submit those to upstream development, even if it were just for taking a peek at potential quirks.

Regrettably, I do not have a relationship (commercial or otherwise) at a level that would warrant a response to a request for such drivers from either Metanoia or an ISP providing Metanoia equipment.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 9:50 ` ѽ҉ᶬḳ℠ 2020-01-10 10:19 ` ѽ҉ᶬḳ℠ @ 2020-01-10 11:44 ` Russell King - ARM Linux admin 2020-01-10 12:45 ` ѽ҉ᶬḳ℠ 1 sibling, 1 reply; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-10 11:44 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Fri, Jan 10, 2020 at 09:50:00AM +0000, ѽ҉ᶬḳ℠ wrote: > On 10/01/2020 09:27, Russell King - ARM Linux admin wrote: > > On Thu, Jan 09, 2020 at 11:50:14PM +0000, ѽ҉ᶬḳ℠ wrote: > > > On 09/01/2020 23:10, Russell King - ARM Linux admin wrote: > > > > Please don't use mii-tool with SFPs that do not have a PHY; the "PHY" > > > > registers are emulated, and are there just for compatibility. Please > > > > use ethtool in preference, especially for SFPs. > > > Sure, just ethtool is not much of help for this particular matter, all there > > > is ethtool -m and according to you the EEPROM dump is not to be relied on. > > How about just "ethtool eth2" ? > > Settings for eth2: > Supported ports: [ TP ] > Supported link modes: 1000baseX/Full > Supported pause frame use: Symmetric > Supports auto-negotiation: Yes > Supported FEC modes: Not reported > Advertised link modes: 1000baseX/Full > Advertised pause frame use: Symmetric > Advertised auto-negotiation: Yes > Advertised FEC modes: Not reported > Speed: 1000Mb/s > Duplex: Full > Port: Twisted Pair > PHYAD: 0 > Transceiver: internal > Auto-negotiation: on > MDI-X: Unknown > Supports Wake-on: d > Wake-on: d > Link detected: yes That looks fine. > > > > CONFIG_DEBUG_GPIO is not the same as having debugfs support enabled. > > > > If debugfs is enabled, then gpiolib will provide the current state > > > > of gpios through debugfs. debugfs is normally mounted on > > > > /sys/kernel/debug, but may not be mounted by default depending on > > > > policy. Looking in /proc/filesystems will tell you definitively > > > > whether debugfs is enabled or not in the kernel. 
> > > debugsfs is mounted but ls -af /sys/kernel/debug/gpio only producing > > > (oddly): > > > > > > /sys/kernel/debug/gpio > > Try "cat /sys/kernel/debug/gpio" > > gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep: > gpio-504 ( |tx-fault ) in lo IRQ > gpio-505 ( |tx-disable ) out lo > gpio-506 ( |rate-select0 ) in lo > gpio-507 ( |los ) in lo IRQ > gpio-508 ( |mod-def0 ) in lo IRQ Which is also indicating everything is correct. When the problem occurs, check the state of the signals again as close as possible to the event - it depends how long the transceiver keeps it asserted. You will probably find tx-fault is indicating "in hi IRQ". > Meantime Allnet responded, which basically sums up to (blame ping pong - it > is not me but go and look there instead...) > > - driver support is not being handled by Allnet but by Metanoia, latter > being designer and manufacturer > - Allnet does not have the buying power to persuade Metanoia to look into > the matter ... which is pretty standard; no one will rework their SFP unless they fear their sales will be severely impacted by the issue. > - it would appear that SFP.C is trying to communicate with Fiber-GBIC and > fails since the signal reports may not be 100% compatible That's a fun claim, but note carefully the wording "may" which implies some uncertainty in the statement. Let's look at the wording of the GBIC (SFF-8053) and SFP (INF-8074 - SFP MSA) documents. The wording for the "fault recovery" is identical between the two, which concerns what happens when TX_FAULT is asserted and how to recover from that. Concerning the implementation of TX_FAULT, SFF-8053 states: If no transmitter safety circuitry is implemented, the TX_FAULT signal may be tied to its negated state. but then says later in the document: If TX_FAULT is not implemented, the signal shall be held to the low state by the GBIC. 
Meanwhile, INF-8074 similarly states: If no transmitter safety circuitry is implemented, the TX_FAULT signal may be tied to its negated state. but later on has a similar statement: TX_FAULT shall be implemented by those module definitions of SFP transceiver supporting safety circuitry. If TX_FAULT is not implemented, the signal shall be held to the low state by the SFP transceiver. "shall" in both cases is stronger than "may". So, there seems to be little difference between the GBIC and SFP usage of this signal. Their claim is that sfp.c implements the older GBIC style of signal reports. My counter-claim is that (a) sfp.c is written to the SFP MSA and not the GBIC standard, and (b) there is no difference as far as the TX_FAULT signal is concerned between the GBIC standard and the SFP MSA. But... it doesn't matter that much, there's a module out there (and it isn't the only one) which does "funny stuff" with its TX_FAULT signal. Either we decide we want to support it and implement a quirk, or we decide we don't want to support it. There is an option bit in the EEPROM that is supposed to indicate whether the module supports TX_FAULT, but, as you can guess, there are problems with using that, as: 1) there are a lot of modules, particularly optical modules, that implement TX_FAULT correctly but don't set the option bit to say that they support the signal. 2) the other module I'm aware of that does "funny stuff" with its TX_FAULT signal does have the TX_FAULT option bit set. So, the option bit is completely untrustworthy and, therefore, is meaningless (which is why we don't use it.) -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up ^ permalink raw reply [flat|nested] 40+ messages in thread
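Editorial note: checking whether tx-fault reads "in hi IRQ", as suggested in the message above, amounts to parsing the text of /sys/kernel/debug/gpio. The following Python sketch assumes the line layout shown in this thread ("gpio-504 ( |tx-fault ) in lo IRQ"); the regular expression and function name are illustrative and not part of any kernel interface.

```python
import re

# Matches lines such as: " gpio-504 (        |tx-fault      ) in  lo IRQ"
GPIO_LINE = re.compile(
    r"gpio-(?P<num>\d+)\s*\(\s*(?P<name>[^|]*)\|(?P<label>[^)]*)\)\s*"
    r"(?P<dir>in|out)\s+(?P<level>hi|lo|\?)"
)


def parse_gpio_levels(text):
    """Map each GPIO consumer label to a (direction, level) tuple."""
    levels = {}
    for match in GPIO_LINE.finditer(text):
        label = match.group("label").strip()
        levels[label] = (match.group("dir"), match.group("level"))
    return levels


# Usage on a live system with debugfs mounted (not executed here):
#   with open("/sys/kernel/debug/gpio") as f:
#       print(parse_gpio_levels(f.read()).get("tx-fault"))
```

With the output posted in this thread the tx-fault entry parses as ("in", "lo"); during a fault event it would be expected to read ("in", "hi") for as long as the module keeps the signal asserted.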
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 11:44 ` Russell King - ARM Linux admin @ 2020-01-10 12:45 ` ѽ҉ᶬḳ℠ 2020-01-10 12:53 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-10 12:45 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev On 10/01/2020 11:44, Russell King - ARM Linux admin wrote: > > Which is also indicating everything is correct. When the problem > occurs, check the state of the signals again as close as possible > to the event - it depends how long the transceiver keeps it > asserted. You will probably find tx-fault is indicating > "in hi IRQ". just discovered userland - gpioinfo pca9538 - which seems more verbose gpiochip2 - 8 lines: line 0: unnamed "tx-fault" input active-high [used] line 1: unnamed "tx-disable" output active-high [used] line 2: unnamed "rate-select0" input active-high [used] line 3: unnamed "los" input active-high [used] line 4: unnamed "mod-def0" input active-low [used] line 5: unnamed unused input active-high line 6: unnamed unused input active-high line 7: unnamed unused input active-high The above is depicting the current state with the module working, i.e. being online. Will do some testing and report back, not sure yet how to keep a close watch relating to the failure events. >> - it would appear that SFP.C is trying to communicate with Fiber-GBIC and >> fails since the signal reports may not be 100% compatible > That's a fun claim, but note carefully the wording "may" which implies > some uncertainty in the statement. It was a verbatim translation but yes, even in the initial language correspondence such uncertainty is implied indeed. > > Let's look at the wording of the GBIC (SFF-8053) and SFP (INF-8074 - > SFP MSA) documents. The wording for the "fault recovery" is identical > between the two, which concerns what happens when TX_FAULT is asserted > and how to recover from that. 
> > Concerning the implementation of TX_FAULT, SFF-8053 states: > > If no transmitter safety circuitry is implemented, the TX_FAULT signal > may be tied to its negated state. > > but then says later in the document: > > If TX_FAULT is not implemented, the signal shall be held to the low > state by the GBIC. > > Meanwhile, INF-8074 similarly states: > > If no transmitter safety circuitry is implemented, the TX_FAULT signal > may be tied to its negated state. > > but later on has a similar statement: > > TX_FAULT shall be implemented by those module definitions of SFP > transceiver supporting safety circuitry. If TX_FAULT is not > implemented, the signal shall be held to the low state by the SFP > transceiver. > > "shall" in both cases is stronger than "may". So, there seems to be > little difference between the GBIC and SFP usage of this signal. > > Their claim is that sfp.c implements the older GBIC style of signal > reports. My counter-claim is that (a) sfp.c is written to the SFP MSA > and not the GBIC standard, and (b) there is no difference as far as the > TX_FAULT signal is concerned between the GBIC standard and the SFP MSA. > > But... it doesn't matter that much, there's a module out there (and it > isn't the only one) which does "funny stuff" with its TX_FAULT signal. > Either we decide we want to support it and implement a quirk, or we > decide we don't want to support it. > > There is an option bit in the EEPROM that is supposed to indicate > whether the module supports TX_FAULT, but, as you can guess, there are > problems with using that, as: > > 1) there are a lot of modules, particularly optical modules, that > implement TX_FAULT correctly but don't set the option bit to say > that they support the signal. > > 2) the other module I'm aware of that does "funny stuff" with its > TX_FAULT signal does have the TX_FAULT option bit set. > > So, the option bit is completely untrustworthy and, therefore, is > meaningless (which is why we don't use it.) 
>

Even with "shall" carrying more weight than "may", it still does not imply
something obligatory (set in stone) and leaves some wiggle room.
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
From: Russell King - ARM Linux admin @ 2020-01-10 12:53 UTC
To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 12:45:52PM +0000, ѽ҉ᶬḳ℠ wrote:
> On 10/01/2020 11:44, Russell King - ARM Linux admin wrote:
> >
> > Which is also indicating everything is correct. When the problem
> > occurs, check the state of the signals again as close as possible
> > to the event - it depends how long the transceiver keeps it
> > asserted. You will probably find tx-fault is indicating
> > "in hi IRQ".
> just discovered userland - gpioinfo pca9538 - which seems more verbose
>
> gpiochip2 - 8 lines:
>     line 0: unnamed "tx-fault" input active-high [used]
>     line 1: unnamed "tx-disable" output active-high [used]
>     line 2: unnamed "rate-select0" input active-high [used]
>     line 3: unnamed "los" input active-high [used]
>     line 4: unnamed "mod-def0" input active-low [used]
>     line 5: unnamed unused input active-high
>     line 6: unnamed unused input active-high
>     line 7: unnamed unused input active-high
>
> The above is depicting the current state with the module working, i.e. being
> online. Will do some testing and report back, not sure yet how to keep a
> close watch relating to the failure events.

However, that doesn't give the current levels of the inputs, so it's
useless for the purpose I've asked for.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 12:53 ` Russell King - ARM Linux admin @ 2020-01-10 15:02 ` ѽ҉ᶬḳ℠ 2020-01-10 15:09 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-10 15:02 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev On 10/01/2020 12:53, Russell King - ARM Linux admin wrote: > >>> Which is also indicating everything is correct. When the problem >>> occurs, check the state of the signals again as close as possible >>> to the event - it depends how long the transceiver keeps it >>> asserted. You will probably find tx-fault is indicating >>> "in hi IRQ". >> just discovered userland - gpioinfo pca9538 - which seems more verbose >> >> gpiochip2 - 8 lines: >> line 0: unnamed "tx-fault" input active-high [used] >> line 1: unnamed "tx-disable" output active-high [used] >> line 2: unnamed "rate-select0" input active-high [used] >> line 3: unnamed "los" input active-high [used] >> line 4: unnamed "mod-def0" input active-low [used] >> line 5: unnamed unused input active-high >> line 6: unnamed unused input active-high >> line 7: unnamed unused input active-high >> >> The above is depicting the current state with the module working, i.e. being >> online. Will do some testing and report back, not sure yet how to keep a >> close watch relating to the failure events. > However, that doesn't give the current levels of the inputs, so it's > useless for the purpose I've asked for. Fair enough. 
Operational (online) state

gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
 gpio-504 ( |tx-fault ) in lo IRQ
 gpio-505 ( |tx-disable ) out lo
 gpio-506 ( |rate-select0 ) in lo
 gpio-507 ( |los ) in lo IRQ
 gpio-508 ( |mod-def0 ) in lo IRQ

The same state remained unchanged during and after the events (as closely
as I was able to monitor) -> module transmit fault indicated

I also kept an eye on the link state of the NIC (eth2) with
- ip mo l dev eth2 - and the link does not come up while those transmit
faults are present; it only gets back online once the transmit faults
are cleared.

At a loss really, since it seems the SFP is not exhibiting any mischievous
behaviour. It does not even reproduce reliably: just now it took a couple
of attempts to invoke the transmit fault, while other times it happens
straight away.

I suppose there could always be a hardware defect, though it seems
somewhat unlikely. That said, I do not have another such module available
to swap for testing.
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 15:02 ` ѽ҉ᶬḳ℠ @ 2020-01-10 15:09 ` Russell King - ARM Linux admin 2020-01-10 15:45 ` ѽ҉ᶬḳ℠ 0 siblings, 1 reply; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-10 15:09 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Fri, Jan 10, 2020 at 03:02:51PM +0000, ѽ҉ᶬḳ℠ wrote: > On 10/01/2020 12:53, Russell King - ARM Linux admin wrote: > > > > > > Which is also indicating everything is correct. When the problem > > > > occurs, check the state of the signals again as close as possible > > > > to the event - it depends how long the transceiver keeps it > > > > asserted. You will probably find tx-fault is indicating > > > > "in hi IRQ". > > > just discovered userland - gpioinfo pca9538 - which seems more verbose > > > > > > gpiochip2 - 8 lines: > > > line 0: unnamed "tx-fault" input active-high [used] > > > line 1: unnamed "tx-disable" output active-high [used] > > > line 2: unnamed "rate-select0" input active-high [used] > > > line 3: unnamed "los" input active-high [used] > > > line 4: unnamed "mod-def0" input active-low [used] > > > line 5: unnamed unused input active-high > > > line 6: unnamed unused input active-high > > > line 7: unnamed unused input active-high > > > > > > The above is depicting the current state with the module working, i.e. being > > > online. Will do some testing and report back, not sure yet how to keep a > > > close watch relating to the failure events. > > However, that doesn't give the current levels of the inputs, so it's > > useless for the purpose I've asked for. > > Fair enough. 
> Operational (online) state
>
> gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
>  gpio-504 ( |tx-fault ) in lo IRQ
>  gpio-505 ( |tx-disable ) out lo
>  gpio-506 ( |rate-select0 ) in lo
>  gpio-507 ( |los ) in lo IRQ
>  gpio-508 ( |mod-def0 ) in lo IRQ
>
> And the same remained (unchanged) during/after the events (as closely I was
> able to monitor) -> module transmit fault indicated

Try:

  while ! grep -A4 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done

which may have a better chance of catching it.
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 15:09 ` Russell King - ARM Linux admin @ 2020-01-10 15:45 ` ѽ҉ᶬḳ℠ 2020-01-10 16:32 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-10 15:45 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev On 10/01/2020 15:09, Russell King - ARM Linux admin wrote: > On Fri, Jan 10, 2020 at 03:02:51PM +0000, ѽ҉ᶬḳ℠ wrote: >> On 10/01/2020 12:53, Russell King - ARM Linux admin wrote: >>>>> Which is also indicating everything is correct. When the problem >>>>> occurs, check the state of the signals again as close as possible >>>>> to the event - it depends how long the transceiver keeps it >>>>> asserted. You will probably find tx-fault is indicating >>>>> "in hi IRQ". >>>> just discovered userland - gpioinfo pca9538 - which seems more verbose >>>> >>>> gpiochip2 - 8 lines: >>>> line 0: unnamed "tx-fault" input active-high [used] >>>> line 1: unnamed "tx-disable" output active-high [used] >>>> line 2: unnamed "rate-select0" input active-high [used] >>>> line 3: unnamed "los" input active-high [used] >>>> line 4: unnamed "mod-def0" input active-low [used] >>>> line 5: unnamed unused input active-high >>>> line 6: unnamed unused input active-high >>>> line 7: unnamed unused input active-high >>>> >>>> The above is depicting the current state with the module working, >>>> i.e. being >>>> online. Will do some testing and report back, not sure yet how to >>>> keep a >>>> close watch relating to the failure events. >>> However, that doesn't give the current levels of the inputs, so it's >>> useless for the purpose I've asked for. >> Fair enough. 
>> Operational (online) state
>>
>> gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
>>  gpio-504 ( |tx-fault ) in lo IRQ
>>  gpio-505 ( |tx-disable ) out lo
>>  gpio-506 ( |rate-select0 ) in lo
>>  gpio-507 ( |los ) in lo IRQ
>>  gpio-508 ( |mod-def0 ) in lo IRQ
>>
>> And the same remained (unchanged) during/after the events (as closely
>> I was
>> able to monitor) -> module transmit fault indicated
> Try:
>
> while ! grep -A4 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done
>
> which may have a better chance of catching it.
>

I suppose you are not interested in what happens with ifdown wan, so just
for posterity:

 gpio-504 ( |tx-fault ) in hi IRQ
 gpio-505 ( |tx-disable ) out hi
 gpio-506 ( |rate-select0 ) in lo
 gpio-507 ( |los ) in lo IRQ
 gpio-508 ( |mod-def0 ) in lo IRQ

When the iif is brought up again and happens to trigger a transmit fault,
tx-fault is not seen going hi, however. And it did not try 5 times to
recover from the fault, unless dmesg missed some:

[Fri Jan 10 15:30:57 2020] mvneta f1034000.ethernet eth2: Link is Down
[Fri Jan 10 15:30:57 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[Fri Jan 10 15:31:13 2020] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[Fri Jan 10 15:31:13 2020] sfp sfp: module transmit fault indicated
[Fri Jan 10 15:31:15 2020] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control off
[Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault recovered
[Fri Jan 10 15:31:16 2020] mvneta f1034000.ethernet eth2: Link is Down
[Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault indicated
[Fri Jan 10 15:31:19 2020] sfp sfp: module persistently indicates fault, disabling
[Fri Jan 10 15:31:21 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[Fri Jan 10 15:31:21 2020] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[Fri Jan 10 15:31:21 2020] sfp sfp: module transmit fault indicated
[Fri Jan 10 15:31:27 2020] sfp sfp: module persistently indicates fault, disabling
[Fri Jan 10 15:38:01 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[Fri Jan 10 15:38:01 2020] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[Fri Jan 10 15:38:01 2020] sfp sfp: module transmit fault indicated
[Fri Jan 10 15:38:07 2020] sfp sfp: module persistently indicates fault, disabling
[Fri Jan 10 15:40:48 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[Fri Jan 10 15:40:48 2020] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[Fri Jan 10 15:40:48 2020] sfp sfp: module transmit fault indicated
[Fri Jan 10 15:40:54 2020] sfp sfp: module persistently indicates fault, disabling

Had to reboot the node to regain WAN connectivity.
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 15:45 ` ѽ҉ᶬḳ℠ @ 2020-01-10 16:32 ` Russell King - ARM Linux admin 2020-01-10 16:53 ` ѽ҉ᶬḳ℠ 0 siblings, 1 reply; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-10 16:32 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Fri, Jan 10, 2020 at 03:45:15PM +0000, ѽ҉ᶬḳ℠ wrote: > On 10/01/2020 15:09, Russell King - ARM Linux admin wrote: > > On Fri, Jan 10, 2020 at 03:02:51PM +0000, ѽ҉ᶬḳ℠ wrote: > > > On 10/01/2020 12:53, Russell King - ARM Linux admin wrote: > > > > > > Which is also indicating everything is correct. When the problem > > > > > > occurs, check the state of the signals again as close as possible > > > > > > to the event - it depends how long the transceiver keeps it > > > > > > asserted. You will probably find tx-fault is indicating > > > > > > "in hi IRQ". > > > > > just discovered userland - gpioinfo pca9538 - which seems more verbose > > > > > > > > > > gpiochip2 - 8 lines: > > > > > line 0: unnamed "tx-fault" input active-high [used] > > > > > line 1: unnamed "tx-disable" output active-high [used] > > > > > line 2: unnamed "rate-select0" input active-high [used] > > > > > line 3: unnamed "los" input active-high [used] > > > > > line 4: unnamed "mod-def0" input active-low [used] > > > > > line 5: unnamed unused input active-high > > > > > line 6: unnamed unused input active-high > > > > > line 7: unnamed unused input active-high > > > > > > > > > > The above is depicting the current state with the module > > > > > working, i.e. being > > > > > online. Will do some testing and report back, not sure yet > > > > > how to keep a > > > > > close watch relating to the failure events. > > > > However, that doesn't give the current levels of the inputs, so it's > > > > useless for the purpose I've asked for. > > > Fair enough. 
> > > Operational (online) state
> > >
> > > gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep:
> > >  gpio-504 ( |tx-fault ) in lo IRQ
> > >  gpio-505 ( |tx-disable ) out lo
> > >  gpio-506 ( |rate-select0 ) in lo
> > >  gpio-507 ( |los ) in lo IRQ
> > >  gpio-508 ( |mod-def0 ) in lo IRQ
> > >
> > > And the same remained (unchanged) during/after the events (as
> > > closely I was
> > > able to monitor) -> module transmit fault indicated
> > Try:
> >
> > while ! grep -A4 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done
> >
> > which may have a better chance of catching it.
> >
>
> Suppose you are not interested in what happens with ifdown wan, so just for
> posterity
>
> gpio-504 ( |tx-fault ) in hi IRQ
> gpio-505 ( |tx-disable ) out hi
> gpio-506 ( |rate-select0 ) in lo
> gpio-507 ( |los ) in lo IRQ
> gpio-508 ( |mod-def0 ) in lo IRQ

Right, because the state of TX_FAULT is not defined while TX_DISABLE
is high.

> When the iif is brought up again and happens to trigger a transmit fault the
> hi is not being triggered however. And it did not try 5 times to recover
> from the fault, unless dmesg missed some
>
> [Fri Jan 10 15:30:57 2020] mvneta f1034000.ethernet eth2: Link is Down
> [Fri Jan 10 15:30:57 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not
> ready

Here is where you brought the interface down.

> [Fri Jan 10 15:31:13 2020] mvneta f1034000.ethernet eth2: configuring for
> inband/1000base-x link mode

Here is where you brought the interface back up.

> [Fri Jan 10 15:31:13 2020] sfp sfp: module transmit fault indicated

The module failed to deassert TX_FAULT within the 300ms window.

> [Fri Jan 10 15:31:15 2020] mvneta f1034000.ethernet eth2: Link is Up -
> 1Gbps/Full - flow control off
> [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault recovered

I'm not sure why these two messages are this way round; they should
be the other way (I guess that's something to do with printk() no
longer being synchronous.)
Given that it seems to have taken two seconds to recover, that will be
two reset attempts.

> [Fri Jan 10 15:31:16 2020] mvneta f1034000.ethernet eth2: Link is Down
> [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault indicated

The module re-asserts TX_FAULT...

> [Fri Jan 10 15:31:19 2020] sfp sfp: module persistently indicates fault,
> disabling

After another three attempts, the module is still asserting TX_FAULT.

We don't report every attempt made to recover the fault; that would
flood the log. Instead, we report when the fault occurred, then try
to reset the fault every second for up to five attempts _total_. If
we exhaust the attempts, you get "module persistently indicates fault,
disabling". If successfully recovered, you'll get "module transmit
fault recovered".

We don't reset the retries after a successful recovery as that would
mean a transitory safety fault occurring once every few seconds would
endlessly fill the kernel log, and also may go unnoticed. If it's
spitting out safety faults like that, the module would be faulty.
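The retry accounting described in the message above can be sketched as a
small shell model. This is an illustration only, not the sfp.c code: the
helper name simulate_faults and its input (how many reset attempts each
fault would need before TX_FAULT deasserts) are hypothetical, and the real
driver's bookkeeping may differ in detail. The only things taken from the
thread are the budget of five attempts in total, the fact that the budget
is not refilled after a recovery, and the three log messages.

```shell
#!/bin/sh
# Illustrative model only (NOT sfp.c): one budget of reset attempts is
# shared by all faults and is not refilled after a successful recovery.
MAX_ATTEMPTS=5   # "up to five attempts _total_", as stated in the message

# simulate_faults N1 N2 ...
# Each argument is a hypothetical number of reset attempts that a fault
# would need before TX_FAULT deasserts.
simulate_faults() {
    left=$MAX_ATTEMPTS
    for needed in "$@"; do
        echo "module transmit fault indicated"
        if [ "$needed" -le "$left" ]; then
            # Fault cleared within the remaining budget; keep what is left.
            left=$((left - needed))
            echo "module transmit fault recovered"
        else
            # Budget exhausted while the fault is still asserted.
            echo "module persistently indicates fault, disabling"
            return 1
        fi
    done
    return 0
}
```

With this model, simulate_faults 2 99 (a fault that clears after two
attempts, followed by one that never clears) prints "indicated",
"recovered", "indicated" and then "module persistently indicates fault,
disabling", which is the same order of messages as in the dmesg excerpt
between 15:31:13 and 15:31:19.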
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 16:32 ` Russell King - ARM Linux admin @ 2020-01-10 16:53 ` ѽ҉ᶬḳ℠ 2020-01-10 17:08 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-10 16:53 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev On 10/01/2020 16:32, Russell King - ARM Linux admin wrote: > On Fri, Jan 10, 2020 at 03:45:15PM +0000, ѽ҉ᶬḳ℠ wrote: >> On 10/01/2020 15:09, Russell King - ARM Linux admin wrote: >>> On Fri, Jan 10, 2020 at 03:02:51PM +0000, ѽ҉ᶬḳ℠ wrote: >>>> On 10/01/2020 12:53, Russell King - ARM Linux admin wrote: >>>>>>> Which is also indicating everything is correct. When the problem >>>>>>> occurs, check the state of the signals again as close as possible >>>>>>> to the event - it depends how long the transceiver keeps it >>>>>>> asserted. You will probably find tx-fault is indicating >>>>>>> "in hi IRQ". >>>>>> just discovered userland - gpioinfo pca9538 - which seems more verbose >>>>>> >>>>>> gpiochip2 - 8 lines: >>>>>> line 0: unnamed "tx-fault" input active-high [used] >>>>>> line 1: unnamed "tx-disable" output active-high [used] >>>>>> line 2: unnamed "rate-select0" input active-high [used] >>>>>> line 3: unnamed "los" input active-high [used] >>>>>> line 4: unnamed "mod-def0" input active-low [used] >>>>>> line 5: unnamed unused input active-high >>>>>> line 6: unnamed unused input active-high >>>>>> line 7: unnamed unused input active-high >>>>>> >>>>>> The above is depicting the current state with the module >>>>>> working, i.e. being >>>>>> online. Will do some testing and report back, not sure yet >>>>>> how to keep a >>>>>> close watch relating to the failure events. >>>>> However, that doesn't give the current levels of the inputs, so it's >>>>> useless for the purpose I've asked for. >>>> Fair enough. 
Operational (online) state >>>> >>>> gpiochip2: GPIOs 504-511, parent: i2c/8-0071, pca9538, can sleep: >>>> gpio-504 ( |tx-fault ) in lo IRQ >>>> gpio-505 ( |tx-disable ) out lo >>>> gpio-506 ( |rate-select0 ) in lo >>>> gpio-507 ( |los ) in lo IRQ >>>> gpio-508 ( |mod-def0 ) in lo IRQ >>>> >>>> And the same remained (unchanged) during/after the events (as >>>> closely I was >>>> able to monitor) -> module transmit fault indicated >>> Try: >>> >>> while ! grep -A4 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done >>> >>> which may have a better chance of catching it. >>> >> Suppose you are not interested in what happens with ifdown wan, so just for >> posterity >> >> gpio-504 ( |tx-fault ) in hi IRQ >> gpio-505 ( |tx-disable ) out hi >> gpio-506 ( |rate-select0 ) in lo >> gpio-507 ( |los ) in lo IRQ >> gpio-508 ( |mod-def0 ) in lo IRQ > Right, because the state of TX_FAULT is not defined while TX_DISABLE > is high. > >> When the iif is brought up again and happens to trigger a transmit fault the >> hi is not being triggered however. And it did not try 5 times to recover >> from the fault, unless dmesg missed some >> >> [Fri Jan 10 15:30:57 2020] mvneta f1034000.ethernet eth2: Link is Down >> [Fri Jan 10 15:30:57 2020] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not >> ready > Here is where you brought the interface down. > >> [Fri Jan 10 15:31:13 2020] mvneta f1034000.ethernet eth2: configuring for >> inband/1000base-x link mode > Here is where you brought the interface back up. > >> [Fri Jan 10 15:31:13 2020] sfp sfp: module transmit fault indicated > The module failed to deassert TX_FAULT within the 300ms window. > >> [Fri Jan 10 15:31:15 2020] mvneta f1034000.ethernet eth2: Link is Up - >> 1Gbps/Full - flow control off >> [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault recovered > I'm not sure why these two messages are this way round; they should > be the other way (I guess that's something to do with printk() no > longer being synchronous.) 
> Given that it seems to have taken two
> seconds to recover, that will be two reset attempts.
>
>> [Fri Jan 10 15:31:16 2020] mvneta f1034000.ethernet eth2: Link is Down
>> [Fri Jan 10 15:31:16 2020] sfp sfp: module transmit fault indicated
> The module re-asserts TX_FAULT...
>
>> [Fri Jan 10 15:31:19 2020] sfp sfp: module persistently indicates fault,
>> disabling
> After another three attempts, the module is still asserting TX_FAULT.
>
> We don't report every attempt made to recover the fault; that would
> flood the log. Instead, we report when the fault occurred, then try
> to reset the fault every second for up to five attempts _total_. If
> we exhaust the attempts, you get "module persistently indicates fault,
> disabling". If successfully recovered, you'll get "module transmit
> fault recovered".
>
> We don't reset the retries after a successful recovery as that would
> mean a transitory safety fault occurring once every few seconds would
> endlessly fill the kernel log, and also may go unnoticed. If it's
> spitting out safety faults like that, the module would be faulty.
>

Seems that the debug avenue has been exhausted, short of running SFP.C in
debug mode.

There is still no explanation why the module passes the 300 ms deassert
TX_FAULT test most of the time but fails intermittently at other times,
which is rather inconsistent. Maybe it is just wishful thinking but it
seems a bit far-fetched that the module is really causing this; at least
the readings from the GPIOs do not provide any such indication.

Could there be something choking / blocking the communication channel
between the module and the kernel, some kernel code getting stuck / leaked
in memory?

Could the ipupdown routine, which has its own implementation in OpenWrt, be
an interfering agent, e.g. the way it constructs or tears down the iif,
though I do not see how?
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
From: Russell King - ARM Linux admin @ 2020-01-10 17:08 UTC
To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
> Seems that the debug avenue has been exhausted, short of running SFP.C in
> debug mode.

You're saying you never see TX_FAULT asserted other than when the
interface is down?

> There is still no explanation why the module passes the 300 ms deassert
> TX_FAULT test most of the time but fails intermittently at other times,
> being kind of incoherent. Maybe it is just wishful thinking but it seems a
> bit far-fetched that the module is really causing this, least the readings
> from GPIO do not provide any such indicator.
>
> Could there be something choking / blocking the communication channel
> between the module and the kernel, some kernel code getting stuck / leaked
> in memory?

There is no "communication channel" involved here. It is just those
GPIOs.

> Could the ipupdown routine, which has its own implementation in OpenWrt, be
> an interfering agent, e.g. the way it constructs or tears down the iif,
> though I do not see how?

Very unlikely.
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
From: ѽ҉ᶬḳ℠ @ 2020-01-10 17:19 UTC
To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 10/01/2020 17:08, Russell King - ARM Linux admin wrote:
> On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
>> Seems that the debug avenue has been exhausted, short of running SFP.C in
>> debug mode.
> You're saying you never see TX_FAULT asserted other than when the
> interface is down?

Yes, it never exhibits once the iif is up - it is rock-stable in that state,
only ever when being transitioned from down state to up state.
Pardon, if that has not been made explicitly clear previously.
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
From: Russell King - ARM Linux admin @ 2020-01-10 17:38 UTC
To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev

On Fri, Jan 10, 2020 at 05:19:35PM +0000, ѽ҉ᶬḳ℠ wrote:
>
> On 10/01/2020 17:08, Russell King - ARM Linux admin wrote:
> > On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
> > > Seems that the debug avenue has been exhausted, short of running SFP.C in
> > > debug mode.
> > You're saying you never see TX_FAULT asserted other than when the
> > interface is down?
>
> Yes, it never exhibits once the iif is up - it is rock-stable in that state,
> only ever when being transitioned from down state to up state.
> Pardon, if that has not been made explicitly clear previously.

I think if we were to have SFP debug enabled, you'll find that
TX_FAULT is being reported to SFP as being asserted.

You probably aren't running that while loop, as it will exit when
it sees TX_FAULT asserted. So, here's another bit of shell code
for you to run:

ip li set dev eth2 down; \
ip li set dev eth2 up; \
date
while :; do
  cat /proc/uptime
  while ! grep -A5 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done
  cat /proc/uptime
  while ! grep -A5 'tx-fault.*in lo' /sys/kernel/debug/gpio; do :; done
done

This will give you output such as:

Fri 10 Jan 17:31:06 GMT 2020
774869.13 1535859.48
 gpio-509 ( |tx-fault ) in hi
...
774869.14 1535859.49
 gpio-509 ( |tx-fault ) in lo
...
774869.15 1535859.50

The first date and "uptime" output is the timestamp when the interface
was brought up. Subsequent "uptime" outputs can be used to calculate
the time difference in seconds between the state printed immediately
prior to the uptime output, and the first "uptime" output.

So in the above example, the tx-fault signal was hi at 10ms, and then
went low 20ms after the up.
However, bear in mind that even this will not be good enough to spot
transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt
capable, watching /proc/interrupts may tell you more.

If the TX_FAULT signal is as stable as you claim it is, you should see
the interrupt count for it remaining the same.
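The time arithmetic on the /proc/uptime readings described in this message
can be sketched as follows. The helper name elapsed_ms is hypothetical and
is not part of the script above; it takes the first field of two
/proc/uptime readings (seconds since boot) and prints the difference in
milliseconds. The numbers are the ones from the sample output in the
message.

```shell
#!/bin/sh
# elapsed_ms START NOW
# Hypothetical helper: milliseconds between two /proc/uptime readings
# (first field only, i.e. seconds since boot).
elapsed_ms() {
    awk -v start="$1" -v now="$2" \
        'BEGIN { printf "%.0f\n", (now - start) * 1000 }'
}

# Readings taken from the sample output in the message above.
elapsed_ms 774869.13 774869.14   # tx-fault seen "hi" after the up
elapsed_ms 774869.13 774869.15   # tx-fault seen "lo" again
```

For the sample readings this prints 10 and 20, matching the "hi at 10ms,
low 20ms after the up" reading given in the message. The resolution is
limited by /proc/uptime itself, which reports hundredths of a second.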
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
From: ѽ҉ᶬḳ℠ @ 2020-01-10 18:44 UTC
To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev

On 10/01/2020 17:38, Russell King - ARM Linux admin wrote:
>
>>> On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote:
>>>> Seems that the debug avenue has been exhausted, short of running SFP.C in
>>>> debug mode.
>>> You're saying you never see TX_FAULT asserted other than when the
>>> interface is down?
>> Yes, it never exhibits once the iif is up - it is rock-stable in that state,
>> only ever when being transitioned from down state to up state.
>> Pardon, if that has not been made explicitly clear previously.
> I think if we were to have SFP debug enabled, you'll find that
> TX_FAULT is being reported to SFP as being asserted.

If really necessary I could ask the TOS developers to assist, though I am
not sure whether they would oblige. Their Master branch build bot compiles
twice a day.
Would it just involve setting a kernel debug flag or something more
elaborate?

>
> You probably aren't running that while loop, as it will exit when
> it sees TX_FAULT asserted. So, here's another bit of shell code
> for you to run:
>
> ip li set dev eth2 down; \
> ip li set dev eth2 up; \
> date
> while :; do
>   cat /proc/uptime
>   while ! grep -A5 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done
>   cat /proc/uptime
>   while ! grep -A5 'tx-fault.*in lo' /sys/kernel/debug/gpio; do :; done
> done
>
> This will give you output such as:
>
> Fri 10 Jan 17:31:06 GMT 2020
> 774869.13 1535859.48
>  gpio-509 ( |tx-fault ) in hi
> ...
> 774869.14 1535859.49
>  gpio-509 ( |tx-fault ) in lo
> ...
> 774869.15 1535859.50
>
> The first date and "uptime" output is the timestamp when the interface
> was brought up.
> Subsequent "uptime" outputs can be used to calculate
> the time difference in seconds between the state printed immediately
> prior to the uptime output, and the first "uptime" output.
>
> So in the above example, the tx-fault signal was hi at 10ms, and then
> went low 20ms after the up.

Awfully nice of you to provide the code. This is the output when the iif is
brought down and up again and exhibits the transmit fault.

ip li set dev eth2 down; \
> ip li set dev eth2 up; \
> date
Fri Jan 10 18:34:52 GMT 2020
root@to:~# while :; do
> cat /proc/uptime
> while ! grep -A5 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done
> cat /proc/uptime
> while ! grep -A5 'tx-fault.*in lo' /sys/kernel/debug/gpio; do :; done
> done
1865.20 3224.67
 gpio-504 ( |tx-fault ) in hi IRQ
 gpio-505 ( |tx-disable ) out hi
 gpio-506 ( |rate-select0 ) in lo
 gpio-507 ( |los ) in lo IRQ
 gpio-508 ( |mod-def0 ) in lo IRQ
1871.77 3230.71
 gpio-504 ( |tx-fault ) in lo IRQ
 gpio-505 ( |tx-disable ) out lo
 gpio-506 ( |rate-select0 ) in lo
 gpio-507 ( |los ) in lo IRQ
 gpio-508 ( |mod-def0 ) in lo IRQ
1919.06 3309.55
 gpio-504 ( |tx-fault ) in hi IRQ
 gpio-505 ( |tx-disable ) out lo
 gpio-506 ( |rate-select0 ) in lo
 gpio-507 ( |los ) in lo IRQ
 gpio-508 ( |mod-def0 ) in lo IRQ
1919.07 3309.57
 gpio-504 ( |tx-fault ) in lo IRQ
 gpio-505 ( |tx-disable ) out lo
 gpio-506 ( |rate-select0 ) in lo
 gpio-507 ( |los ) in lo IRQ
 gpio-508 ( |mod-def0 ) in lo IRQ
1920.68 3312.28
 gpio-504 ( |tx-fault ) in hi IRQ
 gpio-505 ( |tx-disable ) out lo
 gpio-506 ( |rate-select0 ) in lo
 gpio-507 ( |los ) in lo IRQ
 gpio-508 ( |mod-def0 ) in lo IRQ
1921.86 3314.21
 gpio-504 ( |tx-fault ) in lo IRQ
 gpio-505 ( |tx-disable ) out lo
 gpio-506 ( |rate-select0 ) in lo
 gpio-507 ( |los ) in lo IRQ
 gpio-508 ( |mod-def0 ) in lo IRQ
1921.86 3314.21

> However, bear in mind that even this will not be good enough to spot
> transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt
> capable, watching /proc/interrupts may tell you more.
>
> If the TX_FAULT signal is as stable as you claim it is, you should see
> the interrupt count for it remaining the same.

Once the iif is up those values do indeed remain stable.

cat /proc/interrupts | grep sfp
 52: 0 0 pca953x 4 Edge sfp
 53: 0 0 pca953x 3 Edge sfp
 54: 6 0 pca953x 0 Edge sfp

They only increment with ifupdown action (which would be logical):

cat /proc/interrupts | grep sfp
 52: 0 0 pca953x 4 Edge sfp
 53: 0 0 pca953x 3 Edge sfp
 54: 11 0 pca953x 0 Edge sfp
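Comparing two /proc/interrupts snapshots like the ones above can be
scripted. The sketch below is only an illustration: the helper name
irq_total is hypothetical, and it assumes the column layout shown in the
snapshots (per-CPU counts, then "pca953x", the expander pin number,
"Edge", and the name "sfp"). It also assumes that pin 0 is tx-fault, going
by the gpioinfo listing earlier in the thread (line 0: tx-fault, line 3:
los, line 4: mod-def0).

```shell
#!/bin/sh
# irq_total PIN
# Hypothetical helper: reads /proc/interrupts-style text on stdin and
# prints the interrupt count, summed over all CPU columns, for the
# "pca953x <PIN> Edge sfp" line. Column layout assumed from the thread.
irq_total() {
    awk -v pin="$1" '
        $NF == "sfp" && $(NF-3) == "pca953x" && $(NF-2) == pin {
            total = 0
            for (i = 2; i <= NF - 4; i++) total += $i
            print total
        }'
}

# The two snapshots for pin 0 quoted in the message above.
before=$(printf '%s\n' '54: 6 0 pca953x 0 Edge sfp' | irq_total 0)
after=$(printf '%s\n' '54: 11 0 pca953x 0 Edge sfp' | irq_total 0)
echo "pin 0 interrupts between snapshots: $((after - before))"
```

On a live system the same helper could be fed from the file directly, as
in irq_total 0 < /proc/interrupts, taken once before and once after the
interface is brought down and up.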
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 18:44 ` ѽ҉ᶬḳ℠ @ 2020-01-10 19:01 ` Russell King - ARM Linux admin 2020-01-10 19:36 ` ѽ҉ᶬḳ℠ 2020-01-10 19:23 ` Andrew Lunn 1 sibling, 1 reply; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-10 19:01 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Fri, Jan 10, 2020 at 06:44:18PM +0000, ѽ҉ᶬḳ℠ wrote: > On 10/01/2020 17:38, Russell King - ARM Linux admin wrote: > > > > > > On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote: > > > > > Seems that the debug avenue has been exhausted, short of running SFP.C in > > > > > debug mode. > > > > You're saying you never see TX_FAULT asserted other than when the > > > > interface is down? > > > Yes, it never exhibits once the iif is up - it is rock-stable in that state, > > > only ever when being transitioned from down state to up state. > > > Pardon, if that has not been made explicitly clear previously. > > I think if we were to have SFP debug enabled, you'll find that > > TX_FAULT is being reported to SFP as being asserted. > > If really necessary I could ask the TOS developers to assist, not sure > whether they would oblidge. Their Master branch build bot compiles twice a > day. > Would it just involve setting a kernel debug flag or something more > elaborate? > > > > > You probably aren't running that while loop, as it will exit when > > it sees TX_FAULT asserted. So, here's another bit of shell code > > for you to run: > > > > ip li set dev eth2 down; \ > > ip li set dev eth2 up; \ > > date > > while :; do > > cat /proc/uptime > > while ! grep -A5 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done > > cat /proc/uptime > > while ! grep -A5 'tx-fault.*in lo' /sys/kernel/debug/gpio; do :; done > > done > > > > This will give you output such as: > > > > Fri 10 Jan 17:31:06 GMT 2020 > > 774869.13 1535859.48 > > gpio-509 ( |tx-fault ) in hi ... > > 774869.14 1535859.49 > > gpio-509 ( |tx-fault ) in lo ... 
> > 774869.15 1535859.50 > > > > The first date and "uptime" output is the timestamp when the interface > > was brought up. Subsequent "uptime" outputs can be used to calculate > > the time difference in seconds between the state printed immediately > > prior to the uptime output, and the first "uptime" output. > > > > So in the above example, the tx-fault signal was hi at 10ms, and then > > went low 20ms after the up. > > awfully nice of you to provide the code, this is the output when the iif is > brought down and up again and exhibiting the transmit fault. > > ip li set dev eth2 down; \ > > ip li set dev eth2 up; \ > > date > Fri Jan 10 18:34:52 GMT 2020 > root@to:~# while :; do > > cat /proc/uptime > > while ! grep -A5 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done > > cat /proc/uptime > > while ! grep -A5 'tx-fault.*in lo' /sys/kernel/debug/gpio; do :; done > > done Hmm, I missed a ; \ at the end of "date", so this isn't quite what I wanted, but it'll do. What that means is that: > 1865.20 3224.67 doesn't bear the relationship that I wanted to the interface coming up. > gpio-504 ( |tx-fault ) in hi IRQ > gpio-505 ( |tx-disable ) out hi > gpio-506 ( |rate-select0 ) in lo > gpio-507 ( |los ) in lo IRQ > gpio-508 ( |mod-def0 ) in lo IRQ > 1871.77 3230.71 TX_FAULT is high at 1871.77 and TX_DISABLE is high, so the interface is down. > gpio-504 ( |tx-fault ) in lo IRQ > gpio-505 ( |tx-disable ) out lo > gpio-506 ( |rate-select0 ) in lo > gpio-507 ( |los ) in lo IRQ > gpio-508 ( |mod-def0 ) in lo IRQ > 1919.06 3309.55 Almost 47.3s later, TX_FAULT has gone low. > gpio-504 ( |tx-fault ) in hi IRQ > gpio-505 ( |tx-disable ) out lo > gpio-506 ( |rate-select0 ) in lo > gpio-507 ( |los ) in lo IRQ > gpio-508 ( |mod-def0 ) in lo IRQ > 1919.07 3309.57 After 10ms, it goes high again - this will cause the first report of a transmit fault. 
> gpio-504 ( |tx-fault ) in lo IRQ > gpio-505 ( |tx-disable ) out lo > gpio-506 ( |rate-select0 ) in lo > gpio-507 ( |los ) in lo IRQ > gpio-508 ( |mod-def0 ) in lo IRQ > 1920.68 3312.28 About 1.6s later, it goes low, maybe as a result of the first attempt to clear the fault by a brief pulse on TX_DISABLE. So, we wait 1s before asserting TX_DISABLE for 10us, which would have happened around 1920.07. We then have 300ms for initialisation, which would've taken us to 1920.37, so this may have been interpreted as the fault still being present. The next clearance attempt would have been scheduled for about 1921.37. > gpio-504 ( |tx-fault ) in hi IRQ > gpio-505 ( |tx-disable ) out lo > gpio-506 ( |rate-select0 ) in lo > gpio-507 ( |los ) in lo IRQ > gpio-508 ( |mod-def0 ) in lo IRQ > 1921.86 3314.21 1.2s later, it re-asserts. > gpio-504 ( |tx-fault ) in lo IRQ > gpio-505 ( |tx-disable ) out lo > gpio-506 ( |rate-select0 ) in lo > gpio-507 ( |los ) in lo IRQ > gpio-508 ( |mod-def0 ) in lo IRQ > 1921.86 3314.21 and deasserts within the same 10ms. > > However, bear in mind that even this will not be good enough to spot > > transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt > > capable, watching /proc/interrupts may tell you more. > > > > If the TX_FAULT signal is as stable as you claim it is, you should see > > the interrupt count for it remaining the same. > > Once the iif is up those values remain stable indeed. > > cat /proc/interrupts | grep sfp > 52: 0 0 pca953x 4 Edge sfp > 53: 0 0 pca953x 3 Edge sfp > 54: 6 0 pca953x 0 Edge sfp > > and only incrementing with ifupdown action (which would be logical) > > cat /proc/interrupts | grep sfp > 52: 0 0 pca953x 4 Edge sfp > 53: 0 0 pca953x 3 Edge sfp > 54: 11 0 pca953x 0 Edge sfp According to this, TX_FAULT has toggled five times. This would seem to negate your previous comment about TX_FAULT being stable. 
Therefore, I'd say that the SFP state machines are operating as designed, and as per the SFP MSA, and what we have is a module that likes to assert TX_FAULT for unknown reasons, and this confirms the hypothesis I've been putting forward. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up ^ permalink raw reply [flat|nested] 40+ messages in thread
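[Editor's sketch] The intervals quoted in the analysis above ("almost 47.3s", "10ms", "about 1.6s", "1.2s") and the retry times (1920.07, 1920.37, 1921.37) follow directly from the uptime stamps. The snippet below only reproduces that arithmetic. The function names are hypothetical, and the 1s wait and 300ms initialisation window are the figures given in the message, not values read from the driver.

```sh
#!/bin/sh
# Print the gap in seconds between consecutive uptime stamps
# (first column of /proc/uptime), one stamp per input line.
uptime_deltas() {
    awk 'NR > 1 { printf "%.2f\n", $1 - prev } { prev = $1 }'
}

# Given the time a fault was first seen, print when the TX_DISABLE pulse,
# the end of the 300ms initialisation window, and the next pulse would fall.
# The 10us pulse width is ignored at this resolution.
retry_schedule() {
    awk -v t="$1" 'BEGIN {
        printf "pulse %.2f\ninit_done %.2f\nnext_pulse %.2f\n", t + 1, t + 1.3, t + 2.3
    }'
}

# Stamps from the capture, starting at the first one taken with the interface down.
printf '%s\n' 1871.77 1919.06 1919.07 1920.68 1921.86 | uptime_deltas
# prints 47.29, 0.01, 1.61, 1.18 (one per line)

retry_schedule 1919.07
# prints pulse 1920.07, init_done 1920.37, next_pulse 1921.37
```

Note that /proc/uptime only has 10ms resolution, so the "0.01" gap is the smallest interval this method can show.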
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 19:01 ` Russell King - ARM Linux admin @ 2020-01-10 19:36 ` ѽ҉ᶬḳ℠ 2020-01-10 19:55 ` Russell King - ARM Linux admin 0 siblings, 1 reply; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-10 19:36 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev On 10/01/2020 19:01, Russell King - ARM Linux admin wrote: > On Fri, Jan 10, 2020 at 06:44:18PM +0000, ѽ҉ᶬḳ℠ wrote: >> On 10/01/2020 17:38, Russell King - ARM Linux admin wrote: >>>>> On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote: >>>>>> Seems that the debug avenue has been exhausted, short of running SFP.C in >>>>>> debug mode. >>>>> You're saying you never see TX_FAULT asserted other than when the >>>>> interface is down? >>>> Yes, it never exhibits once the iif is up - it is rock-stable in that state, >>>> only ever when being transitioned from down state to up state. >>>> Pardon, if that has not been made explicitly clear previously. >>> I think if we were to have SFP debug enabled, you'll find that >>> TX_FAULT is being reported to SFP as being asserted. >> If really necessary I could ask the TOS developers to assist, not sure >> whether they would oblidge. Their Master branch build bot compiles twice a >> day. >> Would it just involve setting a kernel debug flag or something more >> elaborate? >> >>> You probably aren't running that while loop, as it will exit when >>> it sees TX_FAULT asserted. So, here's another bit of shell code >>> for you to run: >>> >>> ip li set dev eth2 down; \ >>> ip li set dev eth2 up; \ >>> date >>> while :; do >>> cat /proc/uptime >>> while ! grep -A5 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done >>> cat /proc/uptime >>> while ! grep -A5 'tx-fault.*in lo' /sys/kernel/debug/gpio; do :; done >>> done >>> >>> This will give you output such as: >>> >>> Fri 10 Jan 17:31:06 GMT 2020 >>> 774869.13 1535859.48 >>> gpio-509 ( |tx-fault ) in hi ... 
>>> 774869.14 1535859.49 >>> gpio-509 ( |tx-fault ) in lo ... >>> 774869.15 1535859.50 >>> >>> The first date and "uptime" output is the timestamp when the interface >>> was brought up. Subsequent "uptime" outputs can be used to calculate >>> the time difference in seconds between the state printed immediately >>> prior to the uptime output, and the first "uptime" output. >>> >>> So in the above example, the tx-fault signal was hi at 10ms, and then >>> went low 20ms after the up. >> awfully nice of you to provide the code, this is the output when the iif is >> brought down and up again and exhibiting the transmit fault. >> >> ip li set dev eth2 down; \ >>> ip li set dev eth2 up; \ >>> date >> Fri Jan 10 18:34:52 GMT 2020 >> root@to:~# while :; do >>> cat /proc/uptime >>> while ! grep -A5 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done >>> cat /proc/uptime >>> while ! grep -A5 'tx-fault.*in lo' /sys/kernel/debug/gpio; do :; done >>> done > Hmm, I missed a ; \ at the end of "date", so this isn't quite what > I wanted, but it'll do. What that means is that: > >> 1865.20 3224.67 > doesn't bear the relationship that I wanted to the interface coming > up. > >> gpio-504 ( |tx-fault ) in hi IRQ >> gpio-505 ( |tx-disable ) out hi >> gpio-506 ( |rate-select0 ) in lo >> gpio-507 ( |los ) in lo IRQ >> gpio-508 ( |mod-def0 ) in lo IRQ >> 1871.77 3230.71 > TX_FAULT is high at 1871.77 and TX_DISABLE is high, so the interface > is down. > >> gpio-504 ( |tx-fault ) in lo IRQ >> gpio-505 ( |tx-disable ) out lo >> gpio-506 ( |rate-select0 ) in lo >> gpio-507 ( |los ) in lo IRQ >> gpio-508 ( |mod-def0 ) in lo IRQ >> 1919.06 3309.55 > Almost 47.3s later, TX_FAULT has gone low. 
This correlates with invoking ifup > >> gpio-504 ( |tx-fault ) in hi IRQ >> gpio-505 ( |tx-disable ) out lo >> gpio-506 ( |rate-select0 ) in lo >> gpio-507 ( |los ) in lo IRQ >> gpio-508 ( |mod-def0 ) in lo IRQ >> 1919.07 3309.57 > After 10ms, it goes high again - this will cause the first report of > a transmit fault. > >> gpio-504 ( |tx-fault ) in lo IRQ >> gpio-505 ( |tx-disable ) out lo >> gpio-506 ( |rate-select0 ) in lo >> gpio-507 ( |los ) in lo IRQ >> gpio-508 ( |mod-def0 ) in lo IRQ >> 1920.68 3312.28 > About 1.6s later, it goes low, maybe as a result of the first attempt > to clear the fault by a brief pulse on TX_DISABLE. > > So, we wait 1s before asserting TX_DISABLE for 10us, which would have > happened around 1920.07. We then have 300ms for initialisation, which > would've taken us to 1920.37, so this may have been interpreted as the > fault still being present. The next clearance attempt would have been > scheduled for about 1921.37. > >> gpio-504 ( |tx-fault ) in hi IRQ >> gpio-505 ( |tx-disable ) out lo >> gpio-506 ( |rate-select0 ) in lo >> gpio-507 ( |los ) in lo IRQ >> gpio-508 ( |mod-def0 ) in lo IRQ >> 1921.86 3314.21 > 1.2s later, it re-asserts. > >> gpio-504 ( |tx-fault ) in lo IRQ >> gpio-505 ( |tx-disable ) out lo >> gpio-506 ( |rate-select0 ) in lo >> gpio-507 ( |los ) in lo IRQ >> gpio-508 ( |mod-def0 ) in lo IRQ >> 1921.86 3314.21 > and deasserts within the same 10ms. > >>> However, bear in mind that even this will not be good enough to spot >>> transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt >>> capable, watching /proc/interrupts may tell you more. >>> >>> If the TX_FAULT signal is as stable as you claim it is, you should see >>> the interrupt count for it remaining the same. >> Once the iif is up those values remain stable indeed. 
>> >> cat /proc/interrupts | grep sfp >> 52: 0 0 pca953x 4 Edge sfp >> 53: 0 0 pca953x 3 Edge sfp >> 54: 6 0 pca953x 0 Edge sfp >> >> and only incrementing with ifupdown action (which would be logical) >> >> cat /proc/interrupts | grep sfp >> 52: 0 0 pca953x 4 Edge sfp >> 53: 0 0 pca953x 3 Edge sfp >> 54: 11 0 pca953x 0 Edge sfp > According to this, TX_FAULT has toggled five times. > > This would seem to negate your previous comment about TX_FAULT being > stable. Maybe you misread of what I wrote - it is stable once the iff is up, the values do not change. The 5 toggles are resulting from manually invoking ifupdown action. > Therefore, I'd say that the SFP state machines are operating as > designed, and as per the SFP MSA, and what we have is a module that > likes to assert TX_FAULT for unknown reasons, and this confirms the > hypothesis I've been putting forward. > This is based on the 5 IRQ toggles or the previous reading on the GPIO output? Surely I have no clue about the time frames the modules asserts / deasserts but since it works in general and only exhibits the issue intermittently with ifupdown action after it has been brought up initially it does not seem to be caused by the module. If the module would be misbehaving in general it would never pass the sm check and thus be entirely unusable. Which though is not the case. 
Suppose I just have to wait with fingers crossed that maybe at some point
in the future the issue stops somehow, considering:

- the module may not meet the SFP MSA 100%, though that MSA with its "shall"
  and/or "may" wording does not appear obligatory and leaves wiggle room
- the chipset designer / manufacturer may not provide all necessary
  documentation in the public domain and/or under a GPL license
- the vendor distributing the module cannot be bothered with support
- downstream distros building from upstream source likely lack the
  resources to look into this and are anyway not patching SFP.C in the
  first place
- upstream development has provided the most extensive support, trying to
  get to the bottom of it, and concluded that the module is misbehaving

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 19:36 ` ѽ҉ᶬḳ℠ @ 2020-01-10 19:55 ` Russell King - ARM Linux admin 2020-01-10 20:27 ` ѽ҉ᶬḳ℠ 0 siblings, 1 reply; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-10 19:55 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Fri, Jan 10, 2020 at 07:36:04PM +0000, ѽ҉ᶬḳ℠ wrote: > On 10/01/2020 19:01, Russell King - ARM Linux admin wrote: > > On Fri, Jan 10, 2020 at 06:44:18PM +0000, ѽ҉ᶬḳ℠ wrote: > > > On 10/01/2020 17:38, Russell King - ARM Linux admin wrote: > > > > > > On Fri, Jan 10, 2020 at 04:53:06PM +0000, ѽ҉ᶬḳ℠ wrote: > > > > > > > Seems that the debug avenue has been exhausted, short of running SFP.C in > > > > > > > debug mode. > > > > > > You're saying you never see TX_FAULT asserted other than when the > > > > > > interface is down? > > > > > Yes, it never exhibits once the iif is up - it is rock-stable in that state, > > > > > only ever when being transitioned from down state to up state. > > > > > Pardon, if that has not been made explicitly clear previously. > > > > I think if we were to have SFP debug enabled, you'll find that > > > > TX_FAULT is being reported to SFP as being asserted. > > > If really necessary I could ask the TOS developers to assist, not sure > > > whether they would oblidge. Their Master branch build bot compiles twice a > > > day. > > > Would it just involve setting a kernel debug flag or something more > > > elaborate? > > > > > > > You probably aren't running that while loop, as it will exit when > > > > it sees TX_FAULT asserted. So, here's another bit of shell code > > > > for you to run: > > > > > > > > ip li set dev eth2 down; \ > > > > ip li set dev eth2 up; \ > > > > date > > > > while :; do > > > > cat /proc/uptime > > > > while ! grep -A5 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done > > > > cat /proc/uptime > > > > while ! 
grep -A5 'tx-fault.*in lo' /sys/kernel/debug/gpio; do :; done > > > > done > > > > > > > > This will give you output such as: > > > > > > > > Fri 10 Jan 17:31:06 GMT 2020 > > > > 774869.13 1535859.48 > > > > gpio-509 ( |tx-fault ) in hi ... > > > > 774869.14 1535859.49 > > > > gpio-509 ( |tx-fault ) in lo ... > > > > 774869.15 1535859.50 > > > > > > > > The first date and "uptime" output is the timestamp when the interface > > > > was brought up. Subsequent "uptime" outputs can be used to calculate > > > > the time difference in seconds between the state printed immediately > > > > prior to the uptime output, and the first "uptime" output. > > > > > > > > So in the above example, the tx-fault signal was hi at 10ms, and then > > > > went low 20ms after the up. > > > awfully nice of you to provide the code, this is the output when the iif is > > > brought down and up again and exhibiting the transmit fault. > > > > > > ip li set dev eth2 down; \ > > > > ip li set dev eth2 up; \ > > > > date > > > Fri Jan 10 18:34:52 GMT 2020 > > > root@to:~# while :; do > > > > cat /proc/uptime > > > > while ! grep -A5 'tx-fault.*in hi' /sys/kernel/debug/gpio; do :; done > > > > cat /proc/uptime > > > > while ! grep -A5 'tx-fault.*in lo' /sys/kernel/debug/gpio; do :; done > > > > done > > Hmm, I missed a ; \ at the end of "date", so this isn't quite what > > I wanted, but it'll do. What that means is that: > > > > > 1865.20 3224.67 > > doesn't bear the relationship that I wanted to the interface coming > > up. > > > > > gpio-504 ( |tx-fault ) in hi IRQ > > > gpio-505 ( |tx-disable ) out hi > > > gpio-506 ( |rate-select0 ) in lo > > > gpio-507 ( |los ) in lo IRQ > > > gpio-508 ( |mod-def0 ) in lo IRQ > > > 1871.77 3230.71 > > TX_FAULT is high at 1871.77 and TX_DISABLE is high, so the interface > > is down. 
> > > > > gpio-504 ( |tx-fault ) in lo IRQ > > > gpio-505 ( |tx-disable ) out lo > > > gpio-506 ( |rate-select0 ) in lo > > > gpio-507 ( |los ) in lo IRQ > > > gpio-508 ( |mod-def0 ) in lo IRQ > > > 1919.06 3309.55 > > Almost 47.3s later, TX_FAULT has gone low. > > This correlates with invoking ifup Yes, which concurs with my analysis. Everything that happens from this point on... > > > gpio-504 ( |tx-fault ) in hi IRQ > > > gpio-505 ( |tx-disable ) out lo > > > gpio-506 ( |rate-select0 ) in lo > > > gpio-507 ( |los ) in lo IRQ > > > gpio-508 ( |mod-def0 ) in lo IRQ > > > 1919.07 3309.57 > > After 10ms, it goes high again - this will cause the first report of > > a transmit fault. > > > > > gpio-504 ( |tx-fault ) in lo IRQ > > > gpio-505 ( |tx-disable ) out lo > > > gpio-506 ( |rate-select0 ) in lo > > > gpio-507 ( |los ) in lo IRQ > > > gpio-508 ( |mod-def0 ) in lo IRQ > > > 1920.68 3312.28 > > About 1.6s later, it goes low, maybe as a result of the first attempt > > to clear the fault by a brief pulse on TX_DISABLE. > > > > So, we wait 1s before asserting TX_DISABLE for 10us, which would have > > happened around 1920.07. We then have 300ms for initialisation, which > > would've taken us to 1920.37, so this may have been interpreted as the > > fault still being present. The next clearance attempt would have been > > scheduled for about 1921.37. > > > > > gpio-504 ( |tx-fault ) in hi IRQ > > > gpio-505 ( |tx-disable ) out lo > > > gpio-506 ( |rate-select0 ) in lo > > > gpio-507 ( |los ) in lo IRQ > > > gpio-508 ( |mod-def0 ) in lo IRQ > > > 1921.86 3314.21 > > 1.2s later, it re-asserts. > > > > > gpio-504 ( |tx-fault ) in lo IRQ > > > gpio-505 ( |tx-disable ) out lo > > > gpio-506 ( |rate-select0 ) in lo > > > gpio-507 ( |los ) in lo IRQ > > > gpio-508 ( |mod-def0 ) in lo IRQ > > > 1921.86 3314.21 > > and deasserts within the same 10ms. ... 
is the "funny stuff" that is triggering the TX fault warnings and should not be happening if the module was compliant with the SFP MSA. > > > > However, bear in mind that even this will not be good enough to spot > > > > transitory changes on TX_FAULT - as your I2C GPIO expander is interrupt > > > > capable, watching /proc/interrupts may tell you more. > > > > > > > > If the TX_FAULT signal is as stable as you claim it is, you should see > > > > the interrupt count for it remaining the same. > > > Once the iif is up those values remain stable indeed. > > > > > > cat /proc/interrupts | grep sfp > > > 52: 0 0 pca953x 4 Edge sfp > > > 53: 0 0 pca953x 3 Edge sfp > > > 54: 6 0 pca953x 0 Edge sfp > > > > > > and only incrementing with ifupdown action (which would be logical) > > > > > > cat /proc/interrupts | grep sfp > > > 52: 0 0 pca953x 4 Edge sfp > > > 53: 0 0 pca953x 3 Edge sfp > > > 54: 11 0 pca953x 0 Edge sfp > > According to this, TX_FAULT has toggled five times. > > > > This would seem to negate your previous comment about TX_FAULT being > > stable. > > Maybe you misread of what I wrote - it is stable once the iff is up, the > values do not change. Define "stable once the interface is up". Is that stable after ten seconds? Or stable in under the 300ms initialisation delay allowed by the SFP MSA? > The 5 toggles are resulting from manually invoking ifupdown action. > > > Therefore, I'd say that the SFP state machines are operating as > > designed, and as per the SFP MSA, and what we have is a module that > > likes to assert TX_FAULT for unknown reasons, and this confirms the > > hypothesis I've been putting forward. > > > > This is based on the 5 IRQ toggles or the previous reading on the GPIO > output? On _both_. 
> Surely I have no clue about the time frames the modules asserts / deasserts > but since it works in general and only exhibits the issue intermittently > with ifupdown action after it has been brought up initially it does not seem > to be caused by the module. > > If the module would be misbehaving in general it would never pass the sm > check and thus be entirely unusable. Which though is not the case. > > Suppose I have just to wait with fingers crossed that maybe at some point in > the future the issue stops somehow, considering: > > - the module may not meet SFP MSA 100%, though that MSA with "shall" and/or > "may" wording does not appear obligatory and leaves wiggle room > - the chipset designer / manufacturer may not provide all necessary > documentation in the public domain a/o with GPL license > - the vendor distributing the module cannot be bothered with support > - downstream distros building from upstream source likely lacking the > resources to look into this and anyway not patching SPF.C in the first place > - upstream development has provided the most extensive support, trying to > get to the bottom of it, and concluding the module misbehaving Okay, I give up trying to help you. Sorry, but I've spent a lot of time over the last two days trying to help and explain stuff, and you seem to want to constantly tell me I'm wrong, or misreading what you're saying, or that there's some problem with the "sm check" when I've already pointed out is a figment of your imagination. Given also that your UTF-8 in your From: line _really_ screws up the index in my mutt mail reader on the Linux console, disrupting my ability to read other emails, it's now time for me to call an end to this. Sorry, but I'm not prepared to help any further. 
-- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 19:55 ` Russell King - ARM Linux admin @ 2020-01-10 20:27 ` ѽ҉ᶬḳ℠ 0 siblings, 0 replies; 40+ messages in thread From: ѽ҉ᶬḳ℠ @ 2020-01-10 20:27 UTC (permalink / raw) To: Russell King - ARM Linux admin; +Cc: Andrew Lunn, netdev On 10/01/2020 19:55, Russell King - ARM Linux admin wrote: > > Define "stable once the interface is up". Is that stable after ten > seconds? Or stable in under the 300ms initialisation delay allowed > by the SFP MSA? The router boots, SFP.C is called and performs its functions and the module gets online and stays that way. At some point the modules thus apparently passes the checks, incl. the under the 300ms initialisation, or else it would never get online and I could trash it. Once up it is rock-solid - the IRQ values are staying constant. If at later stage the iif is being brought down and up again the issue starts to exhibit. > >> The 5 toggles are resulting from manually invoking ifupdown action. >> >>> Therefore, I'd say that the SFP state machines are operating as >>> designed, and as per the SFP MSA, and what we have is a module that >>> likes to assert TX_FAULT for unknown reasons, and this confirms the >>> hypothesis I've been putting forward. >>> >> This is based on the 5 IRQ toggles or the previous reading on the GPIO >> output? > On _both_. > > > Okay, I give up trying to help you. Sorry, but I've spent a lot of > time over the last two days trying to help and explain stuff, and > you seem to want to constantly tell me I'm wrong, or misreading what > you're saying, or that there's some problem with the "sm check" > when I've already pointed out is a figment of your imagination. Not sure really why took such offence from that bit of summary. 
I am not saying/implying that you are wrong, just beg to differ - there is no explanation why the module is passing the test on initial up (at boot time) but failing intermittently with ifupdown action later on, that is all I am saying. That the module is failing checks is hardly a figment of my imagination or else I would not have bothered in seeking support in various places, and prior reaching out all the way upstream having tried first in this order: - TOS - OpenWrt - vendor > Sorry, but I'm not prepared to help any further. It would be just sad to leave on such note and thus just let me emphasize that I have thoroughly enjoyed and appreciated the exchange. Unless you object me posting to this mailing list I would just remove your email address then. ^ permalink raw reply [flat|nested] 40+ messages in thread
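[Editor's sketch] The open question in this exchange is how long after link-up TX_FAULT keeps toggling. One way to measure it is to bring the interface up and log every TX_FAULT transition with its offset, all inside one function so that no step waits on the interactive prompt (the problem with the unchained "date" noted earlier in the thread). The function names are hypothetical; the script assumes the debugfs GPIO dump format shown in the thread, root access, and a mounted debugfs. Polling can still miss short glitches, as already pointed out in the thread.

```sh
#!/bin/sh
# Print "hi" or "lo" for a named line in a debugfs GPIO dump.
# usage: gpio_state NAME [FILE]        (hypothetical helper)
gpio_state() {
    awk -v name="|$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == name) {
                    # The level follows the direction keyword ("in" or "out").
                    for (j = i + 1; j < NF; j++)
                        if ($j == "in" || $j == "out") { print $(j + 1); exit }
                }
        }' "${2:-/sys/kernel/debug/gpio}"
}

# Cycle IFACE and log each TX_FAULT transition for DURATION seconds.
# usage: watch_tx_fault IFACE DURATION
watch_tx_fault() {
    iface=$1
    duration=$2
    ip link set dev "$iface" down
    ip link set dev "$iface" up
    start=$(cut -d' ' -f1 /proc/uptime)
    last=
    while :; do
        now=$(cut -d' ' -f1 /proc/uptime)
        state=$(gpio_state tx-fault)
        if [ "$state" != "$last" ]; then
            awk -v s="$start" -v n="$now" -v st="$state" \
                'BEGIN { printf "+%.2fs tx-fault %s\n", n - s, st }'
            last=$state
        fi
        # Leave the loop once the observation window has elapsed.
        awk -v s="$start" -v n="$now" -v d="$duration" \
            'BEGIN { exit !(n - s >= d) }' && break
    done
}

# Example (as root): watch_tx_fault eth2 5
```

Each loop iteration spawns a few processes, so the effective sampling interval is coarse; the interrupt counters remain the more reliable indicator of brief toggles.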
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-10 18:44 ` ѽ҉ᶬḳ℠ 2020-01-10 19:01 ` Russell King - ARM Linux admin @ 2020-01-10 19:23 ` Andrew Lunn 2020-01-11 12:58 ` ѽ҉ᶬḳ℠ 1 sibling, 1 reply; 40+ messages in thread From: Andrew Lunn @ 2020-01-10 19:23 UTC (permalink / raw) To: ѽ҉ᶬḳ℠ Cc: Russell King - ARM Linux admin, netdev > If really necessary I could ask the TOS developers to assist, not sure > whether they would oblidge. Their Master branch build bot compiles twice a > day. > Would it just involve setting a kernel debug flag or something more > elaborate? You could ask them to build a kernel with dynamic debug enabled https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html You can then turn on debugging in a flexible way. And it will be useful for more than just you. Andrew ^ permalink raw reply [flat|nested] 40+ messages in thread
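[Editor's sketch] Once a kernel with dynamic debug is available, the debug messages of a single source file can be switched on at runtime by writing a query to the dynamic debug control file, as the linked howto describes. The helper names below are hypothetical, and the sketch assumes CONFIG_DYNAMIC_DEBUG is set, debugfs is mounted at /sys/kernel/debug, and the commands run as root. Which messages then appear depends on what the driver in that kernel version actually logs.

```sh
#!/bin/sh
# Build a dynamic-debug query that switches debug output for one source file.
# usage: dyndbg_query FILE on|off      (hypothetical helper)
dyndbg_query() {
    case $2 in
        on)  printf 'file %s +p\n' "$1" ;;
        off) printf 'file %s -p\n' "$1" ;;
        *)   return 1 ;;
    esac
}

# Write the query to the control file. DYNDBG_CONTROL is an illustrative
# override so the function can be tried against an ordinary file.
dyndbg_apply() {
    control=${DYNDBG_CONTROL:-/sys/kernel/debug/dynamic_debug/control}
    dyndbg_query "$1" "$2" > "$control"
}

# Example (as root):
#   dyndbg_apply sfp.c on
#   ip link set dev eth2 down; ip link set dev eth2 up
#   dmesg | tail
#   dyndbg_apply sfp.c off
```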
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks
  2020-01-10 19:23 ` Andrew Lunn
@ 2020-01-11 12:58 ` ѽ҉ᶬḳ℠
  0 siblings, 0 replies; 40+ messages in thread
From: ѽ҉ᶬḳ℠ @ 2020-01-11 12:58 UTC (permalink / raw)
  To: netdev; +Cc: Andrew Lunn

On 10/01/2020 19:23, Andrew Lunn wrote:
>> If really necessary I could ask the TOS developers to assist, not sure
>> whether they would oblidge. Their Master branch build bot compiles twice a
>> day.
>> Would it just involve setting a kernel debug flag or something more
>> elaborate?
> You could ask them to build a kernel with dynamic debug enabled
>
> https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html
>
> You can then turn on debugging in a flexible way. And it will be
> useful for more than just you.
>
> Andrew

I have put in the request and will see how it goes; if positive I would
probably need a bit of guidance on how to leverage it.

Meantime, assuming for a moment that:

- the issue is not caused by SFP.C (emphasising that I reckon it is indeed
  not)
- the issue is not caused by the module either (which I am not certain of),

and considering that /sys/kernel/debug/gpio is not mounted on a block
device, I was wondering whether the I2C bus could potentially get choked
in some way, preventing the module from propagating the change in tx-fault
signal state to the kernel in a timely fashion (300 ms), and thus be the
actual culprit?

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [drivers/net/phy/sfp] intermittent failure in state machine checks 2020-01-09 19:01 ` ѽ҉ᶬḳ℠ 2020-01-09 19:42 ` ѽ҉ᶬḳ℠ @ 2020-01-09 21:34 ` Russell King - ARM Linux admin 1 sibling, 0 replies; 40+ messages in thread From: Russell King - ARM Linux admin @ 2020-01-09 21:34 UTC (permalink / raw) To: ѽ҉ᶬḳ℠; +Cc: Andrew Lunn, netdev On Thu, Jan 09, 2020 at 07:01:10PM +0000, ѽ҉ᶬḳ℠ wrote: > On 09/01/2020 17:43, Russell King - ARM Linux admin wrote: > > On Thu, Jan 09, 2020 at 05:35:23PM +0000, ѽ҉ᶬḳ℠ wrote: > > > Thank you for the extensive feedback and explanation. > > > > > > Pardon for having mixed up the semantics on module specifications vs. EEPROM > > > dump... > > > > > > The module (chipset) been designed by Metanoia, not sure who is the actual > > > manufacturer, and probably just been branded Allnet. > > > The designer provides some proprietary management software (called EBM) to > > > their wholesale buyers only > > I have one of their early MT-V5311 modules, but it has no accessible > > EEPROM, and even if it did, it would be of no use to me being > > unapproved for connection to the BT Openreach network. (BT SIN 498 > > specifies non-standard power profile to avoid crosstalk issues with > > existing ADSL infrastructure, and I believe they regularly check the > > connected modem type and firmware versions against an approved list.) > > > > I haven't noticed the module I have asserting its TX_FAULT signal, > > but then its RJ45 has never been connected to anything. > > > > The curious (and sort of inexplicable) thing is that the module in general > works, i.e. at some point it must pass the sm checks or connectivity would > be failing constantly and thus the module being generally unusable. It all depends what the module does with the TX_FAULT signal. 
The state machine just follows what is laid down in the SFP MSA for
dealing with a transmit fault, although the attempts to clear it and
the delay from TX_FAULT being asserted to attempting to clear are
decisions of my own.

It isn't a race in the state machine.

You can check the state of the GPIOs by looking at
/sys/kernel/debug/gpio, and you will probably see that TX_FAULT is
being asserted by the module.

I'm aware of something similar with a certain GPON module, but we
haven't been able to properly work out what is going on there either -
again, it seems pretty random what the module does with the TX_FAULT
signal.

> It somehow "feels" that the module is storing some link signal information
> in a register which does not suit the sm check routine and only when that
> register clears the sm check routine passes and connectivity is restored.

You're reading /way/ too much into the state machine. The state machine
is only concerned with two signals from the module. One of them is the
RX_LOS signal, which indicates whether the module is receiving a valid
signal. The other is TX_FAULT, which is as I've already described.

Both of these are digital signals - either they are asserted or
deasserted, and the state machine will act accordingly. It's rather
simple.

> Since there are probably other such SFP modules, xDSL and g.fast, out there
> that do not provide laser safety circuitry by design (since not providing
> connectivity over fibre) would it perhaps not make sense to try checking for
> the existence of laser safety circuitry first prior getting to the sm
> checks?

There is no reliable way to do that; as I've already said, the EEPROM
contents are very hit and miss.

Essentially, SFPs suck, almost nothing can be really trusted with them.
This, I believe, is why commercial grade routers have this apparent
"vendor lockin", because no one can trust anyone else's EEPROM contents
to actually come close to the SFP MSA requirements - and then you have
modules that blatantly violate the SFP MSA in respect of timings.

I would not be surprised if this module's behaviour with TX_FAULT is
along those same lines; the manufacturer has decided to use TX_FAULT
for some other purpose against the SFP MSA, which will cause problems
in any SFP MSA compliant host.

> Sometime in the past sfp.c was not available in the distro and the issue
> never exhibited. Back then the module's operations mode been set through a
> py script - see bottom - but it would appear that it did not implement any
> sm checks.

That python script is very simple. It reads the EEPROM, and attempts
to work out what kind of link to use. It doesn't care about any of the
SFP control and status signals. It doesn't care if you yank the SFP out
of the cage.

BTW, I notice in your original kernel that you have at least one of my
"experimental" patches on your stable kernel, taken from my "phy" branch
which has never been in mainline, so I guess you're using the OpenWRT
kernel?

I have submitted patches to bring the SFP state machine up to what was
in v5.4 (and a few extra bits) to the OpenWRT maintainers as part of
some commercial work. As I say, I'm not expecting much to change as a
result of those given what you've reported thus far.

As I've said, I think it may need a quirk so we ignore the TX_FAULT
signal. Sorting out a patch to do that for a 4.19.xx kernel is not going
to happen soon, as the hardware I was building the OpenWRT kernel on
isn't in a functional state at the moment - and given the unknown status
of the previously submitted patches as well, I'm not inclined to produce
any further patches for OpenWRT at the moment.
-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up