* [RFC net-next] net: track locally triggered link loss
From: Jakub Kicinski @ 2022-05-20  0:45 UTC (permalink / raw)
  To: netdev
  Cc: linux, olteanv, andrew, hkallweit1, f.fainelli, saeedm,
	michael.chan, Jakub Kicinski

A comment above carrier_up_count / carrier_down_count in netdevice.h
proudly states:

	/* Stats to monitor link on/off, flapping */

In reality datacenter NIC drivers introduce quite a bit of noise
into those statistics, making them less than ideal for link flap
detection.

There are 3 types of events counted as carrier changes today:
  (1) reconfiguration which requires pausing Tx and Rx but doesn't
      actually result in a link down for the remote end;
  (2) reconfiguration events which do take the link down;
 (3a) real PHY-detected link loss due to remote end's actions;
 (3b) real PHY-detected link loss due to signal integrity issues.

(3a and 3b are indistinguishable to the local end, so they are counted as one.)

Reconfigurations of type (1) are when drivers call netif_carrier_off()
/ netif_carrier_on() around changes to queues, IRQs, time stamping,
XDP enablement etc. In DC scenarios machine provisioning or
reallocation causes multiple settings to be changed in close
succession. This looks like a spike in link flaps to monitoring
systems.

Suppressing the fake carrier changes while maintaining the Rx/Tx
pause behavior seems hard, and can introduce a divergence in what
the kernel thinks about the device (paused) vs what user space
thinks (running).

Another option would be to expose a link loss statistic which
some devices (FW) already maintain. Unfortunately, such stats
are not very common (unless my grepping skills fail me).

Instead this patch tries to expose a new event count - number
of locally caused link changes. Only "down" events are counted
because the "up" events are not really in our control.
The "real" link flaps can be obtained by subtracting the new
counter from carrier_down_count.

In terms of API, drivers can use netif_carrier_admin_off()
when taking the link down. There is also an API for situations
where the driver requests a link reset but expects the down / up
reporting to come through the usual, async path (a subset of type (2)).
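
To make this concrete, a rough sketch of the expected driver-side usage
(reconfig_queues() and send_fw_link_reset() below are made-up placeholders,
not code from any real driver):

	/* type (1)/(2): link goes down because of a local action,
	 * so it should not count as a "real" flap
	 */
	netif_carrier_admin_off(dev);
	reconfig_queues(dev);			/* hypothetical helper */
	netif_carrier_on(dev);

	/* when the down / up reporting arrives asynchronously,
	 * e.g. via a FW link event, wrap the trigger instead
	 */
	netif_carrier_local_changes_start(dev);
	send_fw_link_reset(dev);		/* hypothetical; events come later */
	netif_carrier_local_changes_end(dev);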

It may be worth pointing out that in the case of datacenter NICs,
even with the new statistic, we will not be able to distinguish
between events (1) and (2); it follows that two Linux boxes
connected back-to-back won't be able to isolate events of type (3b).
The link may stay up even if the driver does not need it, because
NCSI or some other entity requires it to stay up.
To solve that we'd need yet another counter -
carrier_down_local_link_was_really_reset - I think it's okay
to leave that as a future extension...

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
Posting as an RFC for general feedback on whether this is a good
way of solving the "driver called carrier_off to pause Tx" problem.

 include/linux/netdevice.h    | 24 ++++++++++++++++++++++++
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         |  6 ++++++
 net/sched/sch_generic.c      | 27 ++++++++++++++++++++++++---
 4 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1ce91cb8074a..4206748fb4d8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -310,6 +310,7 @@ enum netdev_state_t {
 	__LINK_STATE_LINKWATCH_PENDING,
 	__LINK_STATE_DORMANT,
 	__LINK_STATE_TESTING,
+	__LINK_STATE_NOCARRIER_LOCAL,
 };
 
 struct gro_list {
@@ -1766,6 +1767,8 @@ enum netdev_ml_priv_type {
  *			do not use this in drivers
  *	@carrier_up_count:	Number of times the carrier has been up
  *	@carrier_down_count:	Number of times the carrier has been down
+ *	@carrier_down_local:	Number of times the carrier has been counted as
+ *				down due to local administrative actions
  *
  *	@wireless_handlers:	List of functions to handle Wireless Extensions,
  *				instead of ioctl,
@@ -2055,6 +2058,7 @@ struct net_device {
 	/* Stats to monitor link on/off, flapping */
 	atomic_t		carrier_up_count;
 	atomic_t		carrier_down_count;
+	atomic_t		carrier_down_local;
 
 #ifdef CONFIG_WIRELESS_EXT
 	const struct iw_handler_def *wireless_handlers;
@@ -2310,6 +2314,7 @@ struct net_device {
 	bool			proto_down;
 	unsigned		wol_enabled:1;
 	unsigned		threaded:1;
+	unsigned		has_carrier_down_local:1;
 
 	struct list_head	net_notifier_list;
 
@@ -4059,8 +4064,27 @@ unsigned long dev_trans_start(struct net_device *dev);
 
 void __netdev_watchdog_up(struct net_device *dev);
 
+/**
+ * netif_carrier_local_changes_start() - enter local link reconfiguration
+ * @dev: network device
+ *
+ * Mark link as unstable due to local administrative actions. This will
+ * cause netif_carrier_off() to behave like netif_carrier_admin_off() until
+ * netif_carrier_local_changes_end() is called.
+ */
+static inline void netif_carrier_local_changes_start(struct net_device *dev)
+{
+	set_bit(__LINK_STATE_NOCARRIER_LOCAL, &dev->state);
+}
+
+static inline void netif_carrier_local_changes_end(struct net_device *dev)
+{
+	clear_bit(__LINK_STATE_NOCARRIER_LOCAL, &dev->state);
+}
+
 void netif_carrier_on(struct net_device *dev);
 void netif_carrier_off(struct net_device *dev);
+void netif_carrier_admin_off(struct net_device *dev);
 void netif_carrier_event(struct net_device *dev);
 
 /**
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 5f58dcfe2787..b53e17e6f2ea 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -370,6 +370,7 @@ enum {
 	IFLA_GRO_MAX_SIZE,
 	IFLA_TSO_MAX_SIZE,
 	IFLA_TSO_MAX_SEGS,
+	IFLA_CARRIER_DOWN_LOCAL,
 
 	__IFLA_MAX
 };
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index ac45328607f7..e5cbb798c17d 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1092,6 +1092,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4)  /* IFLA_MAX_MTU */
 	       + rtnl_prop_list_size(dev)
 	       + nla_total_size(MAX_ADDR_LEN) /* IFLA_PERM_ADDRESS */
+	       + nla_total_size(4) /* IFLA_CARRIER_DOWN_LOCAL */
 	       + 0;
 }
 
@@ -1790,6 +1791,11 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 			atomic_read(&dev->carrier_down_count)))
 		goto nla_put_failure;
 
+	if (dev->has_carrier_down_local &&
+	    nla_put_u32(skb, IFLA_CARRIER_DOWN_LOCAL,
+			atomic_read(&dev->carrier_down_local)))
+		goto nla_put_failure;
+
 	if (rtnl_fill_proto_down(skb, dev))
 		goto nla_put_failure;
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index dba0b3e24af5..f2123670bfbf 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -589,22 +589,43 @@ void netif_carrier_on(struct net_device *dev)
 EXPORT_SYMBOL(netif_carrier_on);
 
 /**
- *	netif_carrier_off - clear carrier
- *	@dev: network device
+ * netif_carrier_off() - clear carrier in response to a true link state event
+ * @dev: network device
  *
- * Device has detected loss of carrier.
+ * Clear carrier and stop Tx, use when device has detected loss of carrier.
  */
 void netif_carrier_off(struct net_device *dev)
 {
+	bool admin = test_bit(__LINK_STATE_NOCARRIER_LOCAL, &dev->state);
+
 	if (!test_and_set_bit(__LINK_STATE_NOCARRIER, &dev->state)) {
 		if (dev->reg_state == NETREG_UNINITIALIZED)
 			return;
 		atomic_inc(&dev->carrier_down_count);
+		if (admin)
+			atomic_inc(&dev->carrier_down_local);
 		linkwatch_fire_event(dev);
 	}
 }
 EXPORT_SYMBOL(netif_carrier_off);
 
+/**
+ * netif_carrier_admin_off() - clear carrier to reconfigure the device
+ * @dev: network device
+ *
+ * Force the carrier down, e.g. to perform device reconfiguration,
+ * reset the device after an error etc. This function should be used
+ * instead of netif_carrier_off() any time carrier goes down for reasons
+ * other than an actual link layer connection loss.
+ */
+void netif_carrier_admin_off(struct net_device *dev)
+{
+	netif_carrier_local_changes_start(dev);
+	netif_carrier_off(dev);
+	netif_carrier_local_changes_end(dev);
+}
+EXPORT_SYMBOL(netif_carrier_admin_off);
+
 /**
  *	netif_carrier_event - report carrier state event
  *	@dev: network device
-- 
2.34.3



* Re: [RFC net-next] net: track locally triggered link loss
From: Andrew Lunn @ 2022-05-20 12:24 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, linux, olteanv, hkallweit1, f.fainelli, saeedm, michael.chan

> +/**
> + * netif_carrier_local_changes_start() - enter local link reconfiguration
> + * @dev: network device
> + *
> + * Mark link as unstable due to local administrative actions. This will
> + * cause netif_carrier_off() to behave like netif_carrier_admin_off() until
> + * netif_carrier_local_changes_end() is called.
> + */
> +static inline void netif_carrier_local_changes_start(struct net_device *dev)
> +{
> +	set_bit(__LINK_STATE_NOCARRIER_LOCAL, &dev->state);
> +}
> +
> +static inline void netif_carrier_local_changes_end(struct net_device *dev)
> +{
> +	clear_bit(__LINK_STATE_NOCARRIER_LOCAL, &dev->state);
> +}
> +

Since these don't perform reference counting, maybe a WARN_ON() if the
bit is already set/not set.
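
Something like this, as an untested sketch:

static inline void netif_carrier_local_changes_start(struct net_device *dev)
{
	WARN_ON(test_and_set_bit(__LINK_STATE_NOCARRIER_LOCAL, &dev->state));
}

static inline void netif_carrier_local_changes_end(struct net_device *dev)
{
	WARN_ON(!test_and_clear_bit(__LINK_STATE_NOCARRIER_LOCAL, &dev->state));
}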

>  void netif_carrier_on(struct net_device *dev);
>  void netif_carrier_off(struct net_device *dev);
> +void netif_carrier_admin_off(struct net_device *dev);
>  void netif_carrier_event(struct net_device *dev);

I need some examples of how you see this used. I can see two ways:

At the start of a reconfigure, the driver calls
netif_carrier_local_changes_start() and once it is all over and ready
to do work again, it calls netif_carrier_local_changes_end().

The driver has a few netif_carrier_off() calls changed to
netif_carrier_admin_off(). It is then unclear looking at the code
which of the calls to netif_carrier_on() match the off.

Please could you pick a few drivers, and convert them? Maybe include a
driver which makes use of phylib, which should be doing control of the
carrier based on the actual link status.

	Andrew


* Re: [RFC net-next] net: track locally triggered link loss
From: Jakub Kicinski @ 2022-05-20 18:14 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, linux, olteanv, hkallweit1, f.fainelli, saeedm, michael.chan

On Fri, 20 May 2022 14:24:47 +0200 Andrew Lunn wrote:
> > +/**
> > + * netif_carrier_local_changes_start() - enter local link reconfiguration
> > + * @dev: network device
> > + *
> > + * Mark link as unstable due to local administrative actions. This will
> > + * cause netif_carrier_off() to behave like netif_carrier_admin_off() until
> > + * netif_carrier_local_changes_end() is called.
> > + */
> > +static inline void netif_carrier_local_changes_start(struct net_device *dev)
> > +{
> > +	set_bit(__LINK_STATE_NOCARRIER_LOCAL, &dev->state);
> > +}
> > +
> > +static inline void netif_carrier_local_changes_end(struct net_device *dev)
> > +{
> > +	clear_bit(__LINK_STATE_NOCARRIER_LOCAL, &dev->state);
> > +}
> > +  
> 
> Since these don't perform reference counting, maybe a WARN_ON() if the
> bit is already set/not set.

Good idea.

> >  void netif_carrier_on(struct net_device *dev);
> >  void netif_carrier_off(struct net_device *dev);
> > +void netif_carrier_admin_off(struct net_device *dev);
> >  void netif_carrier_event(struct net_device *dev);  
> 
> I need some examples of how you see this used. I can see two ways:
> 
> At the start of a reconfigure, the driver calls
> netif_carrier_local_changes_start() and once it is all over and ready
> to do work again, it calls netif_carrier_local_changes_end().

I was looking at bnxt because it's relatively standard for DC NICs and
doesn't have 10M lines of code.. then again I could be misinterpreting
the code, I haven't tested this theory:

In bnxt_set_pauseparam() for example the driver will send a request to
the FW which will result in the link coming down and back up with
different settings (e.g. when pause autoneg was changed). Since the
driver doesn't call netif_carrier_off() explicitly as part of sending
the FW message but the link down gets reported thru the usual interrupt
(as if someone yanked the cable out) - we need to wrap the FW call with
the __LINK_STATE_NOCARRIER_LOCAL
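
Roughly something like this (untested sketch, bnxt_hwrm_set_pause() standing
in for whatever the driver actually sends to the FW there):

	netif_carrier_local_changes_start(bp->dev);
	rc = bnxt_hwrm_set_pause(bp);	/* may trigger an async link down/up */
	netif_carrier_local_changes_end(bp->dev);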

> The driver has a few netif_carrier_off() calls changed to
> netif_carrier_admin_off(). It is then unclear looking at the code
> which of the calls to netif_carrier_on() match the off.

Right, for bnxt again the carrier_off in bnxt_tx_disable() would become
an admin_carrier_off, since it's basically part of closing the netdev.
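
Roughly (sketch of the conversion):

	/* bnxt_tx_disable(), on the close / reset paths */
-	netif_carrier_off(bp->dev);
+	netif_carrier_admin_off(bp->dev);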

> Please could you pick a few drivers, and convert them? 

Will do -- unless someone has concerns about this approach or a better
idea.

> Maybe include a driver which makes use of phylib, which should be
> doing control of the carrier based on the actual link status.

For phylib I was thinking of modifying phy_stop()... but I can't
grep out where carrier_off gets called. I'll take a closer look.


* Re: [RFC net-next] net: track locally triggered link loss
From: Andrew Lunn @ 2022-05-20 18:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, linux, olteanv, hkallweit1, f.fainelli, saeedm, michael.chan

> I was looking at bnxt because it's relatively standard for DC NICs and
> doesn't have 10M lines of code.. then again I could be misinterpreting
> the code, I haven't tested this theory:
> 
> In bnxt_set_pauseparam() for example the driver will send a request to
> the FW which will result in the link coming down and back up with
> different settings (e.g. when pause autoneg was changed). Since the
> driver doesn't call netif_carrier_off() explicitly as part of sending
> the FW message but the link down gets reported thru the usual interrupt
> (as if someone yanked the cable out) - we need to wrap the FW call with
> the __LINK_STATE_NOCARRIER_LOCAL

I'm not sure this is a good example. If the PHY is doing an autoneg,
the link really is down for around a second. The link peer will also
see the link go down and come back up. So this seems like a legitimate
time to set the carrier off and then back on again.

> > The driver has a few netif_carrier_off() calls changed to
> > netif_carrier_admin_off(). It is then unclear looking at the code
> > which of the calls to netif_carrier_on() match the off.
> 
> Right, for bnxt again the carrier_off in bnxt_tx_disable() would become
> an admin_carrier_off, since it's basically part of closing the netdev.

> > Maybe include a driver which makes use of phylib, which should be
> > doing control of the carrier based on the actual link status.
> 
> For phylib I was thinking of modifying phy_stop()... but I can't
> grep out where carrier_off gets called. I'll take a closer look.

If the driver is calling phy_stop() the link will go down. So again, i
would say setting the carrier off is correct. If the driver calls
phy_start() an auto neg is likely to happen and 1 second later the
link will come up.

Maybe i'm not understanding what you are trying to count here. If the
MAC driver needs to stop the MAC in order to reallocate buffers with
different MTU, or more rings etc, then i can understand not wanting to
count that as a carrier off, because the carrier does not actually go
off. But if it is in fact marking the carrier off, it sounds like a
MAC driver bug, or a firmware bug.

    Andrew


* Re: [RFC net-next] net: track locally triggered link loss
From: Jakub Kicinski @ 2022-05-20 22:02 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, linux, olteanv, hkallweit1, f.fainelli, saeedm, michael.chan

On Fri, 20 May 2022 20:48:20 +0200 Andrew Lunn wrote:
> > I was looking at bnxt because it's relatively standard for DC NICs and
> > doesn't have 10M lines of code.. then again I could be misinterpreting
> > the code, I haven't tested this theory:
> > 
> > In bnxt_set_pauseparam() for example the driver will send a request to
> > the FW which will result in the link coming down and back up with
> > different settings (e.g. when pause autoneg was changed). Since the
> > driver doesn't call netif_carrier_off() explicitly as part of sending
> > the FW message but the link down gets reported thru the usual interrupt
> > (as if someone yanked the cable out) - we need to wrap the FW call with
> > the __LINK_STATE_NOCARRIER_LOCAL  
> 
> I'm not sure this is a good example. If the PHY is doing an autoneg,
> the link really is down for around a second. The link peer will also
> see the link go down and come back up. So this seems like a legitimate
> time to set the carrier off and then back on again.

In the commit message I differentiated between link flaps type (1), 
(2), (3a) and (3b). What we're talking about here is (1) vs (2), and
from the POV of the remote end (3a) vs (3b).

For a system which wants to monitor link quality on the local end =>
i.e. whether physical hardware has to be replaced - differentiating
between (1) and (2) doesn't really matter, they are both non-events.

I admitted somewhere in the commit message that with just the "locally
triggered" vs "non-locally triggered" we still can't tell (1) from (2)
and therefore won't be able to count true "link went down because of
signal integrity" events. Telling (1) from (2) should be easier with
phylib than with FW-managed PHYs. I left it for future work because:

 - in DC environments PHYs are never managed by Linux AFAIK, sadly,
   and unless I convince vendors to do the conversions I'm likely going
   to get the counting between (1) and (2) wrong, not having access to
   FW or any docs;

 - switches don't flap links much, while NIC reconfig can easily produce
   spikes of 5+ carrier changes in close succession so even without
   telling (1) from (2) we can increase the signal of the monitoring
   significantly

I'm happy to add the (1) vs (2) API tho if it's useful, what I'm
explaining is more why I don't feel it's useful for my case.

> > > The driver has a few netif_carrier_off() calls changed to
> > > netif_carrier_admin_off(). It is then unclear looking at the code
> > > which of the calls to netif_carrier_on() match the off.  
> > 
> > Right, for bnxt again the carrier_off in bnxt_tx_disable() would become
> > an admin_carrier_off, since it's basically part of closing the netdev.  
> 
> > > Maybe include a driver which makes use of phylib, which should be
> > > doing control of the carrier based on the actual link status.  
> > 
> > For phylib I was thinking of modifying phy_stop()... but I can't
> > grep out where carrier_off gets called. I'll take a closer look.  
> 
> If the driver is calling phy_stop() the link will go down. So again, i
> would say setting the carrier off is correct. If the driver calls
> phy_start() an auto neg is likely to happen and 1 second later the
> link will come up.
> 
> Maybe i'm not understanding what you are trying to count here. If the
> MAC driver needs to stop the MAC in order to reallocate buffers with
> different MTU, or more rings etc, then i can understand not wanting to
> count that as a carrier off, because the carrier does not actually go
> off. But if it is in fact marking the carrier off, it sounds like a
> MAC driver bug, or a firmware bug.

Well, either way the carrier is set to off because of all the
netif_carrier_ok() calls throughout the stack. I'm afraid to change the
semantics of that.

What I want to count is _in_addition_ to whether the link went down or
not - whether the link down was due to local administrative action.

Then user space can do:

	remote_flaps = carrier_down - carrier_down_local


* Re: [RFC net-next] net: track locally triggered link loss
From: Saeed Mahameed @ 2022-05-20 22:08 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, netdev, linux, olteanv, hkallweit1, f.fainelli,
	michael.chan

On 20 May 20:48, Andrew Lunn wrote:
>> I was looking at bnxt because it's relatively standard for DC NICs and
>> doesn't have 10M lines of code.. then again I could be misinterpreting
>> the code, I haven't tested this theory:
>>
>> In bnxt_set_pauseparam() for example the driver will send a request to
>> the FW which will result in the link coming down and back up with
>> different settings (e.g. when pause autoneg was changed). Since the
>> driver doesn't call netif_carrier_off() explicitly as part of sending
>> the FW message but the link down gets reported thru the usual interrupt
>> (as if someone yanked the cable out) - we need to wrap the FW call with
>> the __LINK_STATE_NOCARRIER_LOCAL
>
>I'm not sure this is a good example. If the PHY is doing an autoneg,
>the link really is down for around a second. The link peer will also
>see the link go down and come back up. So this seems like a legitimate
>time to set the carrier off and then back on again.
>
+1

also looks very racy, what happens if a real phy link event happens in
between carrier_change_start() and carrier_change_end() ?

I think you shouldn't treat configuration flows where the driver actually
toggles the phy link as a special case, they should be counted as a real
link flap.. because they are.

It's impossible from the driver level to know if a FW link event is
due to configuration causes or external forces !

the new 3 APIs are going to be a heavy burden on drivers to maintain. if
you agree with the above and treat all phy link events as one, then we end
up with one new API drivers has to maintain "net_if_carrier_admin_off()"
which is manageable.

But what about SW netdevices, should all of them change to use the "admin"
version ?   

We should keep current carrier logic as is and add new state/counter
to count real phy link state.

netif_phy_link_down(netdev) {
    set_bit(__LINK_STATE_NOPHYLINK, &dev->state);
    atomic_inc(netdev->phy_link_down);
    netif_carrier_off(netdev);
}

netif_phy_link_up(netdev) {...}

such API should be maintained by real HW device drivers. 

>> > The driver has a few netif_carrier_off() calls changed to
>> > netif_carrier_admin_off(). It is then unclear looking at the code
>> > which of the calls to netif_carrier_on() match the off.
>>
>> Right, for bnxt again the carrier_off in bnxt_tx_disable() would become
>> an admin_carrier_off, since it's basically part of closing the netdev.
>
>> > Maybe include a driver which makes use of phylib, which should be
>> > doing control of the carrier based on the actual link status.
>>
>> For phylib I was thinking of modifying phy_stop()... but I can't
>> grep out where carrier_off gets called. I'll take a closer look.
>
>If the driver is calling phy_stop() the link will go down. So again, i
>would say setting the carrier off is correct. If the driver calls
>phy_start() an auto neg is likely to happen and 1 second later the
>link will come up.
>
>Maybe i'm not understanding what you are trying to count here. If the
>MAC driver needs to stop the MAC in order to reallocate buffers with
>different MTU, or more rings etc, then i can understand not wanting to
>count that as a carrier off, because the carrier does not actually go
>off. But if it is in fact marking the carrier off, it sounds like a
>MAC driver bug, or a firmware bug.
>
>    Andrew


* Re: [RFC net-next] net: track locally triggered link loss
From: Jakub Kicinski @ 2022-05-20 23:03 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Andrew Lunn, netdev, linux, olteanv, hkallweit1, f.fainelli,
	michael.chan

On Fri, 20 May 2022 15:08:32 -0700 Saeed Mahameed wrote:
> >I'm not sure this is a good example. If the PHY is doing an autoneg,
> >the link really is down for around a second. The link peer will also
> >see the link go down and come back up. So this seems like a legitimate
> >time to set the carrier off and then back on again.
> >  
> +1
> 
> also looks very racy, what happens if a real phy link event happens in
> between carrier_change_start() and carrier_change_end() ?

But physical world is racy. What if the link flaps twice while the IRQs
are masked? There will always be cases where we miss or miscount events.

> I think you shouldn't treat configuration flows where the driver actually
> toggles the phy link as a special case, they should be counted as a real
> link flap.. because they are.

That's not the direction of the patch at all - I'm counting locally
generated events, I don't care as much if the link went down or not.

I believe that creating a system which would at scale try to correlate
events between peers is impractical.

> It's impossible from the driver level to know if a FW link event is
> due to configuration causes or external forces !

You mean because FW or another entity (other than local host) asked for
the link to be reset? How is that different from switch taking it down?
Either way the host has lost link due to a non-local event. (3a) or (3b)

> the new 3 APIs are going to be a heavy burden on drivers to maintain. if
> you agree with the above and treat all phy link events as one, then we end
> up with one new API drivers has to maintain "net_if_carrier_admin_off()"
> which is manageable.

I really don't think it's that hard...

> But what about SW netdevices, should all of them change to use the "admin"
> version ?

The new statistic is an opt-in (via netdev->has_carrier_down_local)
I think the same rules as to real devices should apply to SW devices
but I don't intend to implement the changes for any.

> We should keep current carrier logic as is and add new state/counter
> to count real phy link state.
> 
> netif_phy_link_down(netdev) {
>     set_bit(__LINK_STATE_NOPHYLINK, &dev->state);
>     atomic_inc(netdev->phy_link_down);
>     netif_carrier_off(netdev);
> }
> 
> netif_phy_link_up(netdev) {...}
> 
> such API should be maintained by real HW device drivers. 

"phy_link_down" has a ring of "API v2 this time we'll get it right".

Does this differentiate between locally vs non-locally generated events?

PTAL at the categorization in the commit message. There are three
classes of events, we need three counters. local vs non-local and
link went down vs flow was paused by SW are independent and overlapping.
Doesn't matter what the counters are called, translating between them 
is basic math.


* Re: [RFC net-next] net: track locally triggered link loss
From: Saeed Mahameed @ 2022-05-21  5:08 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, netdev, linux, olteanv, hkallweit1, f.fainelli,
	michael.chan

On 20 May 16:03, Jakub Kicinski wrote:
>On Fri, 20 May 2022 15:08:32 -0700 Saeed Mahameed wrote:
>> >I'm not sure this is a good example. If the PHY is doing an autoneg,
>> >the link really is down for around a second. The link peer will also
>> >see the link go down and come back up. So this seems like a legitimate
>> >time to set the carrier off and then back on again.
>> >
>> +1
>>
>> also looks very racy, what happens if a real phy link event happens in
>> between carrier_change_start() and carrier_change_end() ?
>
>But physical world is racy. What if the link flaps twice while the IRQs
>are masked? There will always be cases where we miss or miscount events.
>

This is why we have EQs in modern hw... but i get your point: in the driver
we care only about the final link state, not how many or which events it
took to get there.

>> I think you shouldn't treat configuration flows where the driver actually
>> toggles the phy link as a special case, they should be counted as a real
>> link flap.. because they are.
>
>That's not the direction of the patch at all - I'm counting locally
>generated events, I don't care as much if the link went down or not.
>
>I believe that creating a system which would at scale try to correlate
>events between peers is impractical.
>
>> It's impossible from the driver level to know if a FW link event is
>> due to configuration causes or external forces !
>
>You mean because FW or another entity (other than local host) asked for
>the link to be reset? How is that different from switch taking it down?
>Either way the host has lost link due to a non-local event. (3a) or (3b)
>

I was talking about (1) vs (2): how do you know, when the IRQ/FW event
arrives, what caused it? Maybe I just don't understand how you plan to use the
new API when a re-config brings the link down.

for example: 
driver_reconfig() {
    maybe_close_rings();
    exec_fw_command(); //link will flap, events are triggered asynchronously.
    maybe_open_rings();
}

how do you wrap this with netif_carrier_change_start/end() when the link
events are async ? 

>> the new 3 APIs are going to be a heavy burden on drivers to maintain. if
>> you agree with the above and treat all phy link events as one, then we end
>> up with one new API drivers has to maintain "net_if_carrier_admin_off()"
>> which is manageable.
>
>I really don't think it's that hard...
>
>> But what about SW netdevices, should all of them change to use the "admin"
>> version ?
>
>The new statistic is an opt-in (via netdev->has_carrier_down_local)
>I think the same rules as to real devices should apply to SW devices
>but I don't intend to implement the changes for any.
>

It's an opt-in with an obligation to implement it right.

>> We should keep current carrier logic as is and add new state/counter
>> to count real phy link state.
>>
>> netif_phy_link_down(netdev) {
>>     set_bit(__LINK_STATE_NOPHYLINK, &dev->state);
>>     atomic_inc(netdev->phy_link_down);
> >>     netif_carrier_off(netdev);
>> }
>>
>> netif_phy_link_up(netdev) {...}
>>
>> such API should be maintained by real HW device drivers.
>
>"phy_link_down" has a ring of "API v2 this time we'll get it right".
>

c'mon .. same goes for netif_carrier_local_changes_start/end and
netif_carrier_admin_off().

>Does this differentiate between locally vs non-locally generated events?
>

no

>PTAL at the categorization in the commit message. There are three
>classes of events, we need three counters. local vs non-local and
>link went down vs flow was paused by SW are independent and overlapping.
>Doesn't matter what the counters are called, translating between them
>is basic math.

Ok if you want to go with this I am fine, just let's not add more peppered
confusing single purpose API calls into drivers to get some counters right,
let's implement one multi-purpose infrastructure and use it where it's needed.
It appears that all you need is for the driver to notify the stack when it's
going down for a re-config and when it comes back up, thus put all the
interesting heavy lifting logic in the stack, e.g. link event classification,
freezing/flushing TX queues, flushing offloaded resources (VXLAN, TLS, IPSec),
etc.
Currently all of the above are and will be duplicated in every driver, when
all you need is a generic hint from the driver.

* Re: [RFC net-next] net: track locally triggered link loss
From: Andrew Lunn @ 2022-05-21 14:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, linux, olteanv, hkallweit1, f.fainelli, saeedm, michael.chan

> For a system which wants to monitor link quality on the local end =>
> i.e. whether physical hardware has to be replaced - differentiating
> between (1) and (2) doesn't really matter, they are both non-events.

Maybe data centres should learn something from the automotive world.
It seems like most T1 PHYs have a signal quality value, which is
exposed via netlink in the link info message. And it is non-invasive.

Many PHYs also have counters of receive errors, framing errors
etc. These can be reported via ethtool --phy-stats.

SFPs expose SNR ratios in their module data, transmit and receive
powers etc, via ethtool -m and hwmon.

There is also ethtool --cable-test. It is invasive, in that it
requires the link to go down, but it should tell you about broken
pairs. However, you probably know that already, a monitoring system
which has not noticed the link dropping to 100Mbps so it only uses two
pairs is not worth the money you paid for it.

Now, it seems like very few, if any, firmware-driven Ethernet cards
actually make use of these features. You need cards where Linux is
actually driving the hardware. But these APIs are available for
anybody to use. Don't data centre users have enough purchasing power
they can influence firmware/driver writers to actually use these APIs?
And i think the results would be better than trying to count link
up/down.

     Andrew


* Re: [RFC net-next] net: track locally triggered link loss
From: Jakub Kicinski @ 2022-05-21 18:26 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, linux, olteanv, hkallweit1, f.fainelli, saeedm, michael.chan

On Sat, 21 May 2022 16:23:16 +0200 Andrew Lunn wrote:
> > For a system which wants to monitor link quality on the local end =>
> > i.e. whether physical hardware has to be replaced - differentiating
> > between (1) and (2) doesn't really matter, they are both non-events.  
> 
> Maybe data centres should learn something from the automotive world.
> It seems like most T1 PHYs have a signal quality value, which is
> exposed via netlink in the link info message. And it is non-invasive.

There were attempts at this (also on the PCIe side of the NIC)
but AFAIU there is no general standard of the measurement or the
quality metric so it's hard to generalize.

> Many PHYs also have counters of receive errors, framing errors
> etc. These can be reported via ethtool --phy-stats.

Ack, they are, I've added the APIs already and we use those.
Symbol errors during carrier and FEC corrected/uncorrected blocks.
Basic FCS errors, too.

IDK what the relative false-positive rate of different sources of
information are to be honest. The monitoring team asked me about
the link flaps and the situation in Linux is indeed less than ideal.

> SFPs expose SNR ratios in their module data, transmit and receive
> powers etc, via ethtool -m and hwmon.
> 
> There is also ethtool --cable-test. It is invasive, in that it
> requires the link to go down, but it should tell you about broken
> pairs. However, you probably know that already, a monitoring system
> which has not noticed the link dropping to 100Mbps so it only uses two
> pairs is not worth the money you paid for it.

Last hop in DC is all copper DACs. Not sure there's a standard
--cable-test for DACs :S

> Now, it seems like very few, if any, firmware-driven Ethernet cards
> actually make use of these features. You need cards where Linux is
> actually driving the hardware. But these APIs are available for
> anybody to use. Don't data centre users have enough purchasing power
> they can influence firmware/driver writers to actually use these APIs?
> And i think the results would be better than trying to count link
> up/down.

Let's separate new and old devices.

For new products customer can stipulate requirements and they usually
get implemented. I'd love to add more requirements for signal quality 
and error reporting. It'd need to be based on standards because each
vendor cooking their own units does not scale. Please send pointers 
my way!

Old products are a different ball game, and that's where we care about
basic info like link flaps. Vendors EOL a product and you're lucky to
get bug fixes. Servers live longer and longer and obviously age
correlates with failure rates so we need to monitor those devices.


* Re: [RFC net-next] net: track locally triggered link loss
From: Jakub Kicinski @ 2022-05-21 18:38 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Andrew Lunn, netdev, linux, olteanv, hkallweit1, f.fainelli,
	michael.chan

On Fri, 20 May 2022 22:08:34 -0700 Saeed Mahameed wrote:
> >> It's impossible from the driver level to know if a FW link event is
> >> due to configuration causes or external forces !  
> >
> >You mean because FW or another entity (other than local host) asked for
> >the link to be reset? How is that different from switch taking it down?
> >Either way the host has lost link due to a non-local event. (3a) or (3b)
> >  
> 
> I was talking about (1) vs (2): how do you know, when the IRQ/FW event
> arrives, what caused it? Maybe I just don't understand how you plan to use the
> new API when a re-config brings the link down.
> 
> for example: 
> driver_reconfig() {
>     maybe_close_rings();
>     exec_fw_command(); //link will flap, events are triggered asynchronously.
>     maybe_open_rings();
> }
> 
> how do you wrap this with netif_carrier_change_start/end() when the link
> events are async ? 

Yeah :/ I was worried that we may need to do some queue flushing or
waiting in the _end() to make sure the event has arrived. Remains to 
be seen.

> >> We should keep current carrier logic as is and add new state/counter
> >> to count real phy link state.
> >>
> >> netif_phy_link_down(netdev) {
> >>     set_bit(__LINK_STATE_NOPHYLINK, &dev->state);
> >>     atomic_inc(netdev->phy_link_down);
> >>     netif_carrier_off(netdev);
> >> }
> >>
> >> netif_phy_link_up(netdev) {...}
> >>
> >> such API should be maintained by real HW device drivers.  
> >
> >"phy_link_down" has a ring of "API v2 this time we'll get it right".
> >  
> 
> c'mon .. same goes for netif_carrier_local_changes_start/end and
> netif_carrier_admin_off().

Not exactly - admin_off() is different because it tells you local 
vs non-local. That information is not provided by current APIs.

> >Does this differentiate between locally vs non-locally generated events?
> 
> no
> 
> >PTAL at the categorization in the commit message. There are three
> >classes of events, we need three counters. local vs non-local and
> >link went down vs flow was paused by SW are independent and overlapping.
> >Doesn't matter what the counters are called, translating between them
> >is basic math.  
> 
> Ok if you want to go with this I am fine, just let's not add more peppered
> confusing single purpose API calls into drivers to get some counters right,
> let's implement one multi-purpose infrastructure and use it where it's needed.
> It appears that all you need is for the driver to notify the stack when it's
> going down for a re-config and when it comes back up, thus put all the
> interesting heavy lifting logic in the stack, e.g. link event classification,
> freezing/flushing TX queues, flushing offloaded resources (VXLAN, TLS, IPSec),
> etc.
> Currently all of the above are and will be duplicated in every driver, when
> all you need is a generic hint from the driver.

How would the core know which resources need to be re-programmed?
This seems hard, we can't get caps properly expressed in most places,
let alone pulling up orchestration into the stack.

Maybe I'm unfair but I can't think of many examples of reworks trying
to pull out management logic out of NIC drivers. All I can think of was
my VXLAN rework. I'm happy to start nacking all the shim APIs which do
little to nothing in the core and just punt all requests to the drivers
or FW :/ I think we veered off course tho...

